Azure Batch Guide
Run large-scale parallel computing workloads with Azure Batch: pools, jobs, tasks, auto-scaling formulas, container support, and low-priority VMs.
Prerequisites
- Basic understanding of batch processing concepts
- Azure account with Batch service permissions
Introduction to Azure Batch
Azure Batch is a managed service for running large-scale parallel and high-performance computing (HPC) workloads in Azure. It provisions and manages pools of virtual machines, installs applications on those VMs, schedules jobs across the pool, and handles retries and failures automatically. You define your compute tasks, and Batch takes care of the infrastructure.
Azure Batch is used for workloads that require massive parallelism: financial risk simulations, media rendering (VFX, animation), scientific modeling (weather, genomics, molecular dynamics), image processing, machine learning training, software testing across configurations, and large-scale ETL. These workloads share a common pattern: they can be divided into independent units of work that run in parallel across many VMs.
This guide covers the complete Azure Batch workflow: creating Batch accounts, configuring compute pools with auto-scaling, defining jobs and tasks, using container-based tasks, managing input/output with Azure Storage, monitoring execution, and optimizing costs with low-priority VMs and auto-scaling formulas.
Azure Batch Pricing
Azure Batch itself is free. You pay only for the underlying compute resources (VMs), storage, and networking consumed by your workloads. Batch supports both dedicated (on-demand) and low-priority (Spot) VMs. Low-priority VMs are available at up to 80% discount compared to on-demand pricing but can be preempted when Azure needs the capacity. Use low-priority VMs for fault-tolerant batch workloads to dramatically reduce costs.
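To see what the low-priority discount means in practice, here is a rough cost sketch. The hourly rate is a hypothetical placeholder, not a real Azure price; actual rates vary by region, VM size, and over time:

```python
# Rough cost comparison for a pool of VMs: dedicated vs. low-priority (Spot).
# The hourly rate below is a hypothetical placeholder, not a real Azure price.

DEDICATED_RATE = 0.40      # $/hour per VM (placeholder)
SPOT_DISCOUNT = 0.80       # up to 80% off dedicated pricing

def pool_cost(nodes: int, hours: float, low_priority: bool) -> float:
    """Return the compute cost for a pool; Batch itself adds no charge."""
    rate = DEDICATED_RATE * (1 - SPOT_DISCOUNT) if low_priority else DEDICATED_RATE
    return nodes * hours * rate

dedicated = pool_cost(nodes=20, hours=10, low_priority=False)
spot = pool_cost(nodes=20, hours=10, low_priority=True)
print(f"dedicated: ${dedicated:.2f}, low-priority: ${spot:.2f}")
```

The same 20-node, 10-hour run drops from $80 to $16 of compute at the maximum discount, which is why fault-tolerant workloads should default to low-priority nodes.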
Setting Up a Batch Account
A Batch account is the management entity for pools, jobs, and tasks. Each account has a unique endpoint and access keys for authentication. You can also use Azure AD authentication for managed identity-based access. Associate a Storage account with the Batch account for input/output data, application packages, and task output management.
# Create a resource group
az group create --name rg-batch --location eastus
# Create a storage account for Batch data
az storage account create \
--name batchdatastorage \
--resource-group rg-batch \
--location eastus \
--sku Standard_LRS
# Create the Batch account
az batch account create \
--name productionbatch \
--resource-group rg-batch \
--location eastus \
--storage-account batchdatastorage
# Get the Batch account endpoint and keys
az batch account show \
--name productionbatch \
--resource-group rg-batch \
--query '{Endpoint: accountEndpoint, PoolAllocationMode: poolAllocationMode}' \
--output table
# Set Batch account credentials for CLI
az batch account login \
--name productionbatch \
--resource-group rg-batch \
--shared-key-auth
Compute Pools
A pool is a collection of VMs (nodes) that execute your tasks. When you create a pool, you specify the VM size, the OS image or container image, the target number of nodes, networking configuration, and auto-scaling settings. Pools can run Linux or Windows VMs, and each pool uses a single VM size (but you can create multiple pools with different sizes for different workload types).
# Create a pool with Linux VMs
az batch pool create \
--id render-pool \
--vm-size Standard_D8s_v5 \
--image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
--node-agent-sku-id "batch.node.ubuntu 22.04" \
--target-dedicated-nodes 0 \
--target-low-priority-nodes 20 \
--account-name productionbatch
# Create a pool with auto-scaling
az batch pool create \
--id processing-pool \
--vm-size Standard_D4s_v5 \
--image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
--node-agent-sku-id "batch.node.ubuntu 22.04" \
--target-dedicated-nodes 0 \
--account-name productionbatch
# Enable auto-scaling on the pool
az batch pool autoscale enable \
--pool-id processing-pool \
--auto-scale-formula '
// Scale based on the task backlog
// ($PendingTasks already counts both queued (active) and running tasks)
pendingTaskCount = avg($PendingTasks.GetSample(1));
// Each node handles 4 tasks
targetNodes = pendingTaskCount / 4;
targetDedicated = min(100, targetNodes);
targetLowPriority = min(200, max(0, targetNodes - targetDedicated));
// Scale to zero when idle
$TargetDedicatedNodes = (pendingTaskCount == 0) ? 0 : targetDedicated;
$TargetLowPriorityNodes = (pendingTaskCount == 0) ? 0 : targetLowPriority;
$NodeDeallocationOption = taskcompletion;
' \
--auto-scale-evaluation-interval PT5M \
--account-name productionbatch
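The scaling arithmetic is easy to get wrong, so it is worth sanity-checking locally before deploying. This Python sketch mirrors the scaling logic above (fill dedicated capacity first, overflow to low-priority, scale to zero when idle); it is a simulation of the logic, not the Batch formula evaluation engine, and it rounds node counts up so a small backlog still gets a node:

```python
# Simulate the pool autoscale logic: size the pool from the task backlog,
# fill dedicated capacity first, overflow to low-priority, scale to zero when idle.

def evaluate(pending_tasks: int, tasks_per_node: int = 4,
             max_dedicated: int = 100, max_low_priority: int = 200):
    if pending_tasks == 0:
        return 0, 0                                  # scale to zero when idle
    needed = -(-pending_tasks // tasks_per_node)     # ceiling division
    dedicated = min(max_dedicated, needed)
    low_priority = min(max_low_priority, max(0, needed - dedicated))
    return dedicated, low_priority

print(evaluate(0))      # (0, 0)   -> idle pool costs nothing
print(evaluate(37))     # (10, 0)  -> 37 tasks need 10 nodes at 4 tasks/node
print(evaluate(1000))   # (100, 150) -> dedicated cap hit, overflow to Spot
```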
# Create a pool with container support
# (containerConfiguration is not exposed as a CLI flag; supply a JSON pool spec)
cat > container-pool.json <<'EOF'
{
  "id": "container-pool",
  "vmSize": "Standard_D4s_v5",
  "targetDedicatedNodes": 5,
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoft-azure-batch",
      "offer": "ubuntu-server-container",
      "sku": "20-04-lts",
      "version": "latest"
    },
    "nodeAgentSkuId": "batch.node.ubuntu 20.04",
    "containerConfiguration": {
      "type": "dockerCompatible",
      "containerImageNames": [
        "myregistry.azurecr.io/data-processor:latest",
        "myregistry.azurecr.io/report-generator:latest"
      ],
      "containerRegistries": [
        {
          "registryServer": "myregistry.azurecr.io",
          "identityReference": {
            "resourceId": "/subscriptions/<sub>/resourceGroups/rg-batch/providers/Microsoft.ManagedIdentity/userAssignedIdentities/batch-identity"
          }
        }
      ]
    }
  }
}
EOF
az batch pool create \
--json-file container-pool.json \
--account-name productionbatch
# List pools and their status
az batch pool list \
--query '[].{ID: id, VMSize: vmSize, Dedicated: targetDedicatedNodes, LowPriority: targetLowPriorityNodes, State: allocationState}' \
--output table \
--account-name productionbatch
VM Size Selection
Choose VM sizes based on your workload characteristics. Use D-series for general-purpose compute, F-series for CPU-intensive workloads (rendering, simulations), E-series for memory-intensive workloads (large datasets, caching), NC/ND-series for GPU workloads (ML training, video processing), and HB/HC-series for HPC workloads (tightly coupled MPI). Always benchmark with your actual workload because price-performance varies significantly across VM families.
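As a quick reference, the selection guidance above can be captured in a small lookup. This is illustrative only: series availability varies by region, and the right fit always comes from benchmarking the real workload.

```python
# Map workload characteristics to the Azure VM series suggested above.
# Illustrative guidance only -- always benchmark with the actual workload.

VM_SERIES = {
    "general-purpose": "D-series",
    "cpu-intensive": "F-series",      # rendering, simulations
    "memory-intensive": "E-series",   # large datasets, caching
    "gpu": "NC/ND-series",            # ML training, video processing
    "hpc-mpi": "HB/HC-series",        # tightly coupled MPI
}

def suggest_series(workload: str) -> str:
    return VM_SERIES.get(workload, "D-series (default; benchmark to confirm)")

print(suggest_series("gpu"))   # NC/ND-series
```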
Start Tasks and Application Packages
A start task runs on each node when it joins the pool. Use start tasks to install software, download data, configure the environment, or set up prerequisites before your tasks run. Application packages provide a versioned mechanism to deploy application binaries to nodes without using start tasks.
# Create a pool with a start task
az batch pool create \
--id etl-pool \
--vm-size Standard_D4s_v5 \
--image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
--node-agent-sku-id "batch.node.ubuntu 22.04" \
--target-dedicated-nodes 5 \
--start-task-command-line "/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pandas numpy sqlalchemy azure-storage-blob'" \
--start-task-wait-for-success \
--start-task-resource-files "setup.sh=https://batchdatastorage.blob.core.windows.net/scripts/setup.sh" \
--account-name productionbatch
# Upload an application package
az batch application package create \
--application-name data-processor \
--name productionbatch \
--resource-group rg-batch \
--version-name 2.0.0 \
--package-file ./data-processor-2.0.0.zip
# Set the default version
az batch application set \
--application-name data-processor \
--name productionbatch \
--resource-group rg-batch \
--default-version 2.0.0
Jobs and Tasks
A job is a logical grouping of tasks that run on a specific pool. Tasks are the individual units of work: each task runs a command line (or container) on a single node. When you submit a job, Batch schedules its tasks across available nodes, handles retries for failures, and tracks completion status.
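For the "divide into independent units" pattern to work, each task must be able to select its share of the input deterministically from nothing more than its chunk index. A sketch of that selection logic, as a hypothetical `process.py` might implement it (the modular-slicing scheme is an assumption for illustration):

```python
# How a single task can select its chunk of the input deterministically:
# task i of N takes every N-th record, so the N tasks cover all records
# exactly once with no coordination between nodes.

def select_chunk(records: list, chunk: int, total: int) -> list:
    """Records assigned to task `chunk` (1-based) out of `total` tasks."""
    return [r for idx, r in enumerate(records) if idx % total == chunk - 1]

records = list(range(10))
parts = [select_chunk(records, i, total=5) for i in range(1, 6)]
print(parts)   # [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```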
# Create a job
az batch job create \
--id daily-etl-2026-03-14 \
--pool-id etl-pool \
--account-name productionbatch
# Add tasks to the job; outputFiles and other complex settings need a JSON task spec
# (the output containerUrl needs a write SAS token or a managed identity in practice)
for i in $(seq 1 50); do
cat > "task-$i.json" <<EOF
{
  "id": "process-chunk-$i",
  "commandLine": "/bin/bash -c 'python3 process.py --chunk $i --total 50 --date 2026-03-14'",
  "resourceFiles": [{"httpUrl": "https://batchdatastorage.blob.core.windows.net/scripts/process.py", "filePath": "process.py"}],
  "outputFiles": [{"filePattern": "output/*.csv",
    "destination": {"container": {"containerUrl": "https://batchdatastorage.blob.core.windows.net/output", "path": "2026-03-14/chunk-$i"}},
    "uploadOptions": {"uploadCondition": "taskCompletion"}}]
}
EOF
az batch task create --job-id daily-etl-2026-03-14 --json-file "task-$i.json" --account-name productionbatch
done
# Add a task with container execution (containerSettings go in the JSON task spec)
cat > container-task.json <<'EOF'
{
  "id": "container-task-1",
  "commandLine": "python /app/process.py --input /data/chunk1.csv",
  "containerSettings": {
    "imageName": "myregistry.azurecr.io/data-processor:latest",
    "containerRunOptions": "--rm -v /mnt/batch/tasks/shared:/data"
  }
}
EOF
az batch task create --job-id daily-etl-2026-03-14 --json-file container-task.json --account-name productionbatch
# Monitor task completion
az batch task list \
--job-id daily-etl-2026-03-14 \
--query '[].{ID: id, State: state, ExitCode: executionInfo.exitCode, StartTime: executionInfo.startTime, EndTime: executionInfo.endTime}' \
--output table \
--account-name productionbatch
# View task output (stdout/stderr)
az batch task file download \
--job-id daily-etl-2026-03-14 \
--task-id process-chunk-1 \
--file-path stdout.txt \
--destination ./task-output/stdout.txt \
--account-name productionbatch
Task Dependencies and Job Manager
Tasks can depend on other tasks, creating execution graphs where downstream tasks wait for upstream tasks to complete. Job Manager tasks are special tasks that run first and can programmatically create additional tasks, implementing dynamic workload patterns where the number of tasks is not known in advance.
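The preprocess → process-N → aggregate pattern described below forms a small dependency graph, and the scheduling constraint it encodes can be visualized locally. This sketch resolves the graph into execution waves; it is a simulation of the ordering constraint, not the Batch scheduler itself:

```python
# Resolve a task dependency graph into execution waves: a task becomes
# ready only after everything it depends on has completed, which is the
# guarantee Batch task dependencies enforce.

from graphlib import TopologicalSorter

deps = {f"process-{i}": {"preprocess"} for i in range(1, 11)}
deps["aggregate"] = {f"process-{i}" for i in range(1, 11)}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = list(ts.get_ready())   # tasks whose dependencies are all done
    waves.append(sorted(ready))
    ts.done(*ready)

print(len(waves))   # 3 waves: preprocess, then process-1..10, then aggregate
print(waves[0], waves[-1])
```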
# Create tasks with dependencies
# (task dependencies require the job to opt in at creation time)
az batch job create \
--id pipeline-job \
--pool-id processing-pool \
--uses-task-dependencies \
--account-name productionbatch
# First, create the preprocessing task
az batch task create \
--job-id pipeline-job \
--task-id preprocess \
--command-line "/bin/bash -c 'python3 preprocess.py'" \
--account-name productionbatch
# Create processing tasks that depend on preprocessing
# (dependsOn goes in a JSON task spec)
for i in $(seq 1 10); do
cat > "process-$i.json" <<EOF
{
  "id": "process-$i",
  "commandLine": "/bin/bash -c 'python3 process.py --partition $i'",
  "dependsOn": {"taskIds": ["preprocess"]}
}
EOF
az batch task create --job-id pipeline-job --json-file "process-$i.json" --account-name productionbatch
done
# Create an aggregation task that depends on all processing tasks
cat > aggregate.json <<'EOF'
{
  "id": "aggregate",
  "commandLine": "/bin/bash -c 'python3 aggregate.py'",
  "dependsOn": {"taskIds": ["process-1","process-2","process-3","process-4","process-5","process-6","process-7","process-8","process-9","process-10"]}
}
EOF
az batch task create --job-id pipeline-job --json-file aggregate.json --account-name productionbatch
# Create a job with a Job Manager task
az batch job create \
--id dynamic-job \
--pool-id processing-pool \
--job-manager-task-command-line "/bin/bash -c 'python3 job_manager.py'" \
--job-manager-task-id "job-manager" \
--job-manager-task-resource-files "job_manager.py=https://batchdatastorage.blob.core.windows.net/scripts/job_manager.py" \
--account-name productionbatch
Task Retry Strategy
Configure retry policies for tasks that may fail due to transient issues (node preemption for low-priority VMs, temporary network errors, resource contention). Set maxTaskRetryCount to 3-5 for fault-tolerant workloads. When using low-priority VMs, tasks may be preempted and re-queued automatically. Design your tasks to be idempotent so retries produce correct results. Use output files with uploadCondition: taskCompletion to ensure outputs are captured even for failed tasks (for debugging).
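Idempotency usually comes down to making the task's output atomic, so a preempted-and-retried run never leaves a half-written result behind. A minimal sketch of the write-to-temp-then-rename pattern, using the local filesystem as a stand-in for task output:

```python
# Idempotent task output: write to a temp file, then atomically rename.
# If the task is preempted mid-write, the retry starts clean -- the final
# filename only ever appears fully written. (os.replace is atomic on POSIX.)

import os
import tempfile

def write_result_atomically(path: str, data: str) -> None:
    dir_name = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)   # atomic: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)          # discard the partial temp file on failure
        raise

write_result_atomically("result.csv", "chunk,rows\n1,5000\n")
# Re-running (simulating a Batch retry) produces the same final state:
write_result_atomically("result.csv", "chunk,rows\n1,5000\n")
print(open("result.csv").read().splitlines()[0])   # chunk,rows
```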
Data Management with Azure Storage
Most batch workloads need to read input data and write output data. Azure Batch integrates with Azure Blob Storage for input file staging, output file collection, and application package distribution. Resource files download data to nodes before tasks run, and output files upload results after tasks complete.
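When a job has many input blobs, it is convenient to generate the resourceFiles list programmatically before submitting tasks. A sketch of that assembly step; the container URL and SAS token are placeholders, not live values:

```python
# Build a resourceFiles list for a Batch task from a set of input blobs.
# The container URL and SAS token are placeholders, not live values.

import json

CONTAINER_URL = "https://batchdatastorage.blob.core.windows.net/input-data"
SAS_TOKEN = "<sas-token>"   # placeholder -- generate a read+list SAS in practice

def resource_files(blob_names):
    """One resource-file entry per blob: download URL plus node-local path."""
    return [
        {"httpUrl": f"{CONTAINER_URL}/{name}?{SAS_TOKEN}", "filePath": name}
        for name in blob_names
    ]

files = resource_files(["data.csv", "transform.py"])
print(json.dumps(files, indent=2))
```

The resulting list drops straight into the `resourceFiles` field of a JSON task spec.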
# Upload input data to Azure Storage
az storage container create \
--name input-data \
--account-name batchdatastorage
az storage blob upload-batch \
--source ./input-files/ \
--destination input-data \
--account-name batchdatastorage
# Generate a SAS token for Batch to access the storage
# (GNU date shown; on macOS use: date -u -v+24H '+%Y-%m-%dT%H:%M:%SZ')
SAS_TOKEN=$(az storage container generate-sas \
--name input-data \
--account-name batchdatastorage \
--permissions rl \
--expiry "$(date -u -d '+24 hours' '+%Y-%m-%dT%H:%M:%SZ')" \
--output tsv)
# Create a task with resource files (downloaded before the task runs)
# and output files (uploaded after it completes) via a JSON task spec
cat > transform-task.json <<EOF
{
  "id": "transform-1",
  "commandLine": "/bin/bash -c 'python3 transform.py --input data.csv --output results.csv'",
  "resourceFiles": [
    {"httpUrl": "https://batchdatastorage.blob.core.windows.net/input-data/data.csv?$SAS_TOKEN", "filePath": "data.csv"},
    {"httpUrl": "https://batchdatastorage.blob.core.windows.net/scripts/transform.py?$SAS_TOKEN", "filePath": "transform.py"}
  ],
  "outputFiles": [
    {"filePattern": "results.csv", "destination": {"container": {"containerUrl": "https://batchdatastorage.blob.core.windows.net/output-data", "path": "results/transform-1"}}, "uploadOptions": {"uploadCondition": "taskSuccess"}},
    {"filePattern": "stderr.txt", "destination": {"container": {"containerUrl": "https://batchdatastorage.blob.core.windows.net/logs", "path": "errors/transform-1"}}, "uploadOptions": {"uploadCondition": "taskFailure"}}
  ]
}
EOF
az batch task create \
--job-id processing-job \
--json-file transform-task.json \
--account-name productionbatch
Monitoring and Troubleshooting
Monitor Azure Batch workloads through the Azure portal Batch Explorer, Azure Monitor metrics, and the Batch APIs. Key metrics include task completion rate, node utilization, task failure rate, and pool scaling behavior.
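The task-count numbers the CLI returns translate directly into the completion and failure rates worth alerting on. A sketch of the arithmetic, using example counts shaped like the object `az batch job task-counts show` returns:

```python
# Derive completion and failure rates from Batch job task counts.
# The counts are example values shaped like the taskCounts object
# returned by `az batch job task-counts show`.

counts = {"active": 5, "running": 10, "succeeded": 80, "failed": 5}

total = sum(counts.values())
completed = counts["succeeded"] + counts["failed"]
completion_rate = completed / total
failure_rate = counts["failed"] / completed if completed else 0.0

print(f"completion: {completion_rate:.0%}, failure: {failure_rate:.1%}")
# completion: 85%, failure: 5.9%
```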
# Get job summary statistics
az batch job task-counts show \
--job-id daily-etl-2026-03-14 \
--account-name productionbatch
# Get pool usage metrics
az batch pool usage-metrics list \
--start-time "2026-03-14T00:00:00Z" \
--end-time "2026-03-14T23:59:59Z" \
--account-name productionbatch
# Get node status in a pool
az batch node list \
--pool-id processing-pool \
--query '[].{ID: id, State: state, RunningTasks: runningTasksCount, IP: ipAddress}' \
--output table \
--account-name productionbatch
# View task failure details
az batch task show \
--job-id daily-etl-2026-03-14 \
--task-id process-chunk-7 \
--query '{State: state, ExitCode: executionInfo.exitCode, FailureInfo: executionInfo.failureInfo, RetryCount: executionInfo.retryCount}' \
--account-name productionbatch
# Reactivate a failed task (retry)
az batch task reactivate \
--job-id daily-etl-2026-03-14 \
--task-id process-chunk-7 \
--account-name productionbatch
Cost Optimization Best Practices
Batch workloads can be expensive if pools are not properly managed. Follow these practices to minimize costs while maintaining performance and reliability.
| Strategy | Savings | Implementation |
|---|---|---|
| Low-priority (Spot) VMs | Up to 80% | Set targetLowPriorityNodes instead of dedicated |
| Auto-scaling to zero | 100% when idle | Auto-scale formula scales to 0 when no tasks |
| Right-size VM selection | 20-50% | Benchmark and choose smallest sufficient VM |
| Job scheduling | Variable | Run during off-peak hours for better Spot availability |
| Efficient task granularity | 10-30% | Balance task size (not too small, not too large) |
| Pool reuse | Startup time savings | Keep pools running for recurring jobs |
Batch Explorer Tool
Use the open-source Batch Explorer desktop application for visual monitoring and management of Azure Batch accounts. It provides real-time views of pool utilization, task execution, heat maps of node activity, and the ability to SSH into nodes for debugging. Download it from github.com/Azure/BatchExplorer. For automated monitoring, use Azure Monitor alerts on Batch metrics like PoolNodeCount, TaskCompleteEvent, and JobDeleteCompleteEvent.
Azure Batch provides a powerful, cost-effective platform for parallel and HPC workloads. Define your compute pools with appropriate VM sizes and auto-scaling formulas, organize work into jobs and tasks, use Azure Storage for data management, and leverage low-priority VMs for significant cost savings. For complex workflows, combine Batch with Azure Data Factory or Logic Apps for end-to-end pipeline orchestration.
Key Takeaways
1. Azure Batch is free; you pay only for underlying VM compute, storage, and networking.
2. Low-priority (Spot) VMs provide up to 80% cost savings for fault-tolerant workloads.
3. Auto-scaling formulas dynamically adjust pool size based on pending tasks and queue depth.
4. Container-based tasks run Docker containers with pre-pulled images from Azure Container Registry.
Frequently Asked Questions
How does Azure Batch differ from Azure Functions?
Azure Functions is event-driven serverless compute for short-lived, lightweight functions; Azure Batch provisions pools of VMs for long-running, large-scale parallel and HPC workloads that need control over VM size, OS image, and installed software.
Can Azure Batch use GPUs?
Yes. Create pools with GPU-enabled VM sizes such as the NC or ND series for ML training, rendering, and video processing workloads.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.