
Azure Batch Guide

Run large-scale parallel computing with Azure Batch: pools, jobs, tasks, auto-scaling formulas, container support, and low-priority VMs.

CloudToolStack Team · 22 min read · Published Mar 14, 2026

Prerequisites

  • Basic understanding of batch processing concepts
  • Azure account with Batch service permissions

Introduction to Azure Batch

Azure Batch is a managed service for running large-scale parallel and high-performance computing (HPC) workloads in Azure. It provisions and manages pools of virtual machines, installs applications on those VMs, schedules jobs across the pool, and handles retries and failures automatically. You define your compute tasks, and Batch takes care of the infrastructure.

Azure Batch is used for workloads that require massive parallelism: financial risk simulations, media rendering (VFX, animation), scientific modeling (weather, genomics, molecular dynamics), image processing, machine learning training, software testing across configurations, and large-scale ETL. These workloads share a common pattern: they can be divided into independent units of work that run in parallel across many VMs.
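The "divide into independent units" pattern above is exactly what you do before submitting tasks: partition the input set so each task gets one chunk. A minimal local sketch of that fan-out step (the item list and chunk count are hypothetical):

```python
# Sketch of the fan-out pattern Batch automates: split a workload into
# independent chunks, each of which becomes one Batch task.
def make_chunks(items, num_chunks):
    """Partition items into num_chunks contiguous, near-equal chunks."""
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

work_items = list(range(103))          # e.g. 103 input files
tasks = make_chunks(work_items, 10)    # 10 independent task inputs
assert len(tasks) == 10
assert sum(len(c) for c in tasks) == 103   # nothing lost, nothing duplicated
```

Each chunk then maps to one task in a job, as shown in the Jobs and Tasks section later.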

This guide covers the complete Azure Batch workflow: creating Batch accounts, configuring compute pools with auto-scaling, defining jobs and tasks, using container-based tasks, managing input/output with Azure Storage, monitoring execution, and optimizing costs with low-priority VMs and auto-scaling formulas.

Azure Batch Pricing

Azure Batch itself is free. You pay only for the underlying compute resources (VMs), storage, and networking consumed by your workloads. Batch supports both dedicated (on-demand) and low-priority (Spot) VMs. Low-priority VMs are available at up to 80% discount compared to on-demand pricing but can be preempted when Azure needs the capacity. Use low-priority VMs for fault-tolerant batch workloads to dramatically reduce costs.

Setting Up a Batch Account

A Batch account is the management entity for pools, jobs, and tasks. Each account has a unique endpoint and access keys for authentication. You can also use Azure AD authentication for managed identity-based access. Associate a Storage account with the Batch account for input/output data, application packages, and task output management.

bash
# Create a resource group
az group create --name rg-batch --location eastus

# Create a storage account for Batch data
az storage account create \
  --name batchdatastorage \
  --resource-group rg-batch \
  --location eastus \
  --sku Standard_LRS

# Create the Batch account
az batch account create \
  --name productionbatch \
  --resource-group rg-batch \
  --location eastus \
  --storage-account batchdatastorage

# Get the Batch account endpoint and keys
az batch account show \
  --name productionbatch \
  --resource-group rg-batch \
  --query '{Endpoint: accountEndpoint, PoolAllocationMode: poolAllocationMode}' \
  --output table

# Set Batch account credentials for CLI
az batch account login \
  --name productionbatch \
  --resource-group rg-batch \
  --shared-key-auth

Compute Pools

A pool is a collection of VMs (nodes) that execute your tasks. When you create a pool, you specify the VM size, the OS image or container image, the target number of nodes, networking configuration, and auto-scaling settings. Pools can run Linux or Windows VMs, and each pool uses a single VM size (but you can create multiple pools with different sizes for different workload types).

bash
# Create a pool with Linux VMs
az batch pool create \
  --id render-pool \
  --vm-size Standard_D8s_v5 \
  --image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
  --node-agent-sku-id "batch.node.ubuntu 22.04" \
  --target-dedicated-nodes 0 \
  --target-low-priority-nodes 20 \
  --account-name productionbatch

# Create a pool with auto-scaling
az batch pool create \
  --id processing-pool \
  --vm-size Standard_D4s_v5 \
  --image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
  --node-agent-sku-id "batch.node.ubuntu 22.04" \
  --target-dedicated-nodes 0 \
  --account-name productionbatch

# Enable auto-scaling on the pool
az batch pool autoscale enable \
  --pool-id processing-pool \
  --auto-scale-formula '
    // Scale based on pending tasks. $PendingTasks already counts both
    // queued ($ActiveTasks) and running tasks, so summing the two
    // would double-count active tasks.
    totalTasks = max(0, $PendingTasks.GetSample(1));

    // Each node handles 4 tasks
    targetDedicated = min(100, totalTasks / 4);
    targetLowPriority = min(200, max(0, (totalTasks / 4) - targetDedicated));

    // Scale to zero when idle
    $TargetDedicatedNodes = (totalTasks == 0) ? 0 : targetDedicated;
    $TargetLowPriorityNodes = (totalTasks == 0) ? 0 : targetLowPriority;
    $NodeDeallocationOption = taskcompletion;
  ' \
  --auto-scale-evaluation-interval PT5M \
  --account-name productionbatch

# Create a pool with container support
# (if your CLI version lacks --container-configuration, supply the container
#  configuration via --json-file with a full pool JSON definition instead)
az batch pool create \
  --id container-pool \
  --vm-size Standard_D4s_v5 \
  --image microsoft-azure-batch:ubuntu-server-container:20-04-lts:latest \
  --node-agent-sku-id "batch.node.ubuntu 20.04" \
  --target-dedicated-nodes 5 \
  --container-configuration '{
    "type": "dockerCompatible",
    "containerImageNames": [
      "myregistry.azurecr.io/data-processor:latest",
      "myregistry.azurecr.io/report-generator:latest"
    ],
    "containerRegistries": [
      {
        "registryServer": "myregistry.azurecr.io",
        "identityReference": {
          "resourceId": "/subscriptions/<sub>/resourceGroups/rg-batch/providers/Microsoft.ManagedIdentity/userAssignedIdentities/batch-identity"
        }
      }
    ]
  }' \
  --account-name productionbatch

# List pools and their status
az batch pool list \
  --query '[].{ID: id, VMSize: vmSize, Dedicated: targetDedicatedNodes, LowPriority: targetLowPriorityNodes, State: allocationState}' \
  --output table \
  --account-name productionbatch
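Autoscale formulas are evaluated remotely on an interval, which makes them slow to debug. One way to catch arithmetic mistakes early is to mirror the formula's scaling logic locally before deploying it. The sketch below approximates the formula above (it uses `ceil` where the formula uses integer division, and the sample task counts are hypothetical):

```python
# Local sanity-check of the autoscale arithmetic: tasks -> node targets,
# with the same per-node fan-out and caps as the formula above.
import math

TASKS_PER_NODE = 4
MAX_DEDICATED, MAX_LOW_PRIORITY = 100, 200

def scale_targets(total_tasks):
    """Return (dedicated, low_priority) node targets for a task count."""
    if total_tasks == 0:
        return 0, 0                      # scale to zero when idle
    nodes_needed = math.ceil(total_tasks / TASKS_PER_NODE)
    dedicated = min(MAX_DEDICATED, nodes_needed)
    low_priority = min(MAX_LOW_PRIORITY, max(0, nodes_needed - dedicated))
    return dedicated, low_priority

assert scale_targets(0) == (0, 0)         # idle pool scales to zero
assert scale_targets(40) == (10, 0)       # 40 tasks -> 10 nodes
assert scale_targets(1000) == (100, 150)  # overflow spills to low-priority
```

A table of expected task counts and the targets this produces is a cheap regression test whenever you edit the formula.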

VM Size Selection

Choose VM sizes based on your workload characteristics. Use D-series for general-purpose compute, F-series for CPU-intensive workloads (rendering, simulations), E-series for memory-intensive workloads (large datasets, caching), NC/ND-series for GPU workloads (ML training, video processing), and HB/HC-series for HPC workloads (tightly coupled MPI). Always benchmark with your actual workload because price-performance varies significantly across VM families.
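The benchmarking advice above reduces to one number: cost per unit of work, not cost per hour. A hedged illustration of that comparison (every price and throughput figure here is made up for the example; substitute your own benchmark results and current Azure pricing):

```python
# Compare VM sizes by cost per 1,000 tasks, given a measured throughput.
# All prices and throughput numbers below are illustrative, not real quotes.
benchmarks = {
    # vm_size: (hourly_price_usd, tasks_completed_per_hour)
    "Standard_D4s_v5": (0.192, 40),
    "Standard_F8s_v2": (0.338, 90),
    "Standard_E4s_v5": (0.252, 35),
}

def cost_per_1k_tasks(price_per_hour, tasks_per_hour):
    return 1000 * price_per_hour / tasks_per_hour

ranked = sorted(benchmarks.items(), key=lambda kv: cost_per_1k_tasks(*kv[1]))
best_size = ranked[0][0]
# For these made-up numbers the F-series wins despite its higher hourly rate.
assert best_size == "Standard_F8s_v2"
```

The cheapest VM per hour is often not the cheapest per task, which is why the guide recommends benchmarking with your actual workload.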

Start Tasks and Application Packages

A start task runs on each node when it joins the pool. Use start tasks to install software, download data, configure the environment, or set up prerequisites before your tasks run. Application packages provide a versioned mechanism to deploy application binaries to nodes without using start tasks.

bash
# Create a pool with a start task
# (package installs need root: the start task must run with an elevated
#  user identity, which may require defining the pool via --json-file)
az batch pool create \
  --id etl-pool \
  --vm-size Standard_D4s_v5 \
  --image canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
  --node-agent-sku-id "batch.node.ubuntu 22.04" \
  --target-dedicated-nodes 5 \
  --start-task-command-line "/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pandas numpy sqlalchemy azure-storage-blob'" \
  --start-task-wait-for-success \
  --start-task-resource-files '[{"httpUrl": "https://batchdatastorage.blob.core.windows.net/scripts/setup.sh", "filePath": "setup.sh"}]' \
  --account-name productionbatch

# Upload an application package
az batch application package create \
  --application-name data-processor \
  --name productionbatch \
  --resource-group rg-batch \
  --version 2.0.0 \
  --package-file ./data-processor-2.0.0.zip

# Set the default version
az batch application set \
  --application-name data-processor \
  --name productionbatch \
  --resource-group rg-batch \
  --default-version 2.0.0

Jobs and Tasks

A job is a logical grouping of tasks that run on a specific pool. Tasks are the individual units of work: each task runs a command line (or container) on a single node. When you submit a job, Batch schedules its tasks across available nodes, handles retries for failures, and tracks completion status.

bash
# Create a job
az batch job create \
  --id daily-etl-2026-03-14 \
  --pool-id etl-pool \
  --account-name productionbatch

# Add tasks to the job
for i in $(seq 1 50); do
  az batch task create \
    --job-id daily-etl-2026-03-14 \
    --task-id "process-chunk-$i" \
    --command-line "/bin/bash -c 'python3 /mnt/batch/tasks/shared/process.py --chunk $i --total 50 --date 2026-03-14'" \
    --resource-files "[{\"httpUrl\": \"https://batchdatastorage.blob.core.windows.net/scripts/process.py\", \"filePath\": \"process.py\"}]" \
    --output-files "[{\"filePattern\": \"output/*.csv\", \"destination\": {\"container\": {\"containerUrl\": \"https://batchdatastorage.blob.core.windows.net/output\", \"path\": \"2026-03-14/chunk-$i\"}}, \"uploadOptions\": {\"uploadCondition\": \"taskCompletion\"}}]" \
    --account-name productionbatch
done

# Add a task with container execution
az batch task create \
  --job-id daily-etl-2026-03-14 \
  --task-id "container-task-1" \
  --command-line "python /app/process.py --input /data/chunk1.csv" \
  --container-settings '{
    "imageName": "myregistry.azurecr.io/data-processor:latest",
    "containerRunOptions": "--rm -v /mnt/batch/tasks/shared:/data"
  }' \
  --account-name productionbatch

# Monitor task completion
az batch task list \
  --job-id daily-etl-2026-03-14 \
  --query '[].{ID: id, State: state, ExitCode: executionInfo.exitCode, StartTime: executionInfo.startTime, EndTime: executionInfo.endTime}' \
  --output table \
  --account-name productionbatch

# View task output (stdout/stderr)
az batch task file download \
  --job-id daily-etl-2026-03-14 \
  --task-id process-chunk-1 \
  --file-path stdout.txt \
  --destination ./task-output/stdout.txt \
  --account-name productionbatch

Task Dependencies and Job Manager

Tasks can depend on other tasks, creating execution graphs where downstream tasks wait for upstream tasks to complete. Job Manager tasks are special tasks that run first and can programmatically create additional tasks, implementing dynamic workload patterns where the number of tasks is not known in advance.

bash
# Create tasks with dependencies
# (the job must be created with dependencies enabled:
#  az batch job create --id pipeline-job --pool-id processing-pool \
#    --uses-task-dependencies --account-name productionbatch)
# First, create the preprocessing task
az batch task create \
  --job-id pipeline-job \
  --task-id preprocess \
  --command-line "/bin/bash -c 'python3 preprocess.py'" \
  --account-name productionbatch

# Create processing tasks that depend on preprocessing
for i in $(seq 1 10); do
  az batch task create \
    --job-id pipeline-job \
    --task-id "process-$i" \
    --command-line "/bin/bash -c 'python3 process.py --partition $i'" \
    --depends-on-task-ids preprocess \
    --account-name productionbatch
done

# Create an aggregation task that depends on all processing tasks
az batch task create \
  --job-id pipeline-job \
  --task-id aggregate \
  --command-line "/bin/bash -c 'python3 aggregate.py'" \
  --depends-on-task-ids $(seq -f "process-%g" 1 10) \
  --account-name productionbatch

# Create a job with a Job Manager task
az batch job create \
  --id dynamic-job \
  --pool-id processing-pool \
  --job-manager-task-command-line "python3 /mnt/batch/tasks/shared/job_manager.py" \
  --job-manager-task-id "job-manager" \
  --job-manager-task-resource-files '[{"httpUrl": "https://batchdatastorage.blob.core.windows.net/scripts/job_manager.py", "filePath": "job_manager.py"}]' \
  --account-name productionbatch

Task Retry Strategy

Configure retry policies for tasks that may fail due to transient issues (node preemption for low-priority VMs, temporary network errors, resource contention). Set maxTaskRetryCount to 3-5 for fault-tolerant workloads. When using low-priority VMs, tasks may be preempted and re-queued automatically. Design your tasks to be idempotent so retries produce correct results. Use output files with uploadCondition: taskCompletion to ensure outputs are captured even for failed tasks (for debugging).
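The idempotency advice above has a concrete shape: check for a completion marker before doing work, and write output atomically so a partially written file from a preempted attempt is never mistaken for a finished one. A minimal sketch (paths and the "work" itself are illustrative):

```python
# Idempotent task body: safe to re-run after preemption or retry.
import os
import tempfile

def process_chunk(chunk_id, out_dir):
    final_path = os.path.join(out_dir, f"chunk-{chunk_id}.csv")
    if os.path.exists(final_path):          # an earlier attempt already finished
        return final_path
    result = f"id,value\n{chunk_id},42\n"   # stand-in for the real work
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        f.write(result)                     # write to a temp file first...
    os.replace(tmp_path, final_path)        # ...then rename atomically (POSIX)
    return final_path

out = tempfile.mkdtemp()
first = process_chunk(7, out)
second = process_chunk(7, out)              # a retry is a harmless no-op
assert first == second and os.path.exists(first)
```

With this pattern, setting maxTaskRetryCount to 3-5 is safe: a re-queued task either skips work already done or redoes it without corrupting output.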

Data Management with Azure Storage

Most batch workloads need to read input data and write output data. Azure Batch integrates with Azure Blob Storage for input file staging, output file collection, and application package distribution. Resource files download data to nodes before tasks run, and output files upload results after tasks complete.

bash
# Upload input data to Azure Storage
az storage container create \
  --name input-data \
  --account-name batchdatastorage

az storage blob upload-batch \
  --source ./input-files/ \
  --destination input-data \
  --account-name batchdatastorage

# Generate a SAS token for Batch to access the storage
# (the -v+24H expiry syntax below is BSD/macOS date; on GNU/Linux use
#  --expiry $(date -u -d "+24 hours" '+%Y-%m-%dT%H:%M:%SZ'))
SAS_TOKEN=$(az storage container generate-sas \
  --name input-data \
  --account-name batchdatastorage \
  --permissions rl \
  --expiry $(date -u -v+24H '+%Y-%m-%dT%H:%M:%SZ') \
  --output tsv)

# Create a task with resource files (downloaded before the task runs);
# output container URLs need a write-permission SAS or a managed identity
az batch task create \
  --job-id processing-job \
  --task-id transform-1 \
  --command-line "/bin/bash -c 'python3 transform.py --input data.csv --output results.csv'" \
  --resource-files "[
    {\"httpUrl\": \"https://batchdatastorage.blob.core.windows.net/input-data/data.csv?$SAS_TOKEN\", \"filePath\": \"data.csv\"},
    {\"httpUrl\": \"https://batchdatastorage.blob.core.windows.net/scripts/transform.py?$SAS_TOKEN\", \"filePath\": \"transform.py\"}
  ]" \
  --output-files "[
    {\"filePattern\": \"results.csv\", \"destination\": {\"container\": {\"containerUrl\": \"https://batchdatastorage.blob.core.windows.net/output-data\", \"path\": \"results/transform-1\"}}, \"uploadOptions\": {\"uploadCondition\": \"taskSuccess\"}},
    {\"filePattern\": \"stderr.txt\", \"destination\": {\"container\": {\"containerUrl\": \"https://batchdatastorage.blob.core.windows.net/logs\", \"path\": \"errors/transform-1\"}}, \"uploadOptions\": {\"uploadCondition\": \"taskFailure\"}}
  ]" \
  --account-name productionbatch

Monitoring and Troubleshooting

Monitor Azure Batch workloads through the Azure portal, the Batch Explorer desktop app, Azure Monitor metrics, and the Batch APIs. Key metrics include task completion rate, node utilization, task failure rate, and pool scaling behavior.

bash
# Get job summary statistics
az batch job task-counts show \
  --job-id daily-etl-2026-03-14 \
  --account-name productionbatch

# Get pool usage metrics
az batch pool usage-metrics list \
  --start-time "2026-03-14T00:00:00Z" \
  --end-time "2026-03-14T23:59:59Z" \
  --account-name productionbatch

# Get node status in a pool
az batch node list \
  --pool-id processing-pool \
  --query '[].{ID: id, State: state, RunningTasks: runningTasksCount, IP: ipAddress}' \
  --output table \
  --account-name productionbatch

# View task failure details
az batch task show \
  --job-id daily-etl-2026-03-14 \
  --task-id process-chunk-7 \
  --query '{State: state, ExitCode: executionInfo.exitCode, FailureInfo: executionInfo.failureInfo, RetryCount: executionInfo.retryCount}' \
  --account-name productionbatch

# Reactivate a failed task (retry)
az batch task reactivate \
  --job-id daily-etl-2026-03-14 \
  --task-id process-chunk-7 \
  --account-name productionbatch

Cost Optimization Best Practices

Batch workloads can be expensive if pools are not properly managed. Follow these practices to minimize costs while maintaining performance and reliability.

| Strategy | Savings | Implementation |
| --- | --- | --- |
| Low-priority (Spot) VMs | Up to 80% | Set targetLowPriorityNodes instead of dedicated nodes |
| Auto-scaling to zero | 100% when idle | Auto-scale formula scales to 0 when no tasks are queued |
| Right-size VM selection | 20-50% | Benchmark and choose the smallest sufficient VM |
| Job scheduling | Variable | Run during off-peak hours for better Spot availability |
| Efficient task granularity | 10-30% | Balance task size (not too small, not too large) |
| Pool reuse | Startup time savings | Keep pools running for recurring jobs |
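The first two rows of the table compound: a mostly low-priority pool that also scales to zero when idle can cut the bill dramatically. A back-of-the-envelope model (the hourly price and utilization hours are hypothetical; the 80% discount is the ceiling quoted in the pricing section, not a guarantee):

```python
# Rough monthly-cost model: dedicated vs. low-priority node mix.
# DEDICATED_PRICE is an illustrative hourly rate, not a real quote.
DEDICATED_PRICE = 0.192          # USD/hour for one node (illustrative)
SPOT_DISCOUNT = 0.80             # "up to 80%" ceiling from the pricing section

def monthly_cost(nodes, hours_per_month, low_priority_fraction):
    spot_nodes = nodes * low_priority_fraction
    dedicated_nodes = nodes - spot_nodes
    spot_price = DEDICATED_PRICE * (1 - SPOT_DISCOUNT)
    return hours_per_month * (dedicated_nodes * DEDICATED_PRICE
                              + spot_nodes * spot_price)

all_dedicated = monthly_cost(20, 200, 0.0)
mostly_spot = monthly_cost(20, 200, 0.9)
assert round(all_dedicated, 2) == 768.00
assert mostly_spot < 0.35 * all_dedicated   # ~72% cheaper at a 90% spot mix
```

The model ignores preemption-induced rework, so treat the spot figure as a best case; idempotent, retry-safe tasks keep the real number close to it.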

Batch Explorer Tool

Use the open-source Batch Explorer desktop application for visual monitoring and management of Azure Batch accounts. It provides real-time views of pool utilization, task execution, heat maps of node activity, and the ability to SSH into nodes for debugging. Download it from github.com/Azure/BatchExplorer. For automated monitoring, use Azure Monitor alerts on Batch metrics like PoolNodeCount, TaskCompleteEvent, and JobDeleteCompleteEvent.

Azure Batch provides a powerful, cost-effective platform for parallel and HPC workloads. Define your compute pools with appropriate VM sizes and auto-scaling formulas, organize work into jobs and tasks, use Azure Storage for data management, and leverage low-priority VMs for significant cost savings. For complex workflows, combine Batch with Azure Data Factory or Logic Apps for end-to-end pipeline orchestration.


Key Takeaways

  1. Azure Batch is free; you pay only for underlying VM compute, storage, and networking.
  2. Low-priority (Spot) VMs provide up to 80% cost savings for fault-tolerant workloads.
  3. Auto-scaling formulas dynamically adjust pool size based on pending tasks and queue depth.
  4. Container-based tasks run Docker containers with pre-pulled images from Azure Container Registry.

Frequently Asked Questions

How does Azure Batch differ from Azure Functions?
Azure Batch is designed for large-scale parallel batch processing that runs containers or executables on pools of VMs for minutes to hours. Azure Functions is designed for short-duration (< 10 min), event-driven processing of individual requests. Use Batch for rendering, simulations, ETL, and HPC. Use Functions for API backends, event processing, and lightweight automation.
Can Azure Batch use GPUs?
Yes. Azure Batch supports GPU-enabled VM sizes including NC-series (NVIDIA Tesla), ND-series (NVIDIA A100), and NV-series (NVIDIA Tesla M60). These are ideal for ML training, video rendering, and scientific simulations. Use container-based tasks with NVIDIA Docker runtime for GPU workloads.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.