
GCP Batch Guide

Guide to Google Cloud Batch covering jobs, tasks, container-based workloads, Spot VMs for cost savings, GPU acceleration, multi-step jobs, and Cloud Storage integration.

CloudToolStack Team · 20 min read · Published Mar 14, 2026

Prerequisites

  • Basic understanding of containerization (Docker)
  • GCP account with Compute Engine API enabled
  • Familiarity with Cloud Storage for data staging

Introduction to Google Cloud Batch

Google Cloud Batch is a fully managed service for running batch processing workloads on Google Cloud. It handles provisioning, scheduling, and managing compute resources automatically, so you can focus on your application code rather than infrastructure management. Cloud Batch is ideal for workloads like scientific simulations, financial modeling, video transcoding, machine learning training, bioinformatics pipelines, and any embarrassingly parallel computation.

Unlike traditional HPC cluster management where you maintain a fleet of VMs and a job scheduler like Slurm, Cloud Batch provisions VMs on demand, runs your tasks, and tears down the infrastructure when the job completes. You only pay for the compute resources consumed during job execution. Cloud Batch supports standard VMs, Spot VMs (up to 91% discount), and GPU-accelerated instances for ML and rendering workloads.

This guide covers job creation, task configuration, container and script-based tasks, Spot VM usage, GPU workloads, job dependencies, and integration with GCP services like Cloud Storage, Artifact Registry, and Pub/Sub.

Cloud Batch Pricing

Cloud Batch itself has no additional charge. You pay only for the underlying Compute Engine VMs, persistent disks, and networking used by your jobs. Using Spot VMs can reduce compute costs by 60-91%. A typical batch job using e2-standard-4 VMs costs about $0.134/hr per VM on demand, or roughly $0.04/hr per VM with Spot pricing (Spot rates vary by region and over time).

Core Concepts

Cloud Batch organizes work into jobs, task groups, and tasks. Understanding these concepts is essential for designing efficient batch workflows.

Batch Architecture Components

| Concept | Description | Analogy |
|---|---|---|
| Job | Top-level unit of work containing one or more task groups | A pipeline or workflow |
| Task Group | Collection of tasks that share the same configuration | A stage in a pipeline |
| Task | A single unit of work (a script or container to run) | One iteration of a loop |
| Runnable | A script or container within a task (sequential steps) | A command in a shell script |
| Allocation Policy | Rules for VM provisioning (machine type, spot, GPUs) | Cluster configuration |
| Logs Policy | Where job logs are sent | Logging configuration |

Creating Your First Batch Job

You can create Cloud Batch jobs using the gcloud CLI, REST API, or client libraries. Jobs are defined as JSON or YAML configurations that specify what to run, where to run it, and how many parallel tasks to execute.

bash
# Enable the Batch API
gcloud services enable batch.googleapis.com

# Create a simple script-based batch job
gcloud batch jobs submit my-first-job \
  --location=us-central1 \
  --config=- << 'EOF'
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "script": {
          "text": "echo \"Hello from task ${BATCH_TASK_INDEX} on $(hostname)\"; sleep 10"
        }
      }],
      "computeResource": {
        "cpuMilli": 2000,
        "memoryMib": 2048
      },
      "maxRetryCount": 2,
      "maxRunDuration": "600s"
    },
    "taskCount": 10,
    "parallelism": 5
  }],
  "allocationPolicy": {
    "instances": [{
      "policy": {
        "machineType": "e2-standard-2"
      }
    }]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
EOF

# Check job status
gcloud batch jobs describe my-first-job \
  --location=us-central1 \
  --format="table(name,status.state,status.statusEvents)"

# List all jobs
gcloud batch jobs list --location=us-central1

# View job logs
gcloud logging read "resource.type=batch.googleapis.com/Job AND labels.job_uid=JOB_UID" \
  --limit=50 --format="table(timestamp,textPayload)"

Container-Based Batch Jobs

For production workloads, container-based tasks are preferred over script tasks because they provide reproducible environments, dependency isolation, and portability. Cloud Batch can pull container images from Artifact Registry, Container Registry, or Docker Hub.

container-job.json
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-central1-docker.pkg.dev/MY_PROJECT/my-repo/data-processor:latest",
          "commands": ["--input-bucket", "my-input-bucket", "--output-bucket", "my-output-bucket"],
          "entrypoint": "/app/process"
        }
      }],
      "volumes": [{
        "gcs": {
          "remotePath": "my-input-bucket"
        },
        "mountPath": "/mnt/input"
      }, {
        "gcs": {
          "remotePath": "my-output-bucket"
        },
        "mountPath": "/mnt/output"
      }],
      "computeResource": {
        "cpuMilli": 4000,
        "memoryMib": 8192
      },
      "maxRetryCount": 3,
      "maxRunDuration": "3600s"
    },
    "taskCount": 100,
    "parallelism": 20,
    "taskEnvironments": [{
      "variables": {
        "REGION": "us-central1",
        "BATCH_SIZE": "1000"
      }
    }]
  }],
  "allocationPolicy": {
    "instances": [{
      "policy": {
        "machineType": "e2-standard-4",
        "provisioningModel": "STANDARD"
      }
    }],
    "serviceAccount": {
      "email": "batch-sa@MY_PROJECT.iam.gserviceaccount.com"
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
bash
# Submit the container-based job
gcloud batch jobs submit data-processing-job \
  --location=us-central1 \
  --config=container-job.json

# Monitor task completion
gcloud batch tasks list \
  --job=data-processing-job \
  --location=us-central1 \
  --format="table(name,status.state)"

# View individual task details
gcloud batch tasks describe \
  "projects/MY_PROJECT/locations/us-central1/jobs/data-processing-job/taskGroups/group0/tasks/0"

Task Index for Data Partitioning

Each task receives the environment variables BATCH_TASK_INDEX (0-based index) and BATCH_TASK_COUNT (total tasks). Use these to partition your input data across tasks. For example, if processing 1 million files with 100 tasks, task 0 processes files 0-9999, task 1 processes 10000-19999, and so on. This pattern enables efficient parallel processing without external coordination.
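The arithmetic above can be sketched as a small wrapper script. The file count and the fallback defaults are illustrative; in a real job, Cloud Batch injects BATCH_TASK_INDEX and BATCH_TASK_COUNT for you:

```shell
#!/usr/bin/env bash
# Sketch: compute this task's slice of the input from the Batch-provided
# index variables. TOTAL_FILES and the defaults are illustrative values
# for local testing only.
TOTAL_FILES=1000000
TASK_INDEX="${BATCH_TASK_INDEX:-0}"
TASK_COUNT="${BATCH_TASK_COUNT:-100}"

FILES_PER_TASK=$(( TOTAL_FILES / TASK_COUNT ))
START=$(( TASK_INDEX * FILES_PER_TASK ))
if [ "${TASK_INDEX}" -eq $(( TASK_COUNT - 1 )) ]; then
  END=$(( TOTAL_FILES - 1 ))   # last task absorbs any remainder
else
  END=$(( START + FILES_PER_TASK - 1 ))
fi
echo "Task ${TASK_INDEX}: processing files ${START}-${END}"
```

Each task runs this same script; only the injected index differs, so no external coordinator is needed.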

Using Spot VMs for Cost Savings

Spot VMs offer the same machine types as standard VMs but at a 60-91% discount. The tradeoff is that Google can preempt (reclaim) Spot VMs at any time with a 30-second warning. Cloud Batch handles preemption gracefully by automatically retrying preempted tasks on new VMs, making it an excellent fit for batch workloads.

spot-vm-job.json
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-central1-docker.pkg.dev/MY_PROJECT/my-repo/simulation:v2",
          "entrypoint": "/app/run-simulation"
        }
      }],
      "computeResource": {
        "cpuMilli": 8000,
        "memoryMib": 32768
      },
      "maxRetryCount": 5,
      "maxRunDuration": "7200s"
    },
    "taskCount": 500,
    "parallelism": 50
  }],
  "allocationPolicy": {
    "instances": [{
      "policy": {
        "machineType": "c2-standard-8",
        "provisioningModel": "SPOT"
      }
    }],
    "location": {
      "allowedLocations": [
        "zones/us-central1-a",
        "zones/us-central1-b",
        "zones/us-central1-c",
        "zones/us-central1-f"
      ]
    }
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}

Spot VM Best Practices

Design tasks to be idempotent (safe to retry) and checkpoint their progress to Cloud Storage periodically. Set maxRetryCount to 3-5 so preempted tasks are automatically retried. Allow multiple zones in the allocation policy to maximize Spot VM availability. Keep individual task duration under 1-2 hours to minimize wasted work on preemption. For critical deadlines, consider mixing Spot and standard VMs.
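A minimal sketch of such an idempotent task wrapper, assuming an illustrative bucket name and worker binary (my-bucket, /app/run-simulation are placeholders, not names from a real project):

```shell
#!/usr/bin/env bash
# Sketch of an idempotent task wrapper for Spot VMs. The bucket name and
# worker binary (my-bucket, /app/run-simulation) are illustrative.
set -euo pipefail

run_task() {
  local idx="$1"
  local output="gs://my-bucket/output/${idx}.csv"

  # If a previous attempt already uploaded its result, a retry after
  # preemption becomes a no-op instead of redoing the work.
  if gsutil -q stat "${output}" 2>/dev/null; then
    echo "task ${idx}: already done"
    return 0
  fi

  # Compute into a local temp file, then upload in one step so a
  # preemption mid-run never leaves a partial object in the bucket.
  /app/run-simulation --task "${idx}" --out "/tmp/result-${idx}.csv"
  gsutil cp "/tmp/result-${idx}.csv" "${output}"
  echo "task ${idx}: uploaded"
}

# Cloud Batch sets BATCH_TASK_INDEX; skip the call when testing locally.
if [ -n "${BATCH_TASK_INDEX:-}" ]; then
  run_task "${BATCH_TASK_INDEX}"
fi
```

Because the check-then-skip logic keys off the final output object, setting maxRetryCount high costs almost nothing for tasks that already finished before preemption.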

GPU Workloads

Cloud Batch supports GPU-accelerated instances for machine learning training, inference, rendering, and scientific computing. You can attach NVIDIA GPUs (T4, L4, A100, H100) to your batch jobs and use them with CUDA-enabled containers.

gpu-job.json
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-central1-docker.pkg.dev/MY_PROJECT/ml-repo/training:latest",
          "entrypoint": "python",
          "commands": [
            "train.py",
            "--epochs", "50",
            "--batch-size", "64",
            "--task-index", "${BATCH_TASK_INDEX}"
          ],
          "options": "--gpus all"
        }
      }],
      "computeResource": {
        "cpuMilli": 16000,
        "memoryMib": 65536
      },
      "maxRetryCount": 2,
      "maxRunDuration": "14400s"
    },
    "taskCount": 8,
    "parallelism": 8
  }],
  "allocationPolicy": {
    "instances": [{
      "installGpuDrivers": true,
      "policy": {
        "machineType": "g2-standard-16",
        "accelerators": [{
          "type": "nvidia-l4",
          "count": 1
        }]
      }
    }]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
bash
# Submit the GPU job
gcloud batch jobs submit ml-training-job \
  --location=us-central1 \
  --config=gpu-job.json

# Check GPU availability in a region
gcloud compute accelerator-types list \
  --filter="zone:us-central1-a" \
  --format="table(name,description,zone)"

Multi-Step Jobs with Dependencies

A single task can have multiple runnables that execute sequentially. This is useful for setup, processing, and cleanup steps within a single task. For job-level dependencies (one job must complete before another starts), use Workflows or Cloud Composer to orchestrate multiple Batch jobs.

multi-step-job.json
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [
        {
          "script": {
            "text": "echo 'Step 1: Downloading data...'; gsutil cp gs://my-bucket/input/${BATCH_TASK_INDEX}.csv /tmp/input.csv"
          },
          "displayName": "download-data"
        },
        {
          "container": {
            "imageUri": "us-central1-docker.pkg.dev/MY_PROJECT/my-repo/processor:latest",
            "commands": ["--input", "/tmp/input.csv", "--output", "/tmp/output.csv"]
          },
          "displayName": "process-data"
        },
        {
          "script": {
            "text": "echo 'Step 3: Uploading results...'; gsutil cp /tmp/output.csv gs://my-bucket/output/${BATCH_TASK_INDEX}.csv"
          },
          "displayName": "upload-results"
        }
      ],
      "computeResource": {
        "cpuMilli": 4000,
        "memoryMib": 8192
      },
      "maxRetryCount": 3,
      "maxRunDuration": "1800s"
    },
    "taskCount": 50,
    "parallelism": 10
  }],
  "allocationPolicy": {
    "instances": [{
      "policy": {
        "machineType": "e2-standard-4",
        "provisioningModel": "SPOT"
      }
    }]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
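The job-level dependencies mentioned above can also be scripted directly with gcloud: submit one job, poll its state until it reaches a terminal state, then submit the next. A minimal sketch (the job names and config files in the usage comment are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: chain two Batch jobs by polling state between submissions.
# For complex DAGs, prefer Workflows or Cloud Composer.
set -euo pipefail

# Block until a job reaches a terminal state; succeed only on SUCCEEDED.
wait_for_job() {
  local job="$1" location="$2" state
  while true; do
    state=$(gcloud batch jobs describe "${job}" \
      --location="${location}" --format="value(status.state)")
    case "${state}" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
      *)         sleep 30 ;;   # still QUEUED, SCHEDULED, or RUNNING
    esac
  done
}

# Example usage (job names and configs are illustrative):
#   gcloud batch jobs submit preprocess-job --location=us-central1 --config=preprocess.json
#   wait_for_job preprocess-job us-central1
#   gcloud batch jobs submit training-job --location=us-central1 --config=training.json
```

This is fine for a two-stage pipeline; once you have fan-in or conditional branches, an orchestrator pays for itself.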

Mounting Cloud Storage Volumes

Cloud Batch can mount Cloud Storage buckets as local file system paths using Cloud Storage FUSE. This allows your scripts and containers to read and write data using standard file I/O operations instead of using gsutil or the Cloud Storage client libraries.

Volume mount configuration
{
  "taskGroups": [{
    "taskSpec": {
      "runnables": [{
        "script": {
          "text": "ls /mnt/input/ && python /app/process.py --in /mnt/input --out /mnt/output"
        }
      }],
      "volumes": [
        {
          "gcs": { "remotePath": "my-input-bucket/data" },
          "mountPath": "/mnt/input"
        },
        {
          "gcs": { "remotePath": "my-output-bucket/results" },
          "mountPath": "/mnt/output"
        },
        {
          "deviceName": "scratch",
          "mountPath": "/mnt/scratch"
        }
      ],
      "computeResource": {
        "cpuMilli": 4000,
        "memoryMib": 16384,
        "bootDiskMib": 30720
      }
    },
    "taskCount": 20,
    "parallelism": 10
  }],
  "allocationPolicy": {
    "instances": [{
      "policy": {
        "machineType": "e2-standard-4",
        "disks": [{
          "newDisk": {
            "sizeGb": 100,
            "type": "pd-ssd"
          },
          "deviceName": "scratch"
        }]
      }
    }]
  }
}

Monitoring and Troubleshooting

bash
# View detailed job status
gcloud batch jobs describe my-job \
  --location=us-central1 \
  --format="yaml(status)"

# List tasks with their states
gcloud batch tasks list \
  --job=my-job \
  --location=us-central1 \
  --format="table(name.basename(),status.state)" \
  --sort-by=name

# View logs for a specific task
gcloud logging read \
  'resource.type="batch.googleapis.com/Job"
   labels.job_uid="JOB_UID"
   labels.task_id="group0-0-0"' \
  --limit=100 \
  --format="table(timestamp,textPayload)"

# Delete a running job (cancels all tasks)
gcloud batch jobs delete my-job \
  --location=us-central1

# List jobs by state
gcloud batch jobs list \
  --location=us-central1 \
  --filter="status.state=SUCCEEDED" \
  --format="table(name.basename(),createTime,status.state)"

Cost Estimation

Estimate batch job cost before submitting: multiply the VM hourly rate by the number of VMs by estimated runtime. For example, 50 e2-standard-4 Spot VMs running for 1 hour costs approximately 50 x $0.0402/hr = $2.01. Compare this to running a permanent cluster that costs $0.134/hr x 50 VMs x 730 hours = $4,891/month even when idle. Cloud Batch eliminates idle cost entirely.
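The arithmetic above can be wrapped in a small helper. The rates are the example figures from this section; always check current Compute Engine pricing before relying on them:

```shell
# Sketch: back-of-envelope Batch job cost (VMs x hourly rate x hours).
# The rates below are the example figures from this guide, not live prices.
estimate_cost() {
  local vms="$1" rate_per_hr="$2" hours="$3"
  # awk does the floating-point math that bash integer arithmetic cannot.
  awk -v n="${vms}" -v r="${rate_per_hr}" -v h="${hours}" \
    'BEGIN { printf "%.2f\n", n * r * h }'
}

estimate_cost 50 0.0402 1     # 50 Spot e2-standard-4 VMs for 1 hour -> 2.01
estimate_cost 50 0.134 730    # the same fleet running 24/7 for a month -> 4891.00
```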

Cleanup

bash
# Delete completed jobs
gcloud batch jobs list --location=us-central1 \
  --filter="status.state=SUCCEEDED" \
  --format="value(name)" | while read job; do
  gcloud batch jobs delete "${job}" --quiet
done

# Delete failed jobs
gcloud batch jobs list --location=us-central1 \
  --filter="status.state=FAILED" \
  --format="value(name)" | while read job; do
  gcloud batch jobs delete "${job}" --quiet
done

Key Takeaways

  1. Cloud Batch provisions VMs on demand and tears them down when jobs complete, eliminating idle costs.
  2. Spot VMs reduce batch compute costs by 60-91% with automatic retry on preemption.
  3. BATCH_TASK_INDEX and BATCH_TASK_COUNT environment variables enable efficient data partitioning.
  4. GPU workloads support NVIDIA T4, L4, A100, and H100 with automatic driver installation.
  5. Cloud Storage FUSE mounts enable standard file I/O without rewriting applications for cloud APIs.
  6. Multi-step runnables enable download, process, upload patterns within a single task.

Frequently Asked Questions

How does Cloud Batch differ from Dataflow?
Cloud Batch runs arbitrary scripts and containers as batch jobs on VMs you configure. Dataflow is a managed Apache Beam runner for data processing pipelines with built-in windowing, streaming, and data transformation. Use Batch for general-purpose compute tasks (simulations, rendering, ML training) and Dataflow for data ETL pipelines.
Is there a charge for the Batch service itself?
No. Cloud Batch has no additional service fee. You pay only for the underlying Compute Engine VMs, persistent disks, and networking used by your jobs. The Batch orchestration, scheduling, and management are free.
Can Cloud Batch replace a Slurm cluster?
For many HPC workloads, yes. Cloud Batch handles job scheduling, resource provisioning, and task distribution similarly to Slurm. However, Batch lacks some Slurm features like job priorities, fair-share scheduling, and node reservations. For Slurm-specific features, GCP also offers HPC Toolkit with native Slurm on Compute Engine.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.