
GPU & ML Training Across Clouds

Compare GPU instances, pricing, and ML platforms across AWS, Azure, GCP, OCI for training and inference workloads.

CloudToolStack Team · 24 min read · Published Mar 14, 2026

Prerequisites

  • Understanding of machine learning training concepts
  • Familiarity with GPU types (H100, A100, T4, L4)

GPU and ML Infrastructure Across Clouds

Training large machine learning models requires significant GPU compute, high-bandwidth networking, and fast storage. Each cloud provider offers different GPU instance types, pricing models, and ML platform services. With the explosive growth in AI/ML workloads and chronic GPU shortages, understanding the GPU landscape across providers helps you find capacity, optimize costs, and design efficient training pipelines.

This guide compares GPU instance types and availability across AWS, Azure, GCP, and OCI, covers pricing and reservation strategies, explains distributed training architectures, and provides practical configurations for common ML training workflows.

GPU Shortage Reality

NVIDIA H100 and A100 GPUs are in extremely high demand. Spot/preemptible availability is limited and prices fluctuate significantly. To secure GPU capacity, use reserved instances (1-3 year commitments), request quota increases early, and consider using multiple cloud providers for redundancy. OCI often has better GPU availability due to lower market share.

GPU Instance Types Compared

| GPU | AWS | Azure | GCP | OCI |
|---|---|---|---|---|
| NVIDIA H100 (80GB) | p5.48xlarge (8 GPUs) | ND H100 v5 (8 GPUs) | a3-highgpu-8g (8 GPUs) | BM.GPU.H100.8 (8 GPUs) |
| NVIDIA A100 (80GB) | p4de.24xlarge (8 GPUs) | ND A100 v4 (8 GPUs) | a2-ultragpu-8g (8 GPUs) | BM.GPU.A100-v2.8 (8 GPUs) |
| NVIDIA A10G | g5.xlarge-48xlarge | NVadsA10 v5 | g2-standard (L4) | VM.GPU.A10.1-2 |
| NVIDIA T4 | g4dn.xlarge-metal | NC T4 v3 | n1-standard + T4 | VM.GPU3.1-4 (V100) |
| Google TPU | N/A | N/A | TPU v4, v5e, v5p | N/A |
| AWS Trainium/Inferentia | trn1.32xlarge, inf2 | N/A | N/A | N/A |

Pricing Comparison

| Instance (8x A100 80GB) | On-Demand ($/hr) | Spot/Preemptible ($/hr) | 1-Year Reserved ($/hr) |
|---|---|---|---|
| AWS p4de.24xlarge | ~$40.97 | ~$12-20 (variable) | ~$25.81 (All Upfront) |
| Azure ND A100 v4 | ~$32.77 | ~$9-15 (variable) | ~$21.30 (1yr RI) |
| GCP a2-ultragpu-8g | ~$29.39 | ~$8-12 (preemptible) | ~$18.53 (1yr CUD) |
| OCI BM.GPU.A100-v2.8 | ~$27.20 | ~$8.16 (preemptible) | Flex pricing available |

Prices Vary by Region and Change Frequently

GPU pricing varies significantly by region and changes frequently; the prices above are approximate as of early 2026. Spot/preemptible prices can swing by 50% or more within a single day, so always check the provider's pricing calculator or pricing API for current estimates.
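To see how these hourly rates compound, the sketch below uses the approximate on-demand and 1-year reserved figures from the table above (illustrative assumptions, not quoted prices) to compute monthly cost for an 8x A100 node and the utilization at which a no-upfront reservation breaks even against pay-per-use on-demand:

```python
# Rough cost comparison for an 8x A100 node, using the approximate
# hourly rates from the table above (assumptions, not quoted prices).
HOURS_PER_MONTH = 730

rates = {
    # provider/instance: (on_demand, reserved_1yr) in $/hr
    "AWS p4de.24xlarge": (40.97, 25.81),
    "Azure ND A100 v4": (32.77, 21.30),
    "GCP a2-ultragpu-8g": (29.39, 18.53),
}

def monthly_cost(hourly, utilization=1.0):
    """Cost per month at a given fraction of hours actually used."""
    return hourly * HOURS_PER_MONTH * utilization

def breakeven_utilization(on_demand, reserved):
    """Utilization above which a reservation billed for every hour
    is cheaper than paying on-demand only for hours used."""
    return reserved / on_demand

for name, (od, ri) in rates.items():
    print(f"{name}: on-demand ${monthly_cost(od):,.0f}/mo, "
          f"reserved ${monthly_cost(ri):,.0f}/mo, "
          f"break-even at {breakeven_utilization(od, ri):.0%} utilization")
```

The break-even point is the key number: a cluster that sits idle more than a third of the time is often cheaper on-demand than reserved, even at these premium rates.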

Setting Up GPU Training on Each Cloud

AWS SageMaker Training

bash
# Launch a SageMaker training job with multiple GPUs
aws sagemaker create-training-job \
  --training-job-name "llm-finetune-$(date +%Y%m%d)" \
  --algorithm-specification '{
    "TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker",
    "TrainingInputMode": "FastFile"
  }' \
  --resource-config '{
    "InstanceType": "ml.p4de.24xlarge",
    "InstanceCount": 4,
    "VolumeSizeInGB": 500
  }' \
  --input-data-config '[{
    "ChannelName": "training",
    "DataSource": {"S3DataSource": {"S3Uri": "s3://training-data/dataset/", "S3DataType": "S3Prefix"}},
    "InputMode": "FastFile"
  }]' \
  --output-data-config '{"S3OutputPath": "s3://model-artifacts/"}' \
  --role-arn "arn:aws:iam::123456789012:role/SageMakerRole" \
  --stopping-condition '{"MaxRuntimeInSeconds": 86400}'

# Or launch EC2 GPU instances directly
aws ec2 run-instances \
  --instance-type p4de.24xlarge \
  --image-id ami-deep-learning \
  --key-name ml-keypair \
  --count 4 \
  --placement '{"GroupName": "ml-cluster", "Tenancy": "default"}' \
  --network-interfaces '[{"DeviceIndex": 0, "SubnetId": "subnet-abc123", "Groups": ["sg-ml"]}]'

GCP Vertex AI Training

bash
# Submit a Vertex AI custom training job
gcloud ai custom-jobs create \
  --display-name="llm-finetune" \
  --region=us-central1 \
  --worker-pool-spec=machine-type=a2-ultragpu-8g,replica-count=4,accelerator-type=NVIDIA_A100_80GB,accelerator-count=8,container-image-uri=us-docker.pkg.dev/PROJECT/repo/training:latest

# Create a GKE node pool with GPUs
gcloud container node-pools create gpu-pool \
  --cluster=ml-cluster \
  --zone=us-central1-a \
  --machine-type=a2-ultragpu-8g \
  --accelerator=type=nvidia-a100-80gb,count=8 \
  --num-nodes=4 \
  --min-nodes=0 \
  --max-nodes=8 \
  --enable-autoscaling \
  --spot

Azure ML Training

bash
# Create an Azure ML compute cluster
az ml compute create \
  --name gpu-cluster \
  --type AmlCompute \
  --size Standard_ND96amsr_A100_v4 \
  --min-instances 0 \
  --max-instances 8 \
  --idle-time-before-scale-down 600 \
  --resource-group ml-rg \
  --workspace-name ml-workspace

# Submit a training job
az ml job create --file training-job.yml \
  --resource-group ml-rg \
  --workspace-name ml-workspace
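The az ml job create command above references a training-job.yml. A minimal sketch of what such a command-job spec might look like (the script path and curated environment name are assumptions; consult the Azure ML job schema for the full field list):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./src                       # assumed local folder containing train.py
command: python train.py --epochs 10
environment: azureml:my-pytorch-gpu-env@latest  # assumed environment name
compute: azureml:gpu-cluster      # the cluster created above
distribution:
  type: pytorch
  process_count_per_instance: 8   # one process per GPU
resources:
  instance_count: 4
```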

Distributed Training Architecture

Large model training requires distributed training across multiple GPUs and nodes. The two main parallelism strategies are data parallelism (each GPU processes different data with the same model) and model parallelism (the model is split across GPUs). Modern frameworks like DeepSpeed, FSDP, and Megatron-LM support hybrid parallelism.
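In data parallelism the effective (global) batch size is the per-GPU micro-batch times gradient-accumulation steps times the number of GPUs, and DeepSpeed enforces exactly this relation between its batch-size settings. The DeepSpeed example later in this guide uses train_batch_size 256 with accumulation 4 across 4 nodes of 8 GPUs, which implies a micro-batch of 2 per GPU. A small sketch of the arithmetic:

```python
# Global batch size arithmetic for data-parallel training.
# Assumed layout: 4 nodes x 8 GPUs, matching the DeepSpeed example
# in this guide.

def global_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Effective batch size seen by the optimizer per update."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

world_size = 4 * 8   # nodes x GPUs per node = 32 GPUs
micro_batch = 2      # samples per GPU per forward/backward pass
grad_accum = 4       # micro-batches accumulated before each update

# DeepSpeed's train_batch_size must equal this product, or
# initialization fails with a batch-size mismatch error.
assert global_batch_size(micro_batch, grad_accum, world_size) == 256
```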

Network Requirements for Distributed Training

| Feature | AWS | Azure | GCP |
|---|---|---|---|
| GPU interconnect | NVLink/NVSwitch (intra-node) | NVLink/NVSwitch | NVLink/NVSwitch |
| Inter-node networking | EFA (400 Gbps) | InfiniBand (400 Gbps) | GPUDirect-TCPX (200 Gbps) |
| Placement groups | Cluster placement group | Proximity placement group | Compact placement policy |
| Storage for training | FSx for Lustre, S3 | Blob, NetApp Files | Filestore, GCS FUSE |

python
# PyTorch distributed training with DeepSpeed
# launch command: deepspeed --num_gpus=8 --num_nodes=4 train.py

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def train():
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

    ds_config = {
        "train_batch_size": 256,
        "train_micro_batch_size_per_gpu": 2,  # 2 x 4 accum x 32 GPUs = 256
        "gradient_accumulation_steps": 4,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": 2e-5, "weight_decay": 0.01}
        },
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {"warmup_num_steps": 100, "total_num_steps": 10000}
        }
    }

    # DeepSpeed builds the optimizer and scheduler from the config,
    # so the model parameters must be passed explicitly
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    # Dataloader construction (tokenization, batching, distributed
    # sampler) omitted for brevity
    for batch in dataloader:
        batch = {k: v.to(model_engine.device) for k, v in batch.items()}
        # Passing labels makes the causal LM return its loss directly
        outputs = model_engine(**batch, labels=batch["input_ids"])
        model_engine.backward(outputs.loss)
        model_engine.step()

if __name__ == "__main__":
    train()

Cost Optimization for GPU Workloads

| Strategy | Savings | Trade-off |
|---|---|---|
| Spot/preemptible instances | 60-90% | Interruption risk (use checkpointing) |
| Reserved instances (1yr) | 30-40% | Commitment, inflexibility |
| Right-sizing GPUs | 20-50% | May need architecture changes |
| Mixed precision (FP16/BF16) | ~2x training speed | Minor accuracy impact (usually negligible) |
| Gradient checkpointing | Larger batch sizes, fewer GPUs | ~20% slower per step |
| Multi-cloud spot arbitrage | Variable | Complexity, data transfer costs |
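Spot savings depend on surviving interruptions, which means checkpointing training state regularly and resuming from the latest checkpoint on restart. A framework-agnostic sketch using only the standard library (a real run would persist model and optimizer state dicts, e.g. with torch.save, to durable storage such as S3, GCS, or FSx):

```python
import json
import os
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # should live on durable, shared storage

def save_checkpoint(step, state):
    """Write a checkpoint atomically so an interruption mid-write
    never corrupts the latest usable checkpoint."""
    CKPT_DIR.mkdir(exist_ok=True)
    tmp = CKPT_DIR / f"step-{step}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    os.replace(tmp, CKPT_DIR / f"step-{step}.json")  # atomic rename

def latest_checkpoint():
    """Return the highest-step checkpoint to resume from, or None."""
    ckpts = sorted(CKPT_DIR.glob("step-*.json"),
                   key=lambda p: int(p.stem.split("-")[1]))
    return json.loads(ckpts[-1].read_text()) if ckpts else None

# Training loop: resume if a checkpoint exists, checkpoint every 50 steps.
resume = latest_checkpoint()
start_step = resume["step"] + 1 if resume else 0
for step in range(start_step, start_step + 100):
    # ... one training step (forward, backward, optimizer update) ...
    if step % 50 == 0:
        save_checkpoint(step, {"loss": 0.0})  # placeholder state
```

Checkpoint frequency is a trade-off: too rare and an interruption wastes hours of GPU time; too frequent and storage I/O slows every step.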

Inference Across Clouds

bash
# AWS: Deploy model with SageMaker inference
aws sagemaker create-endpoint-config \
  --endpoint-config-name "llm-endpoint-config" \
  --production-variants '[{
    "VariantName": "primary",
    "ModelName": "my-llm-model",
    "InstanceType": "ml.g5.2xlarge",
    "InitialInstanceCount": 2,
    "InitialVariantWeight": 1
  }]'

# GCP: Deploy with Vertex AI
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --model=MODEL_ID \
  --display-name="llm-deployment" \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --min-replica-count=1 \
  --max-replica-count=10

# Azure: Deploy with Azure ML
az ml online-deployment create \
  --name gpu-deployment \
  --endpoint-name llm-endpoint \
  --model azureml:my-model@latest \
  --instance-type Standard_NC24ads_A100_v4 \
  --instance-count 2

Use Managed AI Services When Possible

Before training your own models, evaluate managed AI services: AWS Bedrock, Azure OpenAI Service, GCP Vertex AI with Gemini. These services provide access to state-of-the-art models without GPU management. Fine-tuning on managed services is often cheaper and faster than training from scratch. Only train custom models when you need domain-specific capabilities that managed services do not provide.


Key Takeaways

  1. H100 and A100 instances are available on all major clouds with different pricing and availability.
  2. OCI often has better GPU availability and competitive pricing due to lower market share.
  3. Spot/preemptible instances save 60-90% but require checkpointing for training resilience.
  4. Distributed training requires EFA (AWS), InfiniBand (Azure), or GPUDirect-TCPX (GCP).

Frequently Asked Questions

Which cloud is cheapest for GPU training?
GCP and OCI tend to have lower on-demand pricing. Spot prices fluctuate across providers. Factor in data transfer costs if your data is in a different cloud.
Should I train models or use managed AI services?
Evaluate managed services first (Bedrock, Azure OpenAI, Vertex AI). Only train custom models when managed services cannot provide domain-specific capabilities.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.