
GPU & ML Training Across Clouds

Compare GPU instances, pricing, and ML platforms across AWS, Azure, GCP, OCI for training and inference workloads.

CloudToolStack Team · 24 min read · Published Mar 14, 2026

Prerequisites

  • Understanding of machine learning training concepts
  • Familiarity with GPU types (H100, A100, T4, L4)

GPU and ML Infrastructure Across Clouds

Training large machine learning models requires significant GPU compute, high-bandwidth networking, and fast storage. Each cloud provider offers different GPU instance types, pricing models, and ML platform services. With the explosive growth in AI/ML workloads and chronic GPU shortages, understanding the GPU landscape across providers helps you find capacity, optimize costs, and design efficient training pipelines.

This guide compares GPU instance types and availability across AWS, Azure, GCP, and OCI, covers pricing and reservation strategies, explains distributed training architectures, and provides practical configurations for common ML training workflows.

GPU Shortage Reality

NVIDIA H100 and A100 GPUs are in extremely high demand. Spot/preemptible availability is limited and prices fluctuate significantly. To secure GPU capacity, use reserved instances (1-3 year commitments), request quota increases early, and consider using multiple cloud providers for redundancy. OCI often has better GPU availability due to lower market share.

GPU Instance Types Compared

| GPU | AWS | Azure | GCP | OCI |
|---|---|---|---|---|
| NVIDIA H100 (80GB) | p5.48xlarge (8 GPUs) | ND H100 v5 (8 GPUs) | a3-highgpu-8g (8 GPUs) | BM.GPU.H100.8 (8 GPUs) |
| NVIDIA A100 (80GB) | p4de.24xlarge (8 GPUs) | ND A100 v4 (8 GPUs) | a2-ultragpu-8g (8 GPUs) | BM.GPU.A100-v2.8 (8 GPUs) |
| NVIDIA A10G | g5.xlarge-48xlarge | NVadsA10 v5 | g2-standard (L4) | VM.GPU.A10.1-2 |
| NVIDIA T4 | g4dn.xlarge-metal | NC T4 v3 | n1-standard + T4 | VM.GPU3.1-4 (V100) |
| Google TPU | N/A | N/A | TPU v4, v5e, v5p | N/A |
| AWS Trainium/Inferentia | trn1.32xlarge, inf2 | N/A | N/A | N/A |

Pricing Comparison

| Instance (8x A100 80GB) | On-Demand ($/hr) | Spot/Preemptible ($/hr) | 1-Year Reserved ($/hr) |
|---|---|---|---|
| AWS p4de.24xlarge | ~$40.97 | ~$12-20 (variable) | ~$25.81 (All Upfront) |
| Azure ND A100 v4 | ~$32.77 | ~$9-15 (variable) | ~$21.30 (1yr RI) |
| GCP a2-ultragpu-8g | ~$29.39 | ~$8-12 (preemptible) | ~$18.53 (1yr CUD) |
| OCI BM.GPU.A100-v2.8 | ~$27.20 | ~$8.16 (preemptible) | Flex pricing available |

Prices Vary by Region and Change Frequently

GPU pricing varies significantly by region and changes frequently; the prices above are approximate as of early 2026. Spot/preemptible prices can swing by 50% or more within a single day, so always check the provider's pricing calculator or pricing API for current estimates.
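To see how these hourly rates compound, the sketch below uses the approximate on-demand and 1-year reserved figures from the table above (illustrative assumptions, not quoted prices) to compute monthly cost for an 8x A100 node and the utilization at which a no-upfront reservation breaks even against pay-per-use on-demand:

```python
# Rough cost comparison for an 8x A100 node, using the approximate
# hourly rates from the table above (assumptions, not quoted prices).
HOURS_PER_MONTH = 730

rates = {
    # provider/instance: (on_demand, reserved_1yr) in $/hr
    "AWS p4de.24xlarge": (40.97, 25.81),
    "Azure ND A100 v4": (32.77, 21.30),
    "GCP a2-ultragpu-8g": (29.39, 18.53),
}

def monthly_cost(hourly, utilization=1.0):
    """Cost per month at a given fraction of hours actually used."""
    return hourly * HOURS_PER_MONTH * utilization

def breakeven_utilization(on_demand, reserved):
    """Utilization above which a reservation billed for every hour
    is cheaper than paying on-demand only for hours used."""
    return reserved / on_demand

for name, (od, ri) in rates.items():
    print(f"{name}: on-demand ${monthly_cost(od):,.0f}/mo, "
          f"reserved ${monthly_cost(ri):,.0f}/mo, "
          f"break-even at {breakeven_utilization(od, ri):.0%} utilization")
```

The break-even point is the key number: a cluster that sits idle more than a third of the time is often cheaper on-demand than reserved, even at these premium rates.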

Setting Up GPU Training on Each Cloud

AWS SageMaker Training

bash
# Launch a SageMaker training job with multiple GPUs
aws sagemaker create-training-job \
  --training-job-name "llm-finetune-$(date +%Y%m%d)" \
  --algorithm-specification '{
    "TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker",
    "TrainingInputMode": "FastFile"
  }' \
  --resource-config '{
    "InstanceType": "ml.p4de.24xlarge",
    "InstanceCount": 4,
    "VolumeSizeInGB": 500
  }' \
  --input-data-config '[{
    "ChannelName": "training",
    "DataSource": {"S3DataSource": {"S3Uri": "s3://training-data/dataset/", "S3DataType": "S3Prefix"}},
    "InputMode": "FastFile"
  }]' \
  --output-data-config '{"S3OutputPath": "s3://model-artifacts/"}' \
  --role-arn "arn:aws:iam::123456789012:role/SageMakerRole" \
  --stopping-condition '{"MaxRuntimeInSeconds": 86400}'

# Or launch EC2 GPU instances directly
aws ec2 run-instances \
  --instance-type p4de.24xlarge \
  --image-id ami-deep-learning \
  --key-name ml-keypair \
  --count 4 \
  --placement '{"GroupName": "ml-cluster", "Tenancy": "default"}' \
  --network-interfaces '[{"DeviceIndex": 0, "SubnetId": "subnet-abc123", "Groups": ["sg-ml"]}]'

GCP Vertex AI Training

bash
# Submit a Vertex AI custom training job
gcloud ai custom-jobs create \
  --display-name="llm-finetune" \
  --region=us-central1 \
  --worker-pool-spec=machine-type=a2-ultragpu-8g,replica-count=4,accelerator-type=NVIDIA_A100_80GB,accelerator-count=8,container-image-uri=us-docker.pkg.dev/PROJECT/repo/training:latest

# Create a GKE node pool with GPUs
gcloud container node-pools create gpu-pool \
  --cluster=ml-cluster \
  --zone=us-central1-a \
  --machine-type=a2-ultragpu-8g \
  --accelerator=type=nvidia-a100-80gb,count=8 \
  --num-nodes=4 \
  --min-nodes=0 \
  --max-nodes=8 \
  --enable-autoscaling \
  --spot

Azure ML Training

bash
# Create an Azure ML compute cluster
az ml compute create \
  --name gpu-cluster \
  --type AmlCompute \
  --size Standard_ND96amsr_A100_v4 \
  --min-instances 0 \
  --max-instances 8 \
  --idle-time-before-scale-down 600 \
  --resource-group ml-rg \
  --workspace-name ml-workspace

# Submit a training job
az ml job create --file training-job.yml \
  --resource-group ml-rg \
  --workspace-name ml-workspace
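The az ml job create command above references a training-job.yml. A minimal sketch of what such a command-job spec might look like (the script path and curated environment name are assumptions; consult the Azure ML job schema for the full field list):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./src                       # assumed local folder containing train.py
command: python train.py --epochs 10
environment: azureml:my-pytorch-gpu-env@latest  # assumed environment name
compute: azureml:gpu-cluster      # the cluster created above
distribution:
  type: pytorch
  process_count_per_instance: 8   # one process per GPU
resources:
  instance_count: 4
```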

Distributed Training Architecture

Large model training requires distributed training across multiple GPUs and nodes. The two main parallelism strategies are data parallelism (each GPU processes different data with the same model) and model parallelism (the model is split across GPUs). Modern frameworks like DeepSpeed, FSDP, and Megatron-LM support hybrid parallelism.
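In data parallelism the effective (global) batch size is the per-GPU micro-batch times gradient-accumulation steps times the number of GPUs, and DeepSpeed enforces exactly this relation between its batch-size settings. The DeepSpeed example later in this guide uses train_batch_size 256 with accumulation 4 across 4 nodes of 8 GPUs, which implies a micro-batch of 2 per GPU. A small sketch of the arithmetic:

```python
# Global batch size arithmetic for data-parallel training.
# Assumed layout: 4 nodes x 8 GPUs, matching the DeepSpeed example
# in this guide.

def global_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Effective batch size seen by the optimizer per update."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

world_size = 4 * 8   # nodes x GPUs per node = 32 GPUs
micro_batch = 2      # samples per GPU per forward/backward pass
grad_accum = 4       # micro-batches accumulated before each update

# DeepSpeed's train_batch_size must equal this product, or
# initialization fails with a batch-size mismatch error.
assert global_batch_size(micro_batch, grad_accum, world_size) == 256
```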

Network Requirements for Distributed Training

| Feature | AWS | Azure | GCP |
|---|---|---|---|
| GPU interconnect | NVLink/NVSwitch (intra-node) | NVLink/NVSwitch | NVLink/NVSwitch |
| Inter-node networking | EFA (400 Gbps) | InfiniBand (400 Gbps) | GPUDirect-TCPX (200 Gbps) |
| Placement groups | Cluster placement group | Proximity placement group | Compact placement policy |
| Storage for training | FSx for Lustre, S3 | Blob, NetApp Files | Filestore, GCS FUSE |

python
# PyTorch distributed training with DeepSpeed
# launch command: deepspeed --num_gpus=8 --num_nodes=4 train.py

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def train():
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

    ds_config = {
        "train_batch_size": 256,
        "train_micro_batch_size_per_gpu": 2,  # 2 x 4 accum x 32 GPUs = 256
        "gradient_accumulation_steps": 4,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": 2e-5, "weight_decay": 0.01}
        },
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {"warmup_num_steps": 100, "total_num_steps": 10000}
        }
    }

    # DeepSpeed builds the optimizer and scheduler from the config,
    # so the model parameters must be passed explicitly
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    # Dataloader construction (tokenization, batching, distributed
    # sampler) omitted for brevity
    for batch in dataloader:
        batch = {k: v.to(model_engine.device) for k, v in batch.items()}
        # Passing labels makes the causal LM return its loss directly
        outputs = model_engine(**batch, labels=batch["input_ids"])
        model_engine.backward(outputs.loss)
        model_engine.step()

if __name__ == "__main__":
    train()

Cost Optimization for GPU Workloads

| Strategy | Savings | Trade-off |
|---|---|---|
| Spot/preemptible instances | 60-90% | Interruption risk (use checkpointing) |
| Reserved instances (1yr) | 30-40% | Commitment, inflexibility |
| Right-sizing GPUs | 20-50% | May need architecture changes |
| Mixed precision (FP16/BF16) | ~2x training speed | Minor accuracy impact (usually negligible) |
| Gradient checkpointing | Larger batch sizes, fewer GPUs | ~20% slower per step |
| Multi-cloud spot arbitrage | Variable | Complexity, data transfer costs |
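Spot savings depend on surviving interruptions, which means checkpointing training state regularly and resuming from the latest checkpoint on restart. A framework-agnostic sketch using only the standard library (a real run would persist model and optimizer state dicts, e.g. with torch.save, to durable storage such as S3, GCS, or FSx):

```python
import json
import os
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # should live on durable, shared storage

def save_checkpoint(step, state):
    """Write a checkpoint atomically so an interruption mid-write
    never corrupts the latest usable checkpoint."""
    CKPT_DIR.mkdir(exist_ok=True)
    tmp = CKPT_DIR / f"step-{step}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    os.replace(tmp, CKPT_DIR / f"step-{step}.json")  # atomic rename

def latest_checkpoint():
    """Return the highest-step checkpoint to resume from, or None."""
    ckpts = sorted(CKPT_DIR.glob("step-*.json"),
                   key=lambda p: int(p.stem.split("-")[1]))
    return json.loads(ckpts[-1].read_text()) if ckpts else None

# Training loop: resume if a checkpoint exists, checkpoint every 50 steps.
resume = latest_checkpoint()
start_step = resume["step"] + 1 if resume else 0
for step in range(start_step, start_step + 100):
    # ... one training step (forward, backward, optimizer update) ...
    if step % 50 == 0:
        save_checkpoint(step, {"loss": 0.0})  # placeholder state
```

Checkpoint frequency is a trade-off: too rare and an interruption wastes hours of GPU time; too frequent and storage I/O slows every step.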

Inference Across Clouds

bash
# AWS: Deploy model with SageMaker inference
aws sagemaker create-endpoint-config \
  --endpoint-config-name "llm-endpoint-config" \
  --production-variants '[{
    "VariantName": "primary",
    "ModelName": "my-llm-model",
    "InstanceType": "ml.g5.2xlarge",
    "InitialInstanceCount": 2,
    "InitialVariantWeight": 1
  }]'

# GCP: Deploy with Vertex AI
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --model=MODEL_ID \
  --display-name="llm-deployment" \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --min-replica-count=1 \
  --max-replica-count=10

# Azure: Deploy with Azure ML
az ml online-deployment create \
  --name gpu-deployment \
  --endpoint-name llm-endpoint \
  --model azureml:my-model@latest \
  --instance-type Standard_NC24ads_A100_v4 \
  --instance-count 2

Use Managed AI Services When Possible

Before training your own models, evaluate managed AI services: AWS Bedrock, Azure OpenAI Service, GCP Vertex AI with Gemini. These services provide access to state-of-the-art models without GPU management. Fine-tuning on managed services is often cheaper and faster than training from scratch. Only train custom models when you need domain-specific capabilities that managed services do not provide.


Key Takeaways

  1. H100 and A100 instances are available on all major clouds with different pricing and availability.
  2. OCI often has better GPU availability and competitive pricing due to lower market share.
  3. Spot/preemptible instances save 60-90% but require checkpointing for training resilience.
  4. Distributed training requires EFA (AWS), InfiniBand (Azure), or GPUDirect-TCPX (GCP).

Frequently Asked Questions

Which cloud is cheapest for GPU training?
GCP and OCI tend to have lower on-demand pricing. Spot prices fluctuate across providers. Factor in data transfer costs if your data is in a different cloud.
Should I train models or use managed AI services?
Evaluate managed services first (Bedrock, Azure OpenAI, Vertex AI). Only train custom models when managed services cannot provide domain-specific capabilities.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.