Compare GPU instances for ML training and inference across all major clouds.
Last verified: May 2026
| Feature | Category | AWS | Azure | GCP | OCI |
|---|---|---|---|---|---|
| Training-Optimized Instances | Instance Types | P5 (H100), P4d (A100), P3 (V100) | ND H100 v5, ND A100 v4, ND A10 v5 | A3 (H100), A2 (A100) machine types | BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8 (A100) |
| Inference-Optimized Instances | Instance Types | Inf2 (Inferentia2), G5 (A10G), G6 (L4) | NC A10 v4, NV A10 v5, ND A10 v5 | G2 (L4), custom TPU v5e for inference | VM.GPU.A10.1/2, BM.GPU.A10.4 |
| Graphics / Visualization | Instance Types | G5 (A10G), G4dn (T4) for rendering and VDI | NV v4 (Radeon MI25), NVads A10 v5 for virtual desktops | N1 + T4/P4, G2 (L4) for rendering workloads | VM.GPU2.1 (P100), VM.GPU3.x (V100) for visualization |
| Entry-Level GPU | Instance Types | G4dn (T4) starting at $0.526/hr | NC T4 v3 starting at $0.526/hr | N1 + T4 starting at $0.35/hr (preemptible lower) | VM.GPU2.1 (P100) starting at $1.28/hr |
| Max GPUs per Instance | Instance Types | 8x H100 (P5), 8x A100 (P4d), 16x Trainium (Trn1.32xl) | 8x H100 (ND H100 v5), 8x A100 (ND96asr v4) | 8x H100 (A3), 8x A100 (A2-megagpu), 8x L4 (G2) | 8x H100 (BM.GPU.H100.8), 8x A100 bare metal |
| Latest GPU Generation | GPU Hardware | NVIDIA H100 80GB SXM5 (P5 instances) | NVIDIA H100 80GB SXM5 (ND H100 v5) | NVIDIA H100 80GB SXM5 (A3 Mega), TPU v5p | NVIDIA H100 80GB SXM5 (BM.GPU.H100.8) |
| GPU Memory per Device | GPU Hardware | H100: 80GB HBM3, A100: 40/80GB HBM2e, A10G: 24GB GDDR6 | H100: 80GB HBM3, A100: 80GB HBM2e, A10: 24GB GDDR6 | H100: 80GB HBM3, A100: 40/80GB HBM2e, L4: 24GB GDDR6 | H100: 80GB HBM3, A100: 40/80GB HBM2e, A10: 24GB GDDR6 |
| GPU Interconnect | GPU Hardware | NVSwitch + NVLink (900 GB/s per GPU on P5) | NVSwitch + NVLink (900 GB/s on ND H100 v5) | NVSwitch + NVLink; TPU ICI mesh for TPU pods | NVSwitch + NVLink (900 GB/s on BM.GPU.H100.8) |
| Network Bandwidth | GPU Hardware | P5: 3.2 Tbps EFA; P4d: 400 Gbps EFA | ND H100 v5: 3.2 Tbps InfiniBand; ND A100: 1.6 Tbps IB | A3 Mega: 1.8 Tbps (9x200G NICs); A2: 100 Gbps | BM.GPU.H100.8: 1.6 Tbps RDMA cluster networking |
| Custom Accelerators | GPU Hardware | AWS Trainium (Trn1), Inferentia 2 (Inf2) custom chips | Maia AI accelerator (preview); primarily NVIDIA GPUs | TPU v5p, v5e, v4 for ML training and inference | No custom accelerators; NVIDIA GPU only |
| On-Demand H100 (8-GPU) | Pricing & Availability | P5.48xlarge: ~$98.32/hr | ND H100 v5: ~$98.32/hr | A3-megagpu-8g: ~$98.32/hr | BM.GPU.H100.8: ~$78.65/hr |
| Spot / Preemptible Discount | Pricing & Availability | Spot Instances: up to 90% off (interruption risk) | Spot VMs: up to 60-80% off (eviction risk) | Spot VMs: 60-91% discount (preemption risk) | Preemptible instances: ~50% discount on GPU shapes |
| Reserved Pricing | Pricing & Availability | 1-year or 3-year Reserved Instances: 30-60% savings | 1-year or 3-year Reservations: 20-57% savings | 1-year or 3-year CUDs: 20-57% savings | Annual Flex commitment pricing; capacity reservations |
| Regional Availability | Pricing & Availability | H100 in us-east-1, us-west-2; A100 in 10+ regions | H100 in East US, West Europe; limited to select regions | H100 in us-central1, europe-west4; A100 in 8+ regions | H100 in US regions; A100 in multiple regions |
| Capacity Reservations | Pricing & Availability | On-Demand Capacity Reservations per AZ | Capacity Reservations; dedicated host groups | Reservations with assured capacity; future reservations | Capacity reservations with dedicated bare-metal pools |
| ML Framework Support | Software & Features | Deep Learning AMIs with PyTorch, TensorFlow, JAX, MXNet | Data Science VMs with PyTorch, TensorFlow; Azure ML | Deep Learning VMs with PyTorch, TensorFlow, JAX | Data Science platform images with PyTorch, TensorFlow |
| Container / Kubernetes | Software & Features | EKS with GPU operator, Batch, SageMaker Training | AKS with GPU node pools, Azure ML compute clusters | GKE with GPU node pools, Vertex AI training | OKE with GPU node pools, Data Science Jobs |
| Distributed Training | Software & Features | EFA for NCCL; SageMaker distributed training libraries | InfiniBand for NCCL; Azure ML distributed jobs | GPUDirect-TCPX; Vertex AI distributed training | RDMA cluster networking for NCCL; Data Science jobs |
| CUDA & Driver Management | Software & Features | Pre-installed NVIDIA drivers in DL AMIs; manual install option | NVIDIA GPU driver extension auto-install on VMs | GPU driver install script or Container-Optimized OS | Pre-installed drivers on GPU platform images |
| Multi-Instance GPU (MIG) | Software & Features | A100 MIG support for partitioning into 7 instances | A100 MIG support on ND A100 v4 | A100 MIG support for workload isolation | A100 MIG support on bare metal and VM shapes |
The compare tool catalogs GPU instance offerings across all four clouds: GPU type (H100, A100, H200, L4, T4, V100), GPU count per instance, GPU memory (40GB/80GB), system RAM, inter-GPU interconnect (NVLink, NVSwitch), inter-node networking (EFA, GPUDirect-TCPX, InfiniBand), local NVMe storage, on-demand and spot pricing per region, and reserved capacity availability. Side-by-side tables surface cost-effective options for specific workloads.
GPU instances for machine learning training and inference are available across all major clouds, but they differ significantly in GPU types (NVIDIA A100, H100, H200, L4, T4, AMD MI300X), interconnect bandwidth, memory configurations, pricing models, and availability. AWS offers P, G, and Inf instance families, Azure provides NC, ND, and NV series, GCP has A2 and A3 machine types, and OCI offers GPU shapes with flexible OCPUs. This comparison tool helps ML engineers and data scientists find the right GPU instances across clouds based on workload requirements, budget, and availability.
Example scenario: your ML team needs to train a 70B-parameter model, and the workload requires 8x H100 80GB GPUs with high inter-GPU bandwidth. The compare tool shows: AWS p5.48xlarge ($98/hr on-demand, $40/hr spot), Azure ND H100 v5 ($110/hr on-demand, no spot), GCP a3-highgpu-8g ($88/hr on-demand, $35/hr spot), OCI BM.GPU.H100.8 ($82/hr on-demand). For a 30-day (720-hour) training run on spot capacity, GCP wins at ~$25K total, versus ~$29K on AWS and ~$79K on Azure (on-demand, since no spot is available). The team chooses GCP, builds checkpointing into the training loop, and runs the job for about 68% less than they originally budgeted on Azure.
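The arithmetic behind this scenario can be reproduced with a short sketch. It uses only the hourly rates quoted in the scenario, assumes a flat 720-hour run, and ignores any extra wall-clock time caused by spot interruptions; the function and variable names are illustrative, not part of the tool.

```python
HOURS = 30 * 24  # 30-day run = 720 hours

# (on-demand $/hr, spot $/hr or None when no spot price is listed in the scenario)
rates = {
    "AWS p5.48xlarge": (98.0, 40.0),
    "Azure ND H100 v5": (110.0, None),
    "GCP a3-highgpu-8g": (88.0, 35.0),
    "OCI BM.GPU.H100.8": (82.0, None),
}

def run_cost(on_demand, spot, hours=HOURS):
    """Cost of the run at the spot rate when one is listed, else on-demand."""
    rate = spot if spot is not None else on_demand
    return rate * hours

costs = {name: run_cost(od, sp) for name, (od, sp) in rates.items()}

# Print providers from cheapest to most expensive.
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:,.0f}")

cheapest = min(costs, key=costs.get)
saving_vs_azure = 1 - costs["GCP a3-highgpu-8g"] / costs["Azure ND H100 v5"]
print(f"Cheapest: {cheapest}; saving vs Azure on-demand: {saving_vs_azure:.0%}")
```

Swapping in current prices from the table, or adding an interruption overhead factor, only requires changing the `rates` dictionary or the `run_cost` function.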
GPU instance availability is the biggest hidden constraint. H100 instances are often capacity-constrained on AWS and Azure, while GCP and OCI generally have better availability for cutting-edge GPUs. For training jobs that need a specific instance type at a specific time, plan ahead with reserved instances or capacity reservations: on-demand H100 requests in popular regions can fail to provision for hours at a time.
For distributed training, interconnect bandwidth often matters more than per-GPU FLOPs. AWS P5 instances with 3,200 Gbps of EFA inter-node networking outperform earlier A100-based P4d instances, which have 400 Gbps of EFA. For single-node 8-GPU training, NVLink-equipped instances are essential; cross-node training needs RDMA-capable networking.
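To get a feel for why inter-node bandwidth matters, here is a rough, bandwidth-only estimate of one gradient all-reduce across nodes. It uses the standard ring all-reduce traffic volume of 2(N-1)/N times the payload per participant, treats each node as one participant, and ignores latency, compute overlap, and intra-node NVLink hops. The 140 GB payload (70B parameters at 2 bytes each) and the 4-node cluster are assumptions chosen for illustration.

```python
def ring_allreduce_seconds(payload_bytes, nodes, link_gbps):
    """Bandwidth-only lower bound for a ring all-reduce across `nodes`.

    Each participant sends and receives 2 * (nodes - 1) / nodes * payload bytes.
    """
    if nodes < 2:
        return 0.0  # nothing to exchange with a single node
    bytes_per_second = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    traffic = 2 * (nodes - 1) / nodes * payload_bytes
    return traffic / bytes_per_second

PAYLOAD = 140e9  # assumed: 70B parameters * 2 bytes (bf16 gradients)
NODES = 4        # assumed cluster size

fast = ring_allreduce_seconds(PAYLOAD, NODES, 3200)  # 3,200 Gbps per node
slow = ring_allreduce_seconds(PAYLOAD, NODES, 400)   # 400 Gbps per node

print(f"3,200 Gbps: {fast:.3f} s per all-reduce")
print(f"  400 Gbps: {slow:.3f} s per all-reduce")
print(f"Slowdown factor: {slow / fast:.1f}x")
```

Real frameworks overlap communication with computation and often shard gradients, so measured step times will differ; the point of the sketch is only that the communication floor scales inversely with link bandwidth.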
Spot/preemptible GPU pricing is the biggest cost lever when your workload can be checkpointed. ML training jobs that checkpoint every 30 minutes can use spot GPUs at a 60-70% discount, at the price of perhaps 10-15% extra wall-clock time from occasional interruptions. Net effect: the same job at roughly a third to a half of the on-demand cost.
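The net effect can be worked out directly. The sketch below assumes that interruptions simply stretch the billed hours by a fixed fraction and that the spot discount applies uniformly for the whole run; both are simplifications, and the function name is illustrative.

```python
def spot_relative_cost(discount, extra_walltime):
    """Spot cost as a fraction of on-demand cost for the same job.

    discount: fractional spot discount, e.g. 0.65 for 65% off
    extra_walltime: fractional extra runtime from interruptions, e.g. 0.10
    """
    return (1 - discount) * (1 + extra_walltime)

# Corner cases of the ranges quoted above.
worst = spot_relative_cost(discount=0.60, extra_walltime=0.15)
best = spot_relative_cost(discount=0.70, extra_walltime=0.10)

print(f"Worst case: {worst:.0%} of on-demand cost")
print(f"Best case:  {best:.0%} of on-demand cost")
```

Under these assumptions the job costs between about a third and a little under half of the on-demand price, even after paying for the extra runtime.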
Pricing depends on the GPU type and commitment level. For on-demand A100 instances, OCI is often the most affordable option. For spot/preemptible pricing, AWS and GCP typically offer 60-70% discounts, with interruption risk. Azure has limited spot availability for GPU VMs. For reserved commitments (1-3 years), AWS, Azure, and GCP offer roughly 20-60% savings depending on provider and term. Total cost also depends on networking: multi-GPU training jobs on AWS P5 instances benefit from EFA, while GCP A3 instances use GPUDirect-TCPX. Compare the complete cost, including networking and storage, for your specific workload.
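One way to reason about commitment level is a break-even utilization. The sketch below assumes a reservation is billed for every hour of the term at a discounted rate, while on-demand is billed only for hours actually used; under that assumption a reservation pays off once utilization exceeds one minus the savings fraction. Real commitment products have more terms (upfront payments, flexibility rules), so treat this as a first approximation with illustrative names.

```python
def break_even_utilization(savings):
    """Fraction of the term the instance must run for a reservation to win.

    savings: fractional discount of the reserved rate vs on-demand, e.g. 0.30
    """
    return 1 - savings

def cheaper_option(utilization, savings):
    """Compare cost per term hour, normalized to an on-demand rate of 1.0."""
    on_demand_cost = utilization   # pay only for the hours used
    reserved_cost = 1 - savings    # pay the discounted rate for every hour
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"

for savings in (0.30, 0.60):
    threshold = break_even_utilization(savings)
    print(f"{savings:.0%} savings: reserve if utilization exceeds {threshold:.0%}")

print(cheaper_option(utilization=0.50, savings=0.30))
print(cheaper_option(utilization=0.80, savings=0.30))
```

A team that expects its GPUs to sit idle half the time needs a deep discount before a reservation beats on-demand, which is why utilization forecasts matter as much as the headline savings.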
Training workloads need high-memory GPUs (A100 80GB, H100 80GB) with fast inter-GPU interconnect (NVLink, NVSwitch) and high network bandwidth (EFA, GPUDirect). For training, use P5/P4d (AWS), ND H100/ND A100 (Azure), A3/A2 (GCP), or BM.GPU.H100.8/BM.GPU.A100-v2.8 (OCI). Inference workloads prioritize cost per inference and may use smaller GPUs like T4 or L4, or specialized accelerators like AWS Inferentia (Inf2). For batch inference with variable demand, consider spot instances. For real-time inference, use on-demand instances or reserved capacity with auto-scaling.
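The guidance above can be encoded as a small lookup, shown here as a sketch. The function name, argument names, and return shape are hypothetical; the fallback of plain on-demand for steady batch inference is an assumption not stated in the text, and no purchase option is returned for training because the guidance does not give one.

```python
def recommend(workload, realtime=False, variable_demand=False):
    """Return a coarse GPU class and purchase option for a workload type."""
    if workload == "training":
        return {
            "gpu_class": "A100 80GB or H100 80GB with NVLink/NVSwitch",
            "families": {
                "AWS": "P5/P4d",
                "Azure": "ND H100/ND A100",
                "GCP": "A3/A2",
            },
            "purchase": None,  # not specified by the guidance above
        }
    if workload == "inference":
        if realtime:
            purchase = "on-demand or reserved capacity with auto-scaling"
        elif variable_demand:
            purchase = "spot"  # batch inference with variable demand
        else:
            purchase = "on-demand"  # assumption: default for steady batch jobs
        return {
            "gpu_class": "T4, L4, or AWS Inferentia (Inf2)",
            "families": {},
            "purchase": purchase,
        }
    raise ValueError(f"unknown workload: {workload!r}")

print(recommend("training")["families"])
print(recommend("inference", variable_demand=True)["purchase"])
print(recommend("inference", realtime=True)["purchase"])
```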