Compare GPU instances for ML training and inference across all major clouds.
| Category | Feature | AWS | Azure | GCP | OCI |
|---|---|---|---|---|---|
| Instance Types | Training-Optimized Instances | P5 (H100), P4d (A100), P3 (V100) | ND H100 v5, ND A100 v4, ND A10 v5 | A3 (H100), A2 (A100) machine types | BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8 (A100) |
| Instance Types | Inference-Optimized Instances | Inf2 (Inferentia2), G5 (A10G), G6 (L4) | NC A10 v4, NV A10 v5, ND A10 v5 | G2 (L4); TPU v5e for inference | VM.GPU.A10.1/2, BM.GPU.A10.4 |
| Instance Types | Graphics / Visualization | G5 (A10G), G4dn (T4) for rendering and VDI | NV v4 (Radeon MI25), NVads A10 v5 for virtual desktops | N1 + T4/P4, G2 (L4) for rendering workloads | VM.GPU2.1 (P100), VM.GPU3.x (V100) for visualization |
| Instance Types | Entry-Level GPU | G4dn (T4) starting at $0.526/hr | NC T4 v3 starting at $0.526/hr | N1 + T4 starting at $0.35/hr (lower with preemptible) | VM.GPU2.1 (P100) starting at $1.28/hr |
| Instance Types | Max GPUs per Instance | 8x H100 (P5), 8x A100 (P4d), 16x Trainium (trn1.32xlarge) | 8x H100 (ND H100 v5), 8x A100 (ND96asr v4) | 8x H100 (A3), 16x A100 (a2-megagpu-16g), 8x L4 (G2) | 8x H100 (BM.GPU.H100.8), 8x A100 (bare metal) |
| GPU Hardware | Latest GPU Generation | NVIDIA H100 80GB SXM5 (P5) | NVIDIA H100 80GB SXM5 (ND H100 v5) | NVIDIA H100 80GB SXM5 (A3 Mega), TPU v5p | NVIDIA H100 80GB SXM5 (BM.GPU.H100.8) |
| GPU Hardware | GPU Memory per Device | H100: 80GB HBM3; A100: 40/80GB HBM2e; A10G: 24GB GDDR6 | H100: 80GB HBM3; A100: 80GB HBM2e; A10: 24GB GDDR6 | H100: 80GB HBM3; A100: 40/80GB HBM2e; L4: 24GB GDDR6 | H100: 80GB HBM3; A100: 40/80GB HBM2e; A10: 24GB GDDR6 |
| GPU Hardware | GPU Interconnect | NVSwitch + NVLink (900 GB/s per GPU on P5) | NVSwitch + NVLink (900 GB/s on ND H100 v5) | NVSwitch + NVLink; TPU ICI mesh for TPU pods | NVSwitch + NVLink (900 GB/s on BM.GPU.H100.8) |
| GPU Hardware | Network Bandwidth | P5: 3.2 Tbps EFA; P4d: 400 Gbps EFA | ND H100 v5: 3.2 Tbps InfiniBand; ND A100 v4: 1.6 Tbps InfiniBand | A3 Mega: 1.8 Tbps (9x 200G NICs); A2: 100 Gbps | BM.GPU.H100.8: 1.6 Tbps RDMA cluster networking |
| GPU Hardware | Custom Accelerators | AWS Trainium (Trn1), Inferentia2 (Inf2) custom chips | Maia AI accelerator (preview); primarily NVIDIA GPUs | TPU v5p, v5e, v4 for ML training and inference | No custom accelerators; NVIDIA GPUs only |
| Pricing & Availability | On-Demand H100 (8-GPU) | p5.48xlarge: ~$98.32/hr | ND H100 v5: ~$98.32/hr | a3-megagpu-8g: ~$98.32/hr | BM.GPU.H100.8: ~$78.65/hr |
| Pricing & Availability | Spot / Preemptible Discount | Spot Instances: up to 90% off (interruption risk) | Spot VMs: 60-80% off (eviction risk) | Spot VMs: 60-91% off (preemption risk) | Preemptible instances: ~50% off GPU shapes |
| Pricing & Availability | Reserved Pricing | 1- or 3-year Reserved Instances: 30-60% savings | 1- or 3-year Reservations: 20-57% savings | 1- or 3-year CUDs: 20-57% savings | Annual Flex commitment pricing; capacity reservations |
| Pricing & Availability | Regional Availability | H100 in us-east-1, us-west-2; A100 in 10+ regions | H100 in East US, West Europe; limited to select regions | H100 in us-central1, europe-west4; A100 in 8+ regions | H100 in US regions; A100 in multiple regions |
| Pricing & Availability | Capacity Reservations | On-Demand Capacity Reservations per AZ | Capacity Reservations; dedicated host groups | Reservations with assured capacity; future reservations | Capacity reservations with dedicated bare-metal pools |
| Software & Features | ML Framework Support | Deep Learning AMIs with PyTorch, TensorFlow, JAX, MXNet | Data Science VMs with PyTorch, TensorFlow; Azure ML | Deep Learning VM images with PyTorch, TensorFlow, JAX | Data Science platform images with PyTorch, TensorFlow |
| Software & Features | Container / Kubernetes | EKS with GPU operator, Batch, SageMaker Training | AKS with GPU node pools, Azure ML compute clusters | GKE with GPU node pools, Vertex AI training | OKE with GPU node pools, Data Science Jobs |
| Software & Features | Distributed Training | EFA for NCCL; SageMaker distributed training libraries | InfiniBand for NCCL; Azure ML distributed jobs | GPUDirect-TCPX; Vertex AI distributed training | RDMA cluster networking for NCCL; Data Science Jobs |
| Software & Features | CUDA & Driver Management | Pre-installed NVIDIA drivers in DL AMIs; manual install option | NVIDIA GPU driver extension auto-installs on VMs | GPU driver install script or Container-Optimized OS | Pre-installed drivers on GPU platform images |
| Software & Features | Multi-Instance GPU (MIG) | A100 MIG: partition into up to 7 instances | A100 MIG support on ND A100 v4 | A100 MIG support for workload isolation | A100 MIG support on bare-metal and VM shapes |
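To put the pricing rows above in perspective, here is a minimal cost sketch in Python. The rates are the illustrative on-demand figures from the table, and the spot discounts are rough midpoints of the quoted ranges, not live prices; always pull current rates from each provider's pricing page before budgeting.

```python
# Rough cost comparison for an 8x H100 instance, using the illustrative
# on-demand rates and discount ranges from the table above.
# These are NOT live prices; verify against each provider's pricing page.

ON_DEMAND_8X_H100 = {
    "AWS p5.48xlarge": 98.32,
    "Azure ND H100 v5": 98.32,
    "GCP a3-megagpu-8g": 98.32,
    "OCI BM.GPU.H100.8": 78.65,
}

# Assumed spot/preemptible discounts (midpoints of the table's ranges).
SPOT_DISCOUNT = {
    "AWS p5.48xlarge": 0.90,
    "Azure ND H100 v5": 0.70,
    "GCP a3-megagpu-8g": 0.75,
    "OCI BM.GPU.H100.8": 0.50,
}

def training_run_cost(instance: str, hours: float, spot: bool = False) -> float:
    """Estimated cost of a training run on one 8-GPU instance."""
    rate = ON_DEMAND_8X_H100[instance]
    if spot:
        rate *= 1.0 - SPOT_DISCOUNT[instance]
    return rate * hours

if __name__ == "__main__":
    for name in ON_DEMAND_8X_H100:
        od = training_run_cost(name, hours=72)
        sp = training_run_cost(name, hours=72, spot=True)
        print(f"{name}: 72h on-demand ~ ${od:,.0f}, spot ~ ${sp:,.0f}")
```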
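The Distributed Training row is easier to see in code: on all four clouds the high-bandwidth fabric (EFA, InfiniBand, GPUDirect-TCPX, RDMA) sits underneath NCCL, so a minimal PyTorch DistributedDataParallel script looks the same everywhere. The sketch below assumes a torchrun-style launcher that sets RANK, LOCAL_RANK, and WORLD_SIZE; the Linear model is a placeholder.

```python
# Minimal multi-GPU training skeleton using the NCCL backend.
# The cloud interconnect (EFA, InfiniBand, GPUDirect-TCPX, RDMA) is
# selected underneath NCCL; this Python code does not change per cloud.
# Assumes a torchrun-style launcher sets RANK, LOCAL_RANK, WORLD_SIZE.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # env:// rendezvous by default

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced over NCCL here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch with `torchrun --nproc_per_node=8 train.py` on a single 8-GPU instance; multi-node runs add `--nnodes` and a rendezvous endpoint, with the cloud fabric picked up via the provider's NCCL plugin (e.g. aws-ofi-nccl on EFA).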
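The MIG row can also be checked at runtime. Below is a sketch using the NVML Python bindings (`pip install nvidia-ml-py`) to report whether MIG mode is enabled on each device; it assumes the standard NVML calls exposed by that binding, and note that `nvmlDeviceGetName` may return bytes on older versions.

```python
# Report MIG mode per GPU via NVML (pip install nvidia-ml-py).
# A sketch: non-MIG-capable GPUs (e.g. T4, A10) raise NVMLError here.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        try:
            current, pending = pynvml.nvmlDeviceGetMigMode(handle)
            state = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
            print(f"GPU {i} ({name}): MIG {state}")
        except pynvml.NVMLError:
            print(f"GPU {i} ({name}): MIG not supported")
finally:
    pynvml.nvmlShutdown()
```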
GPU instances for machine learning training and inference are available across all major clouds, but they differ significantly in GPU types (NVIDIA A100, H100, H200, L4, T4, AMD MI300X), interconnect bandwidth, memory configurations, pricing models, and availability. AWS offers the P, G, and Inf instance families; Azure provides the NC, ND, and NV series; GCP has A2 and A3 machine types; and OCI offers bare-metal and VM GPU shapes. This comparison tool helps ML engineers and data scientists find the right GPU instances across clouds based on workload requirements, budget, and availability.
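Regional availability (see the table above) changes often enough that it is worth querying programmatically rather than trusting a static list. Here is a sketch for AWS using boto3's `describe_instance_type_offerings` call; `p5.48xlarge` and `us-east-1` are example values, and the other clouds have rough equivalents (`gcloud compute accelerator-types list`, `az vm list-skus`).

```python
# List which Availability Zones in a region offer a given GPU instance type.
# AWS example with boto3; p5.48xlarge and us-east-1 are example values.
# Requires AWS credentials with ec2:DescribeInstanceTypeOfferings permission.
import boto3

def zones_offering(instance_type: str, region: str) -> list[str]:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return sorted(o["Location"] for o in resp["InstanceTypeOfferings"])

if __name__ == "__main__":
    print(zones_offering("p5.48xlarge", "us-east-1"))
```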
Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, GCP, and OCI are trademarks of their respective owners.