Multi-Cloud GPU Instance Compare

Compare GPU instances for ML training and inference across all major clouds.

Last verified: May 2026

Feature	Category	AWS	Azure	GCP	OCI
Training-Optimized Instances	Instance Types	P5 (H100), P4d (A100), P3 (V100)	ND H100 v5, ND A100 v4, ND A10 v5	A3 (H100), A2 (A100) machine types	BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8 (A100)
Inference-Optimized Instances	Instance Types	Inf2 (Inferentia2), G5 (A10G), G6 (L4)	NC A10 v4, NV A10 v5, ND A10 v5	G2 (L4), custom TPU v5e for inference	VM.GPU.A10.1/2, BM.GPU.A10.4
Graphics / Visualization	Instance Types	G5 (A10G), G4dn (T4) for rendering and VDI	NV v4 (Radeon MI25), NVads A10 v5 for virtual desktops	N1 + T4/P4, G2 (L4) for rendering workloads	VM.GPU2.1 (P100), VM.GPU3.x (V100) for visualization
Entry-Level GPU	Instance Types	G4dn (T4) starting at $0.526/hr	NC T4 v3 starting at $0.526/hr	N1 + T4 starting at $0.35/hr (preemptible lower)	VM.GPU2.1 (P100) starting at $1.28/hr
Max GPUs per Instance	Instance Types	8x H100 (P5), 8x A100 (P4d), 16x Trainium (Trn1.32xl)	8x H100 (ND H100 v5), 8x A100 (ND96asr v4)	8x H100 (A3), 8x A100 (A2-megagpu), 8x L4 (G2)	8x H100 (BM.GPU.H100.8), 8x A100 bare metal
Latest GPU Generation	GPU Hardware	NVIDIA H100 80GB SXM5 (P5 instances)	NVIDIA H100 80GB SXM5 (ND H100 v5)	NVIDIA H100 80GB SXM5 (A3 Mega), TPU v5p	NVIDIA H100 80GB SXM5 (BM.GPU.H100.8)
GPU Memory per Device	GPU Hardware	H100: 80GB HBM3, A100: 40/80GB HBM2e, A10G: 24GB GDDR6	H100: 80GB HBM3, A100: 80GB HBM2e, A10: 24GB GDDR6	H100: 80GB HBM3, A100: 40/80GB HBM2e, L4: 24GB GDDR6	H100: 80GB HBM3, A100: 40/80GB HBM2e, A10: 24GB GDDR6
GPU Interconnect	GPU Hardware	NVSwitch + NVLink (900 GB/s per GPU on P5)	NVSwitch + NVLink (900 GB/s on ND H100 v5)	NVSwitch + NVLink; TPU ICI mesh for TPU pods	NVSwitch + NVLink (900 GB/s on BM.GPU.H100.8)
Network Bandwidth	GPU Hardware	P5: 3.2 Tbps EFA; P4d: 400 Gbps EFA	ND H100 v5: 3.2 Tbps InfiniBand; ND A100: 1.6 Tbps IB	A3 Mega: 1.8 Tbps (9x200G NICs); A2: 100 Gbps	BM.GPU.H100.8: 1.6 Tbps RDMA cluster networking
Custom Accelerators	GPU Hardware	AWS Trainium (Trn1), Inferentia 2 (Inf2) custom chips	Maia AI accelerator (preview); primarily NVIDIA GPUs	TPU v5p, v5e, v4 for ML training and inference	No custom accelerators; NVIDIA GPU only
On-Demand H100 (8-GPU)	Pricing & Availability	P5.48xlarge: ~$98.32/hr	ND H100 v5: ~$98.32/hr	A3-megagpu-8g: ~$98.32/hr	BM.GPU.H100.8: ~$78.65/hr
Spot / Preemptible Discount	Pricing & Availability	Spot Instances: up to 90% off (interruption risk)	Spot VMs: up to 60-80% off (eviction risk)	Spot VMs: 60-91% discount (preemption risk)	Preemptible instances: ~50% discount on GPU shapes
Reserved Pricing	Pricing & Availability	1-year or 3-year Reserved Instances: 30-60% savings	1-year or 3-year Reservations: 20-57% savings	1-year or 3-year CUDs: 20-57% savings	Annual Flex commitment pricing; capacity reservations
Regional Availability	Pricing & Availability	H100 in us-east-1, us-west-2; A100 in 10+ regions	H100 in East US, West Europe; limited to select regions	H100 in us-central1, europe-west4; A100 in 8+ regions	H100 in US regions; A100 in multiple regions
Capacity Reservations	Pricing & Availability	On-Demand Capacity Reservations per AZ	Capacity Reservations; dedicated host groups	Reservations with assured capacity; future reservations	Capacity reservations with dedicated bare-metal pools
ML Framework Support	Software & Features	Deep Learning AMIs with PyTorch, TensorFlow, JAX, MXNet	Data Science VMs with PyTorch, TensorFlow; Azure ML	Deep Learning VMs with PyTorch, TensorFlow, JAX	Data Science platform images with PyTorch, TensorFlow
Container / Kubernetes	Software & Features	EKS with GPU operator, Batch, SageMaker Training	AKS with GPU node pools, Azure ML compute clusters	GKE with GPU node pools, Vertex AI training	OKE with GPU node pools, Data Science Jobs
Distributed Training	Software & Features	EFA for NCCL; SageMaker distributed training libraries	InfiniBand for NCCL; Azure ML distributed jobs	GPUDirect-TCPX; Vertex AI distributed training	RDMA cluster networking for NCCL; Data Science jobs
CUDA & Driver Management	Software & Features	Pre-installed NVIDIA drivers in DL AMIs; manual install option	NVIDIA GPU driver extension auto-install on VMs	GPU driver install script or Container-Optimized OS	Pre-installed drivers on GPU platform images
Multi-Instance GPU (MIG)	Software & Features	A100 MIG support for partitioning into 7 instances	A100 MIG support on ND A100 v4	A100 MIG support for workload isolation	A100 MIG support on bare metal and VM shapes

How This Tool Works

The compare tool catalogs GPU instance offerings across all four clouds: GPU type (H100, A100, H200, L4, T4, V100), GPU count per instance, GPU memory (40GB/80GB), system RAM, inter-GPU interconnect (NVLink, NVSwitch), inter-node networking (EFA, GPUDirect-TCPX, InfiniBand), local NVMe storage, on-demand and spot pricing per region, and reserved capacity availability. Side-by-side tables surface cost-effective options for specific workloads.
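As a rough illustration of the kind of record such a catalog might hold, the sketch below models one instance as a small data class and ranks matches by hourly price. The class name, field names, and the `cheapest_matching` helper are hypothetical and are not the tool's actual schema; the hourly prices are the placeholder figures from the "A Real Example" section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuInstance:
    """One catalog row. Field names are illustrative, not the tool's schema."""
    cloud: str
    name: str
    gpu_type: str
    gpu_count: int
    gpu_memory_gb: int
    on_demand_usd_per_hr: float
    spot_usd_per_hr: Optional[float] = None  # None when no spot price is listed


def cheapest_matching(catalog, gpu_type, min_gpus, use_spot=False):
    """Return matching instances sorted by hourly price, cheapest first."""
    def price(inst):
        # Fall back to on-demand when spot is not requested or not offered.
        if use_spot and inst.spot_usd_per_hr is not None:
            return inst.spot_usd_per_hr
        return inst.on_demand_usd_per_hr

    matches = [i for i in catalog
               if i.gpu_type == gpu_type and i.gpu_count >= min_gpus]
    return sorted(matches, key=price)


# Placeholder prices copied from the worked example on this page.
CATALOG = [
    GpuInstance("AWS", "p5.48xlarge", "H100", 8, 80, 98.0, 40.0),
    GpuInstance("Azure", "ND H100 v5", "H100", 8, 80, 110.0),
    GpuInstance("GCP", "a3-highgpu-8g", "H100", 8, 80, 88.0, 35.0),
    GpuInstance("OCI", "BM.GPU.H100.8", "H100", 8, 80, 82.0),
]

ranked = cheapest_matching(CATALOG, "H100", min_gpus=8, use_spot=True)
print([inst.cloud for inst in ranked])
```

A real catalog would also carry region, interconnect, and storage fields; they are left out here to keep the sketch short.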

Overview

GPU instances for machine learning training and inference are available across all major clouds, but they differ significantly in GPU types (NVIDIA A100, H100, H200, L4, T4, AMD MI300X), interconnect bandwidth, memory configurations, pricing models, and availability. AWS offers the P, G, Inf, and Trn instance families, Azure provides the NC, ND, and NV series, GCP has the A2, A3, and G2 machine types, and OCI offers GPU shapes with flexible OCPUs. This comparison tool helps ML engineers and data scientists find the right GPU instances across clouds based on workload requirements, budget, and availability.

How Engineers Use This

• Comparing H100 and A100 GPU instances across AWS (p5, p4d), Azure (ND H100, ND A100), GCP (a3, a2), and OCI for large-scale training jobs
• Finding the most cost-effective GPU instances for inference workloads that need T4 or L4 GPUs
• Evaluating GPU instance pricing with reserved, spot, and on-demand options across all clouds

A Real Example

Your ML team needs to train a 70B parameter model, and the workload requires 8x H100 80GB GPUs with high inter-GPU bandwidth. The compare tool shows: AWS p5.48xlarge ($98/hr on-demand, $40/hr spot), Azure ND H100 v5 ($110/hr on-demand, no spot), GCP a3-highgpu-8g ($88/hr on-demand, $35/hr spot), OCI BM.GPU.H100.8 ($82/hr on-demand). For a 30-day (720-hour) training run using spot capacity where it is offered, GCP wins at ~$25K total vs AWS at ~$29K, OCI at ~$59K (on-demand), and Azure at ~$79K (no spot). The team chooses GCP, builds checkpointing into the training loop, and runs the job for about 68% less than they originally budgeted on Azure.
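The totals in this example can be reproduced with a few lines of arithmetic. The sketch below assumes the hourly rates quoted above, a run of exactly 720 hours, and no extra hours from spot interruptions; the `run_cost` helper is hypothetical. On those assumptions the saving of GCP spot relative to Azure on-demand works out to roughly 68%.

```python
HOURS = 30 * 24  # a 30-day run is 720 hours

# (on-demand $/hr, spot $/hr or None), as quoted in the example above
rates = {
    "AWS p5.48xlarge": (98.0, 40.0),
    "Azure ND H100 v5": (110.0, None),
    "GCP a3-highgpu-8g": (88.0, 35.0),
    "OCI BM.GPU.H100.8": (82.0, None),
}


def run_cost(on_demand, spot, hours=HOURS):
    """Cost of the run at the spot rate when one exists, else on-demand."""
    rate = spot if spot is not None else on_demand
    return rate * hours


totals = {name: run_cost(od, spot) for name, (od, spot) in rates.items()}
for name, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${total:,.0f}")

# Relative saving of the cheapest option against the Azure baseline.
saving = 1 - totals["GCP a3-highgpu-8g"] / totals["Azure ND H100 v5"]
print(f"GCP spot vs Azure on-demand: {saving:.0%} lower")
```

Interruptions add wall time on spot capacity, so the real gap is somewhat narrower than this straight multiplication suggests.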

Tips & Gotchas

TIP

GPU instance availability is the biggest hidden constraint. H100 instances are frequently capacity-constrained on AWS and Azure, while GCP and OCI generally have better availability for the newest GPUs. For training jobs that need a specific instance type at a specific time, plan ahead with reserved instances or capacity reservations: on-demand H100 requests in popular regions can fail to provision for hours at a time.

TIP

Inter-GPU interconnect bandwidth can matter more than per-GPU FLOPs for distributed training, because data-parallel training synchronizes gradients across devices on every step. AWS p5 instances pair EFA with 3,200 Gbps of inter-node networking, eight times the 400 Gbps available on the A100-based p4d. Within a single 8-GPU instance, NVLink is essential; cross-node training needs RDMA-capable networking.
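To get a feel for why network bandwidth matters, the sketch below estimates a lower bound on the time for one ring all-reduce, in which each node sends and receives 2(N-1)/N of the payload. It is a deliberately simplified model: it ignores latency, compute/communication overlap, and the hierarchical algorithms that libraries such as NCCL actually use. The 140 GB payload (70B parameters of 2-byte gradients) and the 4-node cluster are assumptions chosen for illustration.

```python
def ring_allreduce_seconds(payload_gb, nodes, link_gbps):
    """Bandwidth-only lower bound for one ring all-reduce.

    Each node transfers 2 * (nodes - 1) / nodes of the payload.
    Latency and overlap with compute are ignored.
    """
    link_gb_per_sec = link_gbps / 8  # gigabits/s -> gigabytes/s
    traffic_gb = 2 * (nodes - 1) / nodes * payload_gb
    return traffic_gb / link_gb_per_sec


payload_gb = 140  # assumption: 70B parameters, 2 bytes per gradient
for label, gbps in [("3,200 Gbps", 3200), ("400 Gbps", 400)]:
    seconds = ring_allreduce_seconds(payload_gb, nodes=4, link_gbps=gbps)
    print(f"{label}: {seconds:.3f} s per full-gradient all-reduce")
```

Under this model the time scales inversely with link bandwidth, so an 8x faster network cuts the synchronization floor by 8x regardless of GPU speed.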

TIP

Spot/preemptible GPU pricing is the biggest cost lever when your workload can checkpoint. ML training jobs that checkpoint every 30 minutes can use spot GPUs at a 60-70% discount, at the price of perhaps 10-15% extra wall time from occasional interruptions. Net effect: the same job for roughly a third to a half of the on-demand cost.
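The net effect follows from multiplying the discounted rate by the longer run time. A minimal sketch, assuming the discount and overhead ranges quoted above (the `spot_cost_ratio` helper is hypothetical):

```python
def spot_cost_ratio(discount, extra_walltime):
    """Spot cost as a fraction of on-demand cost for the same job."""
    return (1 - discount) * (1 + extra_walltime)


# Worst and best corners of the ranges quoted in the tip above.
for discount, overhead in [(0.60, 0.15), (0.70, 0.10)]:
    ratio = spot_cost_ratio(discount, overhead)
    print(f"{discount:.0%} discount, {overhead:.0%} extra wall time "
          f"-> {ratio:.0%} of on-demand cost")
```

With these inputs the spot run lands between about 33% and 46% of the on-demand cost.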

Questions & Answers

Which cloud offers the best GPU instance pricing?

Pricing depends on the GPU type and commitment level. For on-demand A100 instances, OCI is often the most affordable; the 8-GPU H100 row in the table above shows the same pattern (~$78.65/hr on OCI versus ~$98.32/hr on the other three clouds). For spot/preemptible pricing, AWS and GCP typically offer 60-70% discounts, with headline maximums around 90%, in exchange for interruption risk. Azure has limited spot availability for GPU VMs. For reserved commitments (1-3 years), savings range from roughly 20% to 60% depending on the cloud and term. The total cost also depends on interconnect: multi-GPU training jobs on AWS p5 instances benefit from EFA networking, while GCP a3 instances use GPUDirect-TCPX. Compare the complete cost, including networking and storage, for your specific workload.

How do I choose between training and inference GPU instances?

Training workloads need high-memory GPUs (A100 80GB, H100 80GB) with fast inter-GPU interconnect (NVLink, NVSwitch) and high network bandwidth (EFA, GPUDirect, InfiniBand, or RDMA cluster networking). Use P5/P4d (AWS), ND H100/ND A100 (Azure), A3/A2 (GCP), or BM.GPU.H100.8/BM.GPU.A100-v2.8 (OCI) for training. Inference workloads prioritize cost per inference and may use smaller GPUs like T4 or L4, or specialized accelerators like AWS Inferentia (Inf2). For batch inference with variable demand, consider spot instances. For real-time inference, use on-demand instances or reserved capacity with auto-scaling.

Related Learning Guides

Kubernetes Comparison: EKS vs AKS vs GKE (25 min read)

Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, and GCP are trademarks of their respective owners.