GPU Cloud Pricing for ML Training: A100 vs H100 Across Clouds
Comparing NVIDIA GPU instance pricing, availability, spot discounts, and reserved capacity across AWS, Azure, GCP, and OCI.
The GPU Cloud Landscape in 2026
GPU compute is the most expensive and most scarce resource in cloud computing. The explosion of large language models, diffusion models, and other AI/ML workloads has created unprecedented demand for GPU instances, particularly those equipped with NVIDIA A100 and H100 GPUs. Cloud providers have responded by expanding GPU capacity, introducing new pricing models, and partnering with NVIDIA on next-generation hardware. But the pricing, availability, and configuration of GPU instances vary dramatically across providers, and the wrong choice can cost tens of thousands of dollars per training run.
This article compares GPU cloud pricing for ML training across AWS, Azure, GCP, and OCI. We cover the most commonly used GPU instance types, their on-demand and spot pricing, reservation options, practical availability, and the total cost of typical training workloads. Whether you are training a small fine-tuning job or a large foundation model, this guide will help you choose the most cost-effective GPU provider for your needs.
NVIDIA A100 Instances
The NVIDIA A100 GPU, based on the Ampere architecture, remains widely available across all four major clouds and is the workhorse for many ML training workloads. Each A100 provides 80 GB of HBM2e memory (the 40 GB variant is being phased out), 312 TFLOPS of dense FP16 Tensor Core performance (624 TFLOPS with sparsity), and 600 GB/s NVLink bandwidth for multi-GPU communication.
On AWS, A100 instances are available as p4d.24xlarge (8x A100 40GB, $32.77/hour) and p4de.24xlarge (8x A100 80GB, $40.97/hour). These full-machine instances include 96 vCPUs, 1,152 GB RAM, and 400 Gbps EFA networking. The high per-hour cost means a 24-hour training run on a p4d.24xlarge costs approximately $787. On-demand availability has improved since 2024 but still requires capacity planning in popular regions. Spot pricing for p4d instances averages a 60-70 percent discount when available, but spot capacity is unreliable for long training runs.
On Azure, A100 instances are available as Standard_ND96asr_v4 (8x A100 40GB, approximately $27.20/hour) and Standard_ND96amsr_A100_v4 (8x A100 80GB, approximately $32.77/hour). Azure's A100 pricing is competitive with AWS, and availability has improved with expanded data center capacity. Azure also offers NC A100 v4 series with 1-4 A100 GPUs per instance for smaller training jobs that do not need 8 GPUs.
On GCP, A100 instances are available through the A2 machine family. The a2-highgpu-8g instance provides 8x A100 40GB at approximately $29.39/hour, while the a2-ultragpu variants carry the 80 GB A100. GCP offers the most flexible A100 configurations, with 1, 2, 4, 8, or 16 GPUs per instance (the 16-GPU a2-megagpu-16g is a single VM, not a multi-node configuration). Sustained-use discounts apply automatically, reducing the effective cost by up to 30 percent for training runs that span most of a month. Committed-use discounts of up to 57 percent are available for one- or three-year commitments.
On OCI, A100 instances are available as BM.GPU.A100-v2.8 (8x A100 80GB, approximately $22.00/hour). OCI's A100 pricing is the lowest among the four providers, reflecting Oracle's aggressive pricing strategy to compete for ML workloads. Availability is more limited than the other three providers, with fewer regions offering GPU instances, but OCI has been expanding GPU capacity rapidly.
A100 cost comparison for 100 GPU-hours
AWS p4d.24xlarge: $409 on-demand, ~$143 spot
Azure ND96asr_v4: $340 on-demand, ~$102 spot
GCP a2-highgpu-8g: $367 on-demand, ~$110 spot
OCI BM.GPU.A100-v2.8: $275 on-demand
For workloads that can tolerate interruption, Azure and GCP spot pricing offers the best value. For uninterruptible workloads, OCI on-demand pricing is the lowest.
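The per-provider totals follow from a single conversion: 100 GPU-hours on an 8-GPU instance is 12.5 instance-hours, multiplied by the hourly rate. A quick sketch using the approximate on-demand rates quoted above (verify current pricing before relying on these figures):

```python
# Convert GPU-hours to instance-hours and price out a training workload.
# Hourly rates are the approximate on-demand figures quoted in this article.

GPUS_PER_INSTANCE = 8

def training_cost(gpu_hours: float, hourly_rate: float,
                  gpus_per_instance: int = GPUS_PER_INSTANCE) -> float:
    """Cost of a workload measured in GPU-hours on a multi-GPU instance."""
    instance_hours = gpu_hours / gpus_per_instance
    return instance_hours * hourly_rate

rates = {  # USD per instance-hour, 8x A100, approximate
    "AWS p4d.24xlarge": 32.77,
    "Azure ND96asr_v4": 27.20,
    "GCP a2-highgpu-8g": 29.39,
    "OCI BM.GPU.A100-v2.8": 22.00,
}

for name, rate in rates.items():
    print(f"{name}: ${training_cost(100, rate):,.2f} per 100 GPU-hours")
```

The same function prices any workload: a 1,000 GPU-hour pretraining run on OCI at these rates would come to $2,750 on-demand.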
NVIDIA H100 Instances
The NVIDIA H100 GPU, based on the Hopper architecture, delivers approximately 3x the training performance of the A100 for transformer-based models. Each H100 provides 80 GB of HBM3 memory, 990 TFLOPS of FP16 performance with sparsity, and 900 GB/s NVLink bandwidth (or 3.6 TB/s with NVSwitch in 8-GPU configurations). The H100 is the GPU of choice for training large language models, large vision models, and other workloads that benefit from its transformer engine.
On AWS, H100 instances are available as p5.48xlarge (8x H100 80GB, approximately $98.32/hour). These instances include 192 vCPUs, 2,048 GB RAM, and 3,200 Gbps EFA networking. The high per-hour cost is partially offset by the 3x performance improvement over A100 — a training job that takes 24 hours on A100 might complete in 8 hours on H100, making the total cost comparable. P5e and P5en instances, which carry the newer H200 GPU with 141 GB of HBM3e memory, are available in select regions.
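The break-even arithmetic behind "3x faster makes the total cost comparable" can be made explicit. A sketch using the approximate rates quoted above (the 3x speedup is a rough figure that varies by model and precision):

```python
# Break-even speedup: how much faster must the H100 finish the same job for
# its higher hourly rate to wash out? Rates are approximate figures from above.

A100_RATE = 32.77   # USD/hour, p4d.24xlarge (8x A100)
H100_RATE = 98.32   # USD/hour, p5.48xlarge (8x H100)

def job_cost(hours: float, rate: float) -> float:
    return hours * rate

# A 24-hour A100 job vs the same job at 3x speed on H100:
a100_cost = job_cost(24, A100_RATE)       # ~$786
h100_cost = job_cost(24 / 3, H100_RATE)   # ~$787

# The speedup at which the two families cost exactly the same:
breakeven_speedup = H100_RATE / A100_RATE  # ~3.0

print(f"A100: ${a100_cost:.0f}, H100 at 3x: ${h100_cost:.0f}")
print(f"H100 is cheaper whenever its speedup exceeds {breakeven_speedup:.2f}x")
```

At these rates the break-even speedup is almost exactly 3x, which is why the H100 decision hinges on whether your workload actually benefits from the transformer engine.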
On Azure, H100 instances are available as Standard_ND96isr_H100_v5 (8x H100 80GB, approximately $98.32/hour). Azure has invested heavily in H100 capacity, particularly for Azure OpenAI workloads, and availability has been improving. Azure also offers ND H100 v5 instances with InfiniBand networking for multi-node training at cluster scale.
On GCP, H100 instances are available through the a3-highgpu-8g instance type (8x H100 80GB, approximately $98.32/hour). GCP offers A3 Mega instances with 8x H100 and 3,200 Gbps GPU-to-GPU bandwidth for large-scale distributed training. Sustained-use and committed-use discounts apply, with CUDs reducing the effective cost by up to 57 percent for long-term training infrastructure.
On OCI, H100 instances are available as BM.GPU.H100.8 (8x H100 80GB). OCI's H100 pricing is competitive, typically 20-30 percent lower than AWS and Azure on-demand pricing. OCI has been partnering with NVIDIA on GPU clusters for enterprise AI workloads, and availability has been growing. The combination of lower per-hour pricing and free data egress (10 TB/month) makes OCI particularly attractive for training jobs that produce large output artifacts.
Spot and Preemptible GPU Pricing
Spot pricing for GPU instances can reduce costs by 60-90 percent, but availability is highly variable and interruption rates are much higher for GPU instances than for CPU instances due to the scarcity of GPU capacity. Spot GPU instances are best suited for short training runs (under 4 hours), hyperparameter tuning jobs where individual runs can be checkpointed and restarted, inference workloads that can tolerate brief interruptions, and batch processing of ML pipelines with checkpoint-and-resume logic.
On AWS, spot pricing for p4d.24xlarge instances averages a 60-70 percent discount when available. However, spot capacity for GPU instances is often unavailable in popular regions. Spot Fleet with the capacity-optimized allocation strategy increases the chances of getting and keeping spot GPU instances. For training jobs, implement checkpointing every 15-30 minutes so that interrupted jobs can resume from the last checkpoint rather than starting over.
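The checkpoint-and-resume pattern looks roughly like the sketch below. This is framework-agnostic and minimal: the file path, JSON state, and checkpoint cadence are illustrative stand-ins, and a real PyTorch job would save model and optimizer state with torch.save rather than a JSON dict.

```python
import json
import os
import tempfile

# Minimal checkpoint-and-resume loop for spot instances (illustrative sketch;
# a real training job would torch.save() model/optimizer state, not JSON).
CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def load_checkpoint() -> dict:
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def save_checkpoint(state: dict) -> None:
    """Write atomically so a spot eviction mid-write cannot corrupt the file."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename on POSIX and Windows

state = load_checkpoint()
for step in range(state["step"], 100):
    # Stand-in for one training step.
    state = {"step": step + 1, "loss": 1.0 / (step + 1)}
    if (step + 1) % 25 == 0:  # checkpoint periodically (every 15-30 min in practice)
        save_checkpoint(state)

print(f"finished at step {state['step']}")
```

If the instance is evicted mid-run, relaunching the same script picks up from the last saved step instead of step zero, which is what makes the spot discount usable for multi-hour jobs.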
On Azure, spot pricing for GPU instances offers similar 60-70 percent discounts. Azure provides a max price option where you can set the maximum you are willing to pay per hour — the instance runs as long as the spot price stays below your max price. The Eviction Type option lets you choose between deallocation (instance is stopped) and deletion (instance is deleted), with deallocation preferred for training jobs that can resume.
On GCP, spot VMs offer 60-91 percent discount on GPU instances. GCP spot instances do not have a maximum 24-hour lifetime (unlike the old preemptible VMs), making them more practical for longer training runs. GCP also provides managed instance groups with spot VMs that automatically recreate instances when they are preempted, which is useful for inference workloads.
On OCI, preemptible GPU instances offer significant discounts but availability is more limited. OCI's capacity-on-demand pricing model provides more predictable costs for GPU workloads that need guaranteed availability.
Reserved GPU Capacity
For teams that run GPU workloads continuously (24/7 training infrastructure, inference serving, research clusters), reserved capacity provides the best long-term pricing. One-year commitments typically offer a 30-40 percent discount, and three-year commitments a 50-60 percent discount, compared with on-demand pricing.
On AWS, EC2 Reserved Instances and Compute Savings Plans apply to GPU instances. A one-year all-upfront reservation for a p4d.24xlarge costs approximately $191,000 (compared to $287,000 at on-demand rates), a 33 percent savings. Three-year reservations save up to 60 percent. Capacity Reservations (independent of pricing commitments) can guarantee that GPU instances are available when you need them, which is important given the limited GPU supply.
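Before committing, it is worth computing the utilization level at which a reservation actually beats on-demand. A sketch using the approximate p4d.24xlarge figures above:

```python
# At what utilization does a one-year all-upfront reservation beat on-demand?
# Figures are the approximate p4d.24xlarge numbers quoted in this article.

ON_DEMAND_RATE = 32.77      # USD per instance-hour
RESERVED_ANNUAL = 191_000   # USD, one-year all-upfront (approximate)
HOURS_PER_YEAR = 8760

def on_demand_annual(utilization: float) -> float:
    """Annual on-demand spend at a given utilization fraction (0.0-1.0)."""
    return ON_DEMAND_RATE * HOURS_PER_YEAR * utilization

# The reservation wins once on-demand spend would exceed the upfront price:
breakeven_utilization = RESERVED_ANNUAL / (ON_DEMAND_RATE * HOURS_PER_YEAR)

print(f"Break-even utilization: {breakeven_utilization:.0%}")  # ~67%
```

At these figures the break-even sits around two-thirds utilization: a GPU cluster that is busy less than roughly 67 percent of the year is cheaper on-demand, which is why commitments belong to stable, sustained workloads.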
On Azure, Reserved VM Instances for GPU series provide similar discounts. One-year reservations save approximately 36 percent, and three-year reservations save approximately 57 percent. Azure also offers Azure Machine Learning compute clusters with low-priority (spot) nodes for cost-effective training infrastructure.
On GCP, Committed Use Discounts for GPU instances provide up to 57 percent discount for three-year commitments. GCP also offers GPU quotas that must be requested and approved for each project — verify that your quota is sufficient before purchasing commitments.
On OCI, annual flex pricing through Universal Credits provides volume discounts on GPU compute. For large-scale commitments, OCI's pricing team can negotiate custom rates that are often the most competitive of the four providers.
Networking for Distributed Training
Multi-node distributed training requires high-bandwidth, low-latency networking between GPU instances. The networking capability of your instances often determines whether you can efficiently scale training across multiple nodes or whether communication overhead negates the benefit of additional GPUs.
On AWS, EFA (Elastic Fabric Adapter) provides up to 3,200 Gbps of non-blocking bandwidth on p5 instances and 400 Gbps on p4d instances. EFA supports GPUDirect RDMA, allowing GPU-to-GPU communication across nodes without going through the CPU, which is essential for efficient all-reduce operations in distributed training.
On Azure, InfiniBand NDR is available on ND H100 v5 instances (400 Gbps per GPU, 3,200 Gbps per VM), providing comparable networking for distributed training.
On GCP, GPU-to-GPU networking on A3 instances provides 3,200 Gbps of bandwidth using Google's Jupiter network fabric.
On OCI, RDMA networking with 1,600 Gbps of bandwidth is available on BM.GPU.H100.8 instances through cluster networking.
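Why these bandwidth figures matter can be seen with the standard ring all-reduce cost model: per gradient synchronization, each GPU sends roughly 2(N-1)/N times the gradient size. The sketch below is an idealized lower bound that ignores latency, protocol overhead, and compute/communication overlap; the model size and GPU count are illustrative, not a provider benchmark.

```python
# Ideal ring all-reduce time: each GPU moves ~2*(N-1)/N bytes per gradient byte.
# A rough lower bound that ignores latency, overhead, and overlap.

def allreduce_seconds(param_bytes: float, num_gpus: int, bus_gbps: float) -> float:
    """Estimated time for one gradient all-reduce over the given interconnect."""
    traffic = 2 * (num_gpus - 1) / num_gpus * param_bytes  # bytes sent per GPU
    bytes_per_sec = bus_gbps * 1e9 / 8                     # Gbps -> bytes/second
    return traffic / bytes_per_sec

# Illustrative: a 7B-parameter model in FP16 has ~14 GB of gradients per step.
grad_bytes = 7e9 * 2

for gbps in (400, 3200):  # p4d-class vs p5-class per-node interconnect
    t = allreduce_seconds(grad_bytes, num_gpus=16, bus_gbps=gbps)
    print(f"{gbps} Gbps: ~{t:.3f} s per all-reduce")
```

Even in this idealized model, an 8x bandwidth difference translates directly into an 8x difference in synchronization time per step, which is why communication overhead can negate additional GPUs on a slow interconnect.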
For distributed training frameworks, all four providers support PyTorch DistributedDataParallel (DDP), NVIDIA NCCL for collective communications, DeepSpeed for large model training, and Megatron-LM for model parallelism. The choice of framework depends on your model architecture and scale rather than the cloud provider.
Managed ML Platforms vs. Raw Instances
Each cloud provider offers a managed ML platform (SageMaker, Azure ML, Vertex AI, OCI Data Science) that simplifies GPU resource management for training. These platforms handle instance provisioning, job scheduling, distributed training orchestration, experiment tracking, and model artifact management. The trade-off is typically a 10-20 percent markup on compute costs plus platform service charges.
For teams with dedicated ML infrastructure engineers, running training jobs on raw GPU instances (EC2, Azure VMs, GCE, OCI Compute) with open-source tools (PyTorch Lightning, Weights and Biases, MLflow) provides maximum flexibility and lower costs. For teams without ML infrastructure expertise, managed platforms reduce the operational burden and accelerate time to results, which often justifies the cost premium.
Evaluate managed platforms for: automated hyperparameter tuning (which can efficiently schedule many short GPU jobs), built-in distributed training support, automatic model artifact versioning and deployment, and cost governance features like managed spot training with automatic checkpointing and retry.
Cost optimization for ML training
Start training jobs on spot instances with checkpointing every 15 minutes. Use the cheapest provider (typically OCI on-demand or GCP/Azure spot) for initial experimentation and hyperparameter tuning. Reserve capacity only after your training pipeline is stable and you know your sustained GPU requirements. Right-size your GPU selection — not every training job needs H100s; many fine-tuning and smaller training jobs run perfectly on A100s at half the cost.
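The spot-versus-on-demand decision can be modeled with a simple expected-cost estimate: each eviction wastes, on average, half a checkpoint interval of work. The discount, eviction rate, and checkpoint interval below are illustrative assumptions, not measured figures.

```python
# Expected cost of a spot training run, accounting for work lost between the
# last checkpoint and an eviction. All rates here are illustrative assumptions.

def expected_spot_cost(job_hours: float, on_demand_rate: float,
                       spot_discount: float, evictions_per_hour: float,
                       ckpt_interval_hours: float) -> float:
    """Spot cost including expected rework (half a checkpoint interval per eviction)."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    expected_evictions = job_hours * evictions_per_hour
    wasted_hours = expected_evictions * ckpt_interval_hours / 2
    return (job_hours + wasted_hours) * spot_rate

on_demand = 24 * 32.77  # 24-hour job at the quoted p4d rate: ~$786

# Assumed: 65% discount, one eviction every 10 hours, checkpoints every 15 min.
spot = expected_spot_cost(24, 32.77, 0.65, evictions_per_hour=0.1,
                          ckpt_interval_hours=0.25)
print(f"on-demand ${on_demand:.0f} vs expected spot ~${spot:.0f}")
```

With frequent checkpointing, rework barely dents the spot savings; the same model shows the advantage eroding quickly if checkpoints are hours apart or evictions are frequent, which is the quantitative case for the 15-minute cadence above.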
Written by Jeff Monfield
Cloud architect and founder of CloudToolStack. Building free tools and writing practical guides to help engineers navigate AWS, Azure, GCP, and OCI.
Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.