Spot and Preemptible Instances: Saving 60-90% Without Getting Burned
AWS Spot, Azure Spot VMs, GCP Spot VMs, interruption handling, mixed instance strategies for batch processing, CI/CD runners, Kubernetes node pools, and web backends.
The Discount Everyone Knows About But Few Use Well
Every cloud engineer knows that spot instances (AWS), spot VMs (Azure and GCP), and preemptible VMs (GCP legacy) offer 60 to 90 percent discounts compared to on-demand pricing. The concept is simple: you use the cloud provider's spare capacity at a steep discount, and in exchange, the provider can reclaim your instances with little or no notice. AWS gives you a two-minute warning. Azure gives you 30 seconds. GCP Spot VMs give you 30 seconds. GCP Preemptible VMs (the older product) are automatically terminated after 24 hours even if capacity is available.
Despite these dramatic savings, most organizations use spot capacity for less than 10 percent of their compute. The reason is fear: fear of interruptions causing outages, fear of complexity in managing mixed fleets, and fear that the engineering effort required to handle interruptions gracefully outweighs the cost savings. After running production workloads on spot capacity for over seven years across all three major clouds, I can tell you that this fear is largely misplaced -- but only if you use spot instances for the right workloads with the right patterns.
This article covers the workloads that work well on spot, the ones that do not, and the specific patterns that let you capture those 60 to 90 percent savings without sacrificing reliability.
Spot Pricing and Interruption Rates by Cloud
AWS Spot Instances
AWS Spot pricing fluctuates based on supply and demand for each instance type in each Availability Zone. In practice, spot prices for popular instance types like m5.xlarge and c5.2xlarge in US regions are 60 to 70 percent below on-demand pricing. Less popular instance types and regions can see discounts of 80 to 90 percent. AWS publishes a Spot Instance Advisor that shows historical interruption rates by instance type -- most popular types in US regions have interruption rates below 5 percent.
AWS provides a two-minute warning before reclaiming a spot instance, delivered via instance metadata (a poll endpoint) and CloudWatch Events (a push notification). Two minutes is enough time to drain connections from a load balancer, checkpoint a batch job, or cordon a Kubernetes node and evict pods gracefully. It is not enough time to complete a long-running operation, which is why job checkpointing is essential for batch workloads.
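A minimal polling loop for the AWS metadata endpoint might look like the sketch below. The endpoint returns 404 until an interruption is actually scheduled, and `wait_for_interruption` is an illustrative helper, not an AWS SDK call. (If your instances enforce IMDSv2, you would additionally need to fetch and send a session token header.)

```python
import json
import time
import urllib.error
import urllib.request
from typing import Optional

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(body: str) -> Optional[dict]:
    """Parse the instance-action JSON, e.g.
    {"action": "terminate", "time": "2025-06-01T00:00:00Z"}.
    The action can be "stop", "terminate", or "hibernate"."""
    data = json.loads(body)
    return data if "action" in data else None

def check_spot_interruption(url: str = METADATA_URL) -> Optional[dict]:
    """Return the pending interruption notice, or None if there is none."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:        # 404 means no interruption is scheduled
            return None
        raise
    except urllib.error.URLError:  # metadata service unreachable
        return None

def wait_for_interruption(poll_seconds: int = 5) -> dict:
    """Block until AWS schedules a reclaim, then return the notice."""
    while (notice := check_spot_interruption()) is None:
        time.sleep(poll_seconds)
    return notice
```

In practice you would run `wait_for_interruption` in a background thread and use the returned notice to trigger draining and checkpointing.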
AWS formerly offered Spot Blocks (defined-duration spot instances at a smaller discount), but that feature is deprecated and no longer available for new launches. If you need guaranteed capacity for a fixed window, On-Demand Capacity Reservations are the supported option, at full on-demand pricing. For most use cases, standard spot instances with interruption handling are more cost-effective than trying to guarantee capacity.
Azure Spot VMs
Azure Spot VMs offer discounts of up to 90 percent compared to pay-as-you-go pricing. Unlike AWS where spot pricing fluctuates continuously, Azure lets you set a maximum price you are willing to pay. If the current spot price exceeds your maximum, Azure evicts your VM. You can set the maximum to -1 to use the on-demand price as your cap, meaning you will only be evicted for capacity reasons, not price reasons. This simplifies the pricing model compared to AWS.
Azure's eviction notice is 30 seconds, delivered through the Azure Metadata Service. This is significantly less than AWS's two minutes, which means your interruption handling must be faster. For Kubernetes workloads, this is usually fine because pod eviction and rescheduling can start within seconds. For standalone VMs running batch jobs, 30 seconds may not be enough to checkpoint state, so you need to checkpoint periodically during execution rather than only when the eviction notice arrives.
Azure also has an important limitation: Spot VMs cannot be resized. If you need a different instance size, you must delete the spot VM and create a new one. This affects auto-scaling strategies that rely on vertical scaling. For horizontal scaling with spot VMs, this is not an issue.
GCP Spot VMs (and Preemptible VMs)
GCP consolidated its spot capacity offerings under the "Spot VM" product, which replaced the older Preemptible VM product. Spot VMs offer 60 to 91 percent discounts and, unlike the old Preemptible VMs, do not have a 24-hour maximum lifetime. They can run indefinitely as long as capacity is available.
GCP provides a 30-second preemption notice via the instance metadata endpoint. GCP also supports a shutdown script that runs during the preemption grace period, giving you a hook to perform cleanup. The shutdown script approach is simpler than polling the metadata endpoint in many cases, especially for batch jobs that just need to save state to GCS before termination.
One GCP-specific caveat: unlike standard Compute Engine VMs, Spot VMs do not support live migration. Where a standard VM can be moved to different hardware during host maintenance without terminating, a Spot VM is always terminated instead. As on AWS and Azure, both capacity reclamation and maintenance events mean instance termination, so your interruption handling must cover both.
Diversify instance types
The single most effective strategy for reducing spot interruptions is diversifying across multiple instance types and availability zones. On AWS, use a mixed instances policy with 5 to 10 instance types that meet your performance requirements. Instead of requesting only c5.2xlarge, request c5.2xlarge, c5a.2xlarge, c5n.2xlarge, m5.2xlarge, and m5a.2xlarge. Each type draws from a different capacity pool, so an interruption in one pool does not affect the others. AWS EC2 Fleet and Auto Scaling Group mixed instance policies make this straightforward.
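As a sketch, the corresponding Auto Scaling Group configuration (as passed to `aws autoscaling create-auto-scaling-group --cli-input-json`) might look like this; the group name, launch template name, and subnet IDs are placeholders, and `price-capacity-optimized` asks AWS to favor deep, stable capacity pools:

```json
{
  "AutoScalingGroupName": "app-spot-asg",
  "MinSize": 5,
  "MaxSize": 30,
  "VPCZoneIdentifier": "subnet-aaa,subnet-bbb,subnet-ccc",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "app-template",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "c5.2xlarge" },
        { "InstanceType": "c5a.2xlarge" },
        { "InstanceType": "c5n.2xlarge" },
        { "InstanceType": "m5.2xlarge" },
        { "InstanceType": "m5a.2xlarge" }
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }
}
```

Spanning three subnets in different AZs multiplies the number of capacity pools: 5 instance types across 3 AZs gives 15 independent pools.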
Pattern 1: Batch Processing and Data Pipelines
Batch processing is the classic spot instance use case and the one with the highest ROI. Batch jobs are inherently tolerant of interruptions because they process discrete chunks of work that can be retried independently.
The key requirement is checkpointing. A batch job that processes 10 million records should checkpoint its progress every 100,000 records (or every few minutes, whichever comes first). When an interruption occurs, the job resumes from the last checkpoint instead of starting over. For ETL pipelines on AWS, EMR Spark clusters on spot instances with S3-based checkpointing achieve 70 to 80 percent cost reduction compared to on-demand clusters with minimal impact on total processing time.
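A checkpointing loop can be sketched like this; `checkpoint_path` would be an S3 or GCS object in production, and `process_record` stands in for your per-record work. Note the at-least-once semantics: records processed since the last checkpoint are reprocessed after a resume, so the work should be idempotent.

```python
import json
import os

def run_batch(records, checkpoint_path, process_record, checkpoint_every=100_000):
    """Process records in order, persisting the next offset every
    `checkpoint_every` records so an interrupted run resumes from the
    last checkpoint instead of starting over."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]   # resume point
    for i in range(start, len(records)):
        process_record(records[i])
        if (i + 1) % checkpoint_every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump({"next_offset": i + 1}, f)
    # final checkpoint marks the job complete
    with open(checkpoint_path, "w") as f:
        json.dump({"next_offset": len(records)}, f)
```

The same structure works whether the interruption arrives as a SIGTERM, a metadata notice, or an abrupt kill: only the work since the last checkpoint is lost.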
Real numbers: A daily ETL pipeline running on a 20-node EMR cluster with r5.2xlarge instances takes 3 hours to complete. On-demand cost: 20 nodes * $0.504/hr * 3 hours = $30.24 per run, or $907 per month. On spot: 20 nodes * $0.151/hr * 3.5 hours (accounting for occasional interruptions and restarts) = $10.57 per run, or $317 per month. That is a $590-per-month saving on a single pipeline. Across 10 pipelines, you are saving $5,900 per month.
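The arithmetic behind those figures, assuming a 30-day month (the rates are illustrative us-east-1 numbers and fluctuate):

```python
ON_DEMAND_RATE = 0.504  # r5.2xlarge on-demand, $/hr (illustrative)
SPOT_RATE = 0.151       # typical spot rate, $/hr (fluctuates)
NODES = 20

on_demand_run = NODES * ON_DEMAND_RATE * 3.0  # 3-hour run
spot_run = NODES * SPOT_RATE * 3.5            # ~3.5h with restarts

on_demand_month = on_demand_run * 30          # ≈ $907
spot_month = spot_run * 30                    # ≈ $317
monthly_saving = on_demand_month - spot_month # ≈ $590
```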
Apache Spark on Spot
Spark is particularly well-suited for spot instances because its architecture inherently supports task retry. If a spot instance running Spark executors is interrupted, the driver reassigns the lost tasks to remaining executors. The key configuration is using spot for executors only while keeping the driver on an on-demand instance. Losing the driver kills the entire job, so the driver must be reliable. Losing executors just slows down the job.
On AWS EMR, configure the core node group as on-demand (this hosts HDFS and the driver) and the task node group as spot with multiple instance types. On GCP Dataproc, use standard VMs for the master and primary workers, and Spot VMs for secondary workers. On Azure HDInsight, use on-demand for head nodes and spot for worker nodes.
Pattern 2: CI/CD Build Runners
CI/CD build runners are ideal for spot because builds are short-lived (typically 5 to 20 minutes), stateless (no persistent data on the runner), and tolerant of retries (a failed build due to instance interruption just needs to be re-triggered).
GitHub Actions self-hosted runners on spot: Use a Kubernetes cluster with a spot node pool running the Actions Runner Controller (ARC). When a workflow is triggered, ARC schedules a runner pod on a spot node. If the node is interrupted mid-build, the pod is evicted and the job fails with a lost-runner error; it can then be re-run manually or via a workflow-level retry step. The retry adds 5 to 10 minutes of delay, but this happens on less than 5 percent of builds in practice.
GitLab CI runners on spot: Use the GitLab Runner autoscaler with the Docker Machine or Instance executor configured to use spot instances. GitLab Runner handles instance provisioning and cleanup automatically. Set the MaxBuilds configuration to 1 so each instance runs only one build, eliminating state leakage between builds.
Real numbers: A team of 30 engineers running 200 builds per day, each requiring a c5.2xlarge for 15 minutes. On-demand: 200 builds * 0.25 hours * $0.34/hr = $17 per day, or $510 per month. On spot: 200 builds * 0.25 hours * $0.10/hr = $5 per day, or $150 per month. Retries at a 5 percent interruption rate (about 10 builds per day rerun on spot) add roughly $7.50 per month. Total savings: roughly $350 per month. Not life-changing for one team, but across an organization with 10 teams, that is $3,500 per month.
Do not use spot for deployment runners
Use spot for build and test runners, not for deployment runners. A spot interruption during a build wastes time but causes no damage -- you just rebuild. A spot interruption during a deployment can leave your infrastructure in a partially-applied state that requires manual intervention to fix. Keep deployment runners on on-demand instances. The cost difference is negligible compared to the risk.
Pattern 3: Kubernetes Node Pools
Kubernetes is the best platform for running production workloads on spot instances because its scheduler, pod disruption budgets, and horizontal pod autoscaler provide the building blocks for graceful interruption handling.
Mixed Node Pool Strategy
The standard pattern is three node pools: a small on-demand pool for system components (kube-system, monitoring, ingress controllers), a medium on-demand pool for stateful workloads (databases, caches, persistent queues), and a large spot pool for stateless application workloads. The on-demand pools provide a reliability baseline, and the spot pool provides cost-effective capacity for the bulk of your workloads.
Use node labels and taints to control which pods run on which pools. Taint the on-demand system pool with dedicated=system:NoSchedule so only system pods with a matching toleration are scheduled there. Taint the on-demand stateful pool with dedicated=stateful:NoSchedule. Leave the spot pool untainted so all pods can schedule there by default, but add a kubernetes.io/lifecycle=spot label for pod affinity rules.
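As a sketch, a system pod opting into the tainted on-demand pool carries a matching toleration and node selector. This assumes each pool also carries a `dedicated` node label mirroring its taint, which the taint alone does not provide:

```yaml
# Pod spec fragment for a system component (e.g. an ingress controller).
# The dedicated taint/label naming follows this article's convention.
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: system
      effect: NoSchedule
  nodeSelector:
    dedicated: system
```

The toleration lets the pod onto the tainted pool; the nodeSelector keeps it off the spot pool. Without both, the pod could legally land anywhere.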
Pod Disruption Budgets
Pod Disruption Budgets (PDBs) are essential for spot workloads. A PDB tells Kubernetes how many pods of a deployment can be unavailable simultaneously. For a deployment with 5 replicas and a PDB of minAvailable: 4, Kubernetes will only evict one pod at a time during a spot interruption, ensuring the service remains available.
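For the five-replica example, the PDB is only a few lines; the `app: api` selector is a placeholder that must match your deployment's pod labels:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: api
```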
The critical detail: PDBs only constrain voluntary disruptions (Kubernetes-initiated evictions). A spot instance termination itself is an involuntary disruption -- the node disappears regardless of PDBs. PDBs protect you during the graceful drain that happens when the spot termination notice triggers a node cordon and drain, not during an abrupt termination. This means you need a termination handler -- the AWS Node Termination Handler on EKS, a node termination handler for AKS, or GKE's built-in graceful node shutdown -- to cordon the node and begin draining pods as soon as the spot termination notice is received.
Overprovisioning for Rescheduling Speed
When a spot node is terminated, all pods on that node need to be rescheduled to other nodes. If there is no available capacity, the pods enter Pending state and wait for the cluster autoscaler to provision a new node, which takes 2 to 5 minutes. For production services with strict availability requirements, this rescheduling delay is unacceptable.
The solution is overprovisioning: keep a buffer of unused capacity in the cluster so that evicted pods can be scheduled immediately on existing nodes. The simplest approach is to deploy a low-priority "placeholder" deployment that requests resources but does nothing. When real pods need to be rescheduled, they preempt the placeholder pods, which frees up capacity instantly. The placeholder pods then trigger the autoscaler to provision new buffer capacity.
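A minimal version of the placeholder pattern, assuming the cluster autoscaler is running and placeholder pods sized to roughly one node each (adjust the requests to your node shape; names are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                # lower than any real workload, so anything preempts it
preemptionPolicy: Never   # placeholders never preempt real pods
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2             # a 2-node buffer
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause     # does nothing; just reserves capacity
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "7"
              memory: 28Gi
```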
Real numbers: For a 30-node spot cluster with 3 to 5 percent interruption rates, maintaining a 2-node overprovision buffer costs roughly $100 to $200 per month but eliminates rescheduling delays for 95 percent of interruption events. This is a small price for service continuity.
Pattern 4: Web Application Backends
Running stateless web application backends on spot is achievable but requires more careful architecture than batch processing or CI/CD runners. The key requirements are:
- Stateless design. No local state -- sessions in Redis or DynamoDB, file uploads directly to S3, configuration from environment variables or a config service.
- Multiple replicas. Run at least 3 replicas across multiple AZs. If one spot instance is terminated, the remaining replicas handle traffic while the replacement starts.
- Health check integration. The load balancer must stop sending traffic to a draining instance before it terminates. Configure connection draining timeout shorter than the spot termination notice window (under 2 minutes for AWS, under 30 seconds for Azure and GCP).
- Mixed instance strategy. Use a mix of spot and on-demand instances. A common ratio is 70 percent spot and 30 percent on-demand. The on-demand instances provide a baseline capacity that survives even a total spot interruption event (which is rare but possible during major capacity crunches).
Real numbers: A web API running on 10 c5.xlarge instances. On-demand: 10 * $0.17/hr * 730 hours = $1,241 per month. Mixed fleet (3 on-demand, 7 spot): (3 * $0.17 + 7 * $0.051) * 730 = $633 per month. That is a 49 percent reduction with no measurable impact on availability. If you are comfortable with the risk, going to 2 on-demand and 8 spot saves even more.
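The same arithmetic as a reusable helper (the rates are illustrative c5.xlarge figures and fluctuate):

```python
HOURS_PER_MONTH = 730

def mixed_fleet_monthly_cost(total, on_demand, od_rate, spot_rate):
    """Monthly cost of a fleet with `on_demand` on-demand instances
    and the remaining `total - on_demand` instances on spot."""
    spot = total - on_demand
    return (on_demand * od_rate + spot * spot_rate) * HOURS_PER_MONTH

all_on_demand = mixed_fleet_monthly_cost(10, 10, 0.17, 0.051)  # ≈ $1,241
mixed = mixed_fleet_monthly_cost(10, 3, 0.17, 0.051)           # ≈ $633
reduction = 1 - mixed / all_on_demand                          # ≈ 49%
```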
Workloads You Should NOT Run on Spot
Spot instances are not universally applicable. These workloads should stay on on-demand or reserved capacity:
- Databases and stateful services. Self-managed Postgres or MySQL, Elasticsearch, Redis, Kafka brokers. Losing a database instance risks data loss and extended recovery times. The savings are not worth the operational risk.
- Single-instance workloads. If you have exactly one instance running a service with no failover, a spot interruption means downtime. Spot requires redundancy to work safely.
- Long-running, non-checkpointable jobs. A 12-hour ML training job that cannot checkpoint and resume is a terrible fit for spot. A spot interruption at hour 11 wastes 11 hours of compute and requires starting over.
- Latency-sensitive real-time services. While spot instances perform identically to on-demand when running, the interruption and rescheduling process can cause latency spikes. For services with strict P99 latency SLAs, the occasional disruption may violate your SLO.
- Control plane components. Kubernetes masters, CI/CD servers (Jenkins, GitLab), monitoring systems (Prometheus, Grafana). These are the systems that manage your spot instances -- they need to be running when spot interruptions happen.
Savings Plan plus Spot is the optimal combination
The best cost optimization strategy combines Savings Plans (or Reserved Instances) for your baseline capacity with spot instances for variable capacity. Size your Savings Plan commitment to cover your minimum 24/7 workload -- the on-demand instances in your mixed fleet, your databases, your control plane. Then use spot for everything above that baseline. This gives you committed discount pricing for the capacity you always need and spot pricing for the capacity you only sometimes need. In practice, this combination achieves 50 to 70 percent overall compute cost reduction compared to purely on-demand.
Interruption Handling Implementation Checklist
- Instance metadata polling. On AWS, poll http://169.254.169.254/latest/meta-data/spot/instance-action every 5 seconds. On Azure, poll the Scheduled Events endpoint. On GCP, poll the maintenance event metadata.
- EventBridge or equivalent. On AWS, configure an EventBridge rule for EC2 Spot Instance Interruption Warning events. This provides a push notification in addition to the metadata poll, reducing the delay between notice and reaction.
- Graceful shutdown handler. Register a SIGTERM handler in your application that stops accepting new requests, completes in-flight requests, flushes buffers, and checkpoints state. Test this handler independently -- do not wait for your first spot interruption to discover it does not work.
- Connection draining. Configure your load balancer's deregistration delay to be shorter than the spot termination notice. On AWS ALB, set the deregistration delay to 90 seconds (under the 2-minute warning). On Azure, set it to 15 seconds.
- Node termination handler (Kubernetes). Deploy the AWS Node Termination Handler, the AKS Spot Handler (built into AKS), or configure GKE's graceful node shutdown. These components automatically cordon the node and begin pod eviction when a spot termination notice is received.
- Test interruptions regularly. On AWS, use FIS (Fault Injection Simulator) to simulate spot interruptions. On GCP, stop a spot VM manually. Verify that your application handles the interruption gracefully, that pods reschedule correctly, and that no requests are dropped.
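The graceful shutdown item above can be sketched as a small SIGTERM hook; the node termination handler (Kubernetes) or your own metadata poller delivers the signal, and `handle_one_request` / `flush_and_checkpoint` are placeholders for your application's work:

```python
import signal
import threading

class GracefulShutdown:
    """Flip an event when SIGTERM arrives so worker loops can drain.

    A spot termination notice is typically translated into SIGTERM by
    the node termination handler or a process supervisor.
    """
    def __init__(self):
        self.requested = threading.Event()
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested.set()

# Usage sketch: stop taking new work, finish in-flight work, checkpoint.
# shutdown = GracefulShutdown()
# while not shutdown.requested.is_set():
#     handle_one_request()
# flush_and_checkpoint()
```

Testing this handler with `kill -TERM` in a staging environment is much cheaper than discovering a bug during a real interruption.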
Related Tools
- EC2 Instance Match Helper -- Find equivalent instance types for spot diversification
- Multi-Cloud VM Comparison -- Compare VM pricing across clouds including spot pricing
Written by CloudToolStack Team
Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.
Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.