
Kubernetes Resource Limits and Requests: The Guide Nobody Gave You

CPU vs memory requests and limits, QoS classes, OOMKill vs CPU throttling, VPA vs HPA, and a practical tuning methodology with real production numbers.

CloudToolStack Team · March 22, 2026 · 14 min read

The Resource Configuration Problem Nobody Warns You About

Every Kubernetes tutorial shows you how to set resource requests and limits. Almost none of them explain what actually happens when you get them wrong. I have spent the past six years running production Kubernetes clusters across EKS, AKS, and GKE, and I can tell you that misconfigured resource requests and limits are the single most common cause of application instability in Kubernetes environments. Not bad code. Not network issues. Resource misconfiguration.

The symptoms are subtle and frustrating. Pods get evicted during peak traffic even though the node has available memory. CPU-intensive workloads slow to a crawl while the node reports 40 percent utilization. The scheduler refuses to place pods on nodes that clearly have capacity. Autoscaling triggers too late or not at all. If any of this sounds familiar, your resource configuration is almost certainly wrong.

This guide covers what requests and limits actually do at the kernel level, how QoS classes affect pod eviction order, why CPU throttling is more dangerous than OOMKill in practice, and a concrete methodology for tuning resource configurations based on real workload data.

Requests vs Limits: What Actually Happens

A resource request is a guarantee. When you set a CPU request of 500m and a memory request of 256Mi, you are telling the Kubernetes scheduler that this pod needs at least that much capacity on whatever node it lands on. The scheduler uses requests for bin packing -- it adds up all the requests of pods on a node and will not schedule a new pod if the total would exceed the node's allocatable capacity.

A resource limit is a ceiling. When you set a CPU limit of 1000m and a memory limit of 512Mi, you are telling the kubelet to enforce that the container never exceeds those values. But the enforcement mechanism is completely different for CPU and memory, and this is where most people get confused.
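Concretely, both values are set per container in the pod spec. A minimal sketch (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server             # illustrative name
spec:
  containers:
    - name: app
      image: example/api:1.0   # illustrative image
      resources:
        requests:
          cpu: 500m            # guarantee: used by the scheduler for bin packing
          memory: 256Mi
        limits:
          cpu: 1000m           # ceiling: enforced by CFS throttling
          memory: 512Mi        # ceiling: enforced by the OOM killer
```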

CPU: Throttling, Not Killing

CPU is a compressible resource. When a container hits its CPU limit, the kernel does not kill it. Instead, the CFS (Completely Fair Scheduler) throttles the container by limiting its time on the CPU. The container's processes are paused for the remainder of their scheduling period, then allowed to run again in the next period. The default CFS period is 100ms, so if your limit is 500m (half a core), your container gets 50ms of CPU time per 100ms period.

Here is where it gets painful. Throttling affects latency in ways that are hard to diagnose. A Java application with a 1-core CPU limit might process requests fine at low traffic, but during a spike, request latency increases from 50ms to 800ms. The CPU utilization metric shows 90 percent -- not 100 percent -- because the throttling happens at a granularity that utilization metrics do not capture well. You need to look at the container_cpu_cfs_throttled_seconds_total metric in Prometheus to actually see throttling.

The practical impact: I have seen a production Go service where removing the CPU limit entirely (while keeping the request) reduced p99 latency from 230ms to 45ms. The service was being throttled during garbage collection pauses, which created cascading latency spikes across downstream services.

Memory: OOMKill, No Second Chances

Memory is incompressible. When a container exceeds its memory limit, the kernel's OOM killer terminates the process immediately. There is no graceful degradation. The container gets a SIGKILL, and Kubernetes restarts it according to the pod's restart policy. If this happens repeatedly, the pod enters CrashLoopBackOff.

Memory limits are more straightforward to reason about than CPU limits, but they have a subtle interaction with the Go runtime, JVM, and other managed runtimes. A Java application running in a container with a 512Mi memory limit will not automatically set its heap to fit within that limit. You need to pass -XX:MaxRAMPercentage=75 (or a similar flag) to tell the JVM to use 75 percent of the container's memory limit for the heap, leaving room for non-heap memory, thread stacks, and the OS.

Go applications have a similar issue. The Go garbage collector uses the GOMEMLIMIT environment variable (available since Go 1.19) to set a soft memory limit. Without it, the GC may not run aggressively enough to stay within the container's memory limit, leading to OOMKills that look random but are actually predictable.

The JVM memory trap

If you run Java in Kubernetes without setting -XX:MaxRAMPercentage or -Xmx, the JVM defaults its maximum heap to 25 percent of the memory it detects. Modern JVMs (JDK 10+, backported to 8u191) are container-aware and read the cgroup memory limit, but older JVMs see the host node's memory instead -- on a 64 GB node, a legacy JVM will try to allocate a 16 GB heap inside a container with a 1 GB memory limit and be OOMKilled immediately. Even on container-aware JVMs, the 25 percent default is usually the wrong heap size for the container. Always set explicit JVM memory flags in containerized applications.
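One way to wire these runtime flags into a pod spec is through environment variables -- JAVA_TOOL_OPTIONS is read by the JVM at startup, and GOMEMLIMIT by the Go runtime (1.19+). Images and values here are illustrative:

```yaml
# Container snippets applying the runtime memory flags discussed above.
containers:
  - name: java-app
    image: example/java-app:1.0      # illustrative
    env:
      - name: JAVA_TOOL_OPTIONS      # picked up by the JVM at startup
        value: "-XX:MaxRAMPercentage=75"
    resources:
      limits:
        memory: 512Mi                # heap capped at ~384Mi (75 percent)
  - name: go-app
    image: example/go-app:1.0        # illustrative
    env:
      - name: GOMEMLIMIT             # soft GC limit, Go 1.19+
        value: "400MiB"              # kept below the 512Mi hard limit
    resources:
      limits:
        memory: 512Mi
```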

QoS Classes: The Eviction Pecking Order

Kubernetes assigns every pod a Quality of Service class based on its resource configuration. This class determines the order in which pods are evicted when a node runs low on resources. There are three classes, and understanding them is critical for production reliability.

Guaranteed (requests == limits for all containers)

When every container in a pod has identical requests and limits for both CPU and memory, the pod gets the Guaranteed QoS class. These pods are the last to be evicted when a node is under memory pressure. They are also the pods that get the most predictable performance because the kernel reserves their resources and enforces strict limits. Use Guaranteed QoS for your most critical workloads: databases, stateful services, and anything where eviction causes significant recovery time.

Burstable (requests < limits, or only requests set)

If at least one container in a pod has requests that differ from its limits, the pod gets the Burstable QoS class. These pods can use resources beyond their requests (up to their limits) when the node has spare capacity, but they are evicted before Guaranteed pods when the node is under pressure. Most application workloads should be Burstable. It allows efficient resource sharing during normal operation while providing a safety net against runaway resource consumption.

BestEffort (no requests or limits set)

Pods with no resource requests or limits at all get the BestEffort QoS class. These pods are the first to be evicted under memory pressure, and the scheduler does not account for them during bin packing. BestEffort is appropriate only for batch jobs that can tolerate interruption, like log processing or background analytics.

The practical QoS strategy

For most production clusters, I recommend this approach: set memory requests equal to memory limits (to avoid OOMKills and get predictable eviction behavior), set CPU requests based on actual usage data, and either set CPU limits generously or remove them entirely. This gives you Burstable QoS with predictable memory behavior and flexible CPU usage. The only exception is databases and stateful workloads, which should be Guaranteed.
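The recommended shape, as a resources stanza (the 250m request stands in for whatever your usage data shows):

```yaml
# Burstable QoS with predictable memory behavior:
# memory request == limit, CPU request from usage data, no CPU limit.
resources:
  requests:
    cpu: 250m          # from observed typical usage, not guesswork
    memory: 512Mi
  limits:
    memory: 512Mi      # equal to the request: no memory overcommit
    # no cpu limit: the pod can burst when the node has spare cycles,
    # and the request still guarantees its share under contention
```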

The Numbers: Request/Limit Ratios and Bin Packing

Let me walk through what happens with different request/limit ratios on a real cluster. Consider a 3-node cluster where each node has 4 vCPUs and 16 GB of memory (typical for m5.xlarge or e2-standard-4 instances).

Scenario 1: Requests == Limits (Guaranteed)

If every pod requests 500m CPU and 512Mi memory with identical limits, you can fit roughly 7 pods per node (accounting for system overhead and DaemonSets). That is 21 pods across 3 nodes. CPU utilization during normal traffic might be 30 to 40 percent because the pods are not using their full requests most of the time, but you cannot schedule more pods because the scheduler sees the node as full. Cluster cost efficiency: low. Reliability: high.

Scenario 2: Low Requests, High Limits (Overcommitted)

If pods request 100m CPU and 128Mi memory but have limits of 1000m CPU and 1Gi memory, you can schedule around 30 pods per node based on requests. That is 90 pods across 3 nodes. But if all 90 pods actually try to use their limits simultaneously, the nodes need 90 cores and 90 GB of memory -- far more than the cluster has. During traffic spikes, you will see aggressive throttling, OOMKills, and pod evictions. Cluster cost efficiency: appears high until something breaks. Reliability: poor under load.

Scenario 3: The Sweet Spot (2:1 to 3:1 Ratio)

Based on data from clusters I have managed, a 2:1 to 3:1 ratio between limits and requests works well for most web services. Request 250m CPU and 256Mi memory with limits of 500m to 750m CPU and 512Mi to 768Mi memory. This lets you schedule about 14 pods per node (42 across 3 nodes) while providing headroom for traffic spikes. The key is that most web services are idle 60 to 80 percent of the time, so the total burst capacity of all pods will rarely be needed simultaneously.
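The per-node pod counts in these scenarios fall out of simple division against allocatable capacity. A sketch, assuming roughly 3.7 allocatable cores and 14.5 GiB allocatable memory on a 4 vCPU / 16 GB node (the function is mine, not a scheduler API):

```python
# Bin packing by requests: the scheduler fits pods until either the
# CPU or the memory requests would exceed the node's allocatable capacity.

def pods_per_node(cpu_req_m: int, mem_req_mi: int,
                  alloc_cpu_m: int = 3700, alloc_mem_mi: int = 14848) -> int:
    """How many identical pods fit on one node, counting requests only."""
    return min(alloc_cpu_m // cpu_req_m, alloc_mem_mi // mem_req_mi)

print(pods_per_node(500, 512))  # Scenario 1: 7 pods per node
print(pods_per_node(250, 256))  # Scenario 3: 14 pods per node
print(pods_per_node(100, 128))  # Scenario 2: 37 by requests alone --
                                # DaemonSets and per-node pod caps push
                                # the practical number lower
```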

VPA vs HPA: Choosing the Right Autoscaler

The Horizontal Pod Autoscaler (HPA) adds more pods when metrics exceed a threshold. The Vertical Pod Autoscaler (VPA) adjusts the resource requests and limits of existing pods. They solve different problems, and using the wrong one creates more issues than it solves.

When to Use HPA

HPA is the right choice for stateless workloads where adding replicas improves throughput linearly. Web servers, API services, queue consumers, and most microservices should use HPA. Configure HPA to scale on CPU utilization (target 60 to 70 percent of the request, not the limit) or on custom metrics like request rate or queue depth. Avoid scaling on memory utilization -- memory usage in garbage-collected languages does not correlate well with load, and you will get oscillating scaling behavior.
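An HPA following this advice might look like the sketch below (autoscaling/v2 API; the target Deployment name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                    # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server               # illustrative target
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # percent of the CPU *request*, not the limit
```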

When to Use VPA

VPA is the right choice for workloads where you do not know the right resource requests, or where resource usage patterns change over time. It is particularly useful for stateful workloads where horizontal scaling is difficult -- databases, caches, and single-instance batch processors. VPA observes actual resource usage over time and recommends (or automatically applies) updated requests and limits.

The biggest VPA gotcha: in its default mode (Auto), VPA restarts pods to apply new resource values. For stateful workloads, this means brief downtime during the update. Use the "Off" or "Initial" mode to get recommendations without automatic restarts, then apply changes during maintenance windows. Kubernetes 1.27 introduced in-place pod resize as an alpha feature, which will eventually let VPA update resources without restarts, but it is not production-ready yet.

VPA and HPA together

You can run VPA and HPA on the same workload, but they must not scale on the same metric. Configure HPA to scale on custom metrics (request rate, queue depth) and VPA to manage CPU and memory requests. If both try to manage CPU, they will fight each other -- HPA adds pods while VPA increases per-pod resources, leading to unpredictable behavior. GKE's Multidimensional Pod Autoscaler (MPA) handles this coordination automatically.

A Practical Tuning Methodology

Stop guessing at resource values. Here is a step-by-step methodology that works in production.

Step 1: Deploy with VPA in Recommendation Mode

Create a VPA resource in "Off" mode for each workload. Let it observe resource usage for at least 7 days, ideally including both normal traffic and peak periods (weekend vs weekday, batch processing windows, etc.). VPA will produce target, lower bound, and upper bound recommendations for both CPU and memory.
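A recommendation-only VPA looks like this (autoscaling.k8s.io/v1 CRD from the VPA project; the target name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa                # illustrative
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server           # illustrative target
  updatePolicy:
    updateMode: "Off"          # recommend only -- never restart pods
```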

Step 2: Set Requests to the VPA Target

Use the VPA target recommendation as your resource request. This represents the amount of resources the workload typically needs. Add a 10 to 20 percent buffer for safety. Do not use the lower bound -- it represents the minimum needed to avoid OOMKill, with no headroom for normal variance.

Step 3: Set Memory Limits Equal to Requests (or 1.2x)

For memory, set the limit equal to the request or at most 1.2x the request. Memory overcommit is dangerous because the recovery mechanism (OOMKill) is disruptive. If you find that memory usage occasionally spikes, increase the request rather than widening the gap between request and limit.

Step 4: Set CPU Limits to 2-3x Requests (or Remove Them)

For CPU, either set limits to 2 to 3x the request or remove them entirely. The argument for removing CPU limits is strong: CPU is compressible, throttling hurts latency more than contention does, and the request already guarantees your minimum share of the CPU. Google internally does not set CPU limits on most workloads, and GKE Autopilot does not support CPU limits at all. If you do set CPU limits, monitor container_cpu_cfs_throttled_seconds_total and increase limits if you see significant throttling.
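If you keep CPU limits, alerting on sustained throttling catches problems early. A sketch using the cAdvisor period counters, assuming the Prometheus Operator's PrometheusRule CRD is installed; the threshold and names are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling            # illustrative
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 15m
          annotations:
            summary: "Container throttled in over 25% of CFS periods"
```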

Step 5: Monitor and Iterate

Track these metrics weekly: CPU request utilization (actual usage / request), memory request utilization, CPU throttling rate, OOMKill count, and pod eviction count. Aim for 60 to 80 percent request utilization on CPU and 70 to 90 percent on memory. Below 50 percent means you are wasting capacity. Above 90 percent means you do not have enough headroom for spikes.
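The utilization targets above reduce to a simple verdict per workload. A sketch of the weekly check (function and thresholds are mine, following the ranges stated above):

```python
# Classify a workload's request utilization against a target band,
# e.g. 0.60-0.80 for CPU, 0.70-0.90 for memory.

def utilization_verdict(usage: float, request: float,
                        low: float, high: float) -> str:
    """Compare actual usage / request against a target utilization band."""
    ratio = usage / request
    if ratio < low:
        return "overprovisioned"    # wasting reserved capacity
    if ratio > high:
        return "underprovisioned"   # no headroom for spikes
    return "ok"

print(utilization_verdict(180, 250, 0.60, 0.80))  # 72% of a 250m request: ok
print(utilization_verdict(100, 250, 0.60, 0.80))  # 40%: overprovisioned
print(utilization_verdict(240, 250, 0.60, 0.80))  # 96%: underprovisioned
```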

Common Mistakes and How to Fix Them

Mistake 1: Copying Requests from Stack Overflow

I regularly see teams copy resource values from blog posts or Stack Overflow answers without adjusting for their workload. A Python Flask API and a Java Spring Boot monolith have completely different resource profiles. The Flask app might need 100m CPU and 128Mi memory; the Spring Boot app might need 500m CPU and 1Gi memory just to start. Always measure your actual workload.

Mistake 2: Setting Requests Based on Peak Usage

If your service peaks at 800m CPU during the morning traffic spike but uses 200m CPU the rest of the day, setting the request to 800m wastes 75 percent of the reserved capacity for 22 out of 24 hours. Set the request to the typical usage (200 to 300m) and let HPA add pods during the spike, or set a higher limit to handle the burst on a single pod.

Mistake 3: Ignoring Init Container Resources

Init containers run before the main containers and can have their own resource requirements. A common pattern is using an init container to run database migrations, which might need significantly more memory than the main application. If the init container's resource requests are higher than the main container's, the scheduler must reserve the higher amount during scheduling, even though the init container runs only briefly.

Mistake 4: Not Accounting for DaemonSet Overhead

Every node runs DaemonSet pods for logging agents, monitoring agents, CNI plugins, and kube-proxy. On a typical production cluster, DaemonSets consume 500m to 1500m CPU and 512Mi to 2Gi memory per node. If you plan your node capacity without accounting for DaemonSet overhead, you will consistently have less schedulable capacity than you expect. Check your DaemonSet resource requests with kubectl get daemonsets -A -o json | jq '.items[].spec.template.spec.containers[].resources'.

Cluster-Level Resource Planning

Beyond individual pod configuration, cluster-level resource planning determines your overall cost efficiency. Here are the numbers that matter.

Node allocatable capacity: A node with 4 vCPUs does not have 4 full cores available for pods. The kubelet reserves capacity for system daemons (typically 6 to 10 percent of CPU and memory) and eviction thresholds (default 100Mi memory). On a 4-core, 16 GB node, expect roughly 3.6 to 3.8 allocatable cores and 14.5 to 15 GB of allocatable memory.

Target cluster utilization: For clusters without autoscaling, aim for 65 to 75 percent request-to-allocatable ratio. This provides enough headroom to reschedule pods if a node fails. For clusters with Cluster Autoscaler or Karpenter, you can push to 80 to 85 percent because new nodes will be provisioned when capacity is needed.

Node size matters: Larger nodes have less relative overhead from DaemonSets and system reservations. On a 2-core node, DaemonSets might consume 25 percent of capacity. On an 8-core node, they might consume 8 percent. Larger nodes also improve bin packing efficiency because the scheduler has more flexibility in placing pods. The tradeoff is blast radius -- losing a 64-core node displaces more pods than losing a 4-core node.


Final Recommendations

If you take away three things from this guide, let them be these. First, always set memory requests equal to or very close to memory limits -- memory overcommit is the most common cause of unexpected pod evictions in production. Second, seriously consider removing CPU limits entirely for stateless workloads, and monitor CFS throttling for any workload that keeps them. Third, never guess at resource values -- deploy with VPA in observation mode, collect at least a week of data, and set requests based on actual usage patterns.

Resource configuration is not a set-and-forget exercise. Workloads change over time as features are added, traffic patterns shift, and dependencies are updated. Build resource monitoring into your operational reviews and adjust requests quarterly at minimum. The clusters that run most efficiently are the ones where someone is actively watching resource utilization metrics and asking whether the current configuration still matches reality.

Written by CloudToolStack Team

Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.

Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.