
API Rate Limiting Patterns: Token Bucket, Sliding Window, and Cloud Implementation

Covers the token bucket, sliding window, and fixed window algorithms; cloud API gateway rate limiting across AWS, Azure, and GCP; WAF rate rules; and client-side retry strategies.

CloudToolStack Team · February 21, 2026 · 13 min read

Why Rate Limiting Is a First-Class Architectural Concern

Rate limiting used to be something you bolted on after launch when a misbehaving client started hammering your API. That approach stopped working around the time every application became a distributed system of microservices, third-party integrations, and event-driven workflows all making API calls at machine speed. In 2026, rate limiting is not a nice-to-have defensive measure -- it is a core architectural decision that affects your cost model, your reliability guarantees, and your ability to offer fair multi-tenant service.

I have seen rate limiting failures cause everything from unexpected five-figure cloud bills to cascading outages that brought down services that were not even related to the overloaded endpoint. The problem is almost never that teams forget to implement rate limiting. It is that they implement the wrong algorithm for their use case, configure thresholds based on guesswork instead of measurement, or ignore the client-side retry behavior that determines whether rate limiting actually protects you or just delays the inevitable overload.

This article covers the three most important rate limiting algorithms, how they map to cloud API gateway implementations, and the client-side patterns that make or break your rate limiting strategy in production.

The Three Algorithms You Actually Need to Know

There are dozens of rate limiting algorithms in the academic literature, but in practice, cloud infrastructure uses three: fixed window, sliding window, and token bucket. Understanding the tradeoffs between them is the difference between rate limiting that works and rate limiting that creates new problems.

Fixed Window

The simplest approach: divide time into fixed intervals (say, one minute), count requests in each interval, and reject requests once the count exceeds the limit. A client allowed 100 requests per minute gets their counter reset to zero at the start of every minute.

Fixed window is easy to implement and easy to reason about, but it has a well-known edge case: the boundary problem. A client can send 100 requests at 11:59:59 and another 100 at 12:00:00, effectively getting 200 requests in a two-second window. For many use cases this is fine. For endpoints where a burst of 2x the limit could cause problems -- say, a payment processing API or a resource provisioning endpoint -- it is not.
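To make the boundary problem concrete, here is a minimal in-memory fixed window counter in Python. This is a sketch for illustration, not production code -- real implementations keep the counter in Redis or inside the gateway itself -- but it shows exactly how two back-to-back bursts slip through:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow `limit` requests per client per `window_s`-second fixed interval."""

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.counters = defaultdict(int)  # (client, window index) -> request count

    def allow(self, client, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_s)  # all requests in the same interval share a counter
        key = (client, window)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True

# The boundary problem: 100 requests at t=59.9s land in window 0, and
# 100 more at t=60.0s land in window 1 -- all 200 are accepted in ~0.1s.
limiter = FixedWindowLimiter(limit=100, window_s=60)
burst_1 = sum(limiter.allow("c1", now=59.9) for _ in range(100))
burst_2 = sum(limiter.allow("c1", now=60.0) for _ in range(100))
```

Passing `now` explicitly makes the clock injectable for testing; a real limiter would just call `time.time()` internally.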

AWS API Gateway's throttling uses a variation of fixed window for its per-second rate limits. Azure API Management's rate-limit policy uses fixed window counting. If you are configuring either of these, be aware of the boundary burst behavior.

Sliding Window

Sliding window eliminates the boundary problem by considering a rolling time period. Instead of resetting the counter at fixed intervals, the algorithm looks at the number of requests in the past N seconds from the current moment. A 100-requests-per-minute sliding window at 12:00:30 counts all requests from 11:59:30 to 12:00:30.

The tradeoff is implementation complexity. A true sliding window requires storing individual request timestamps, which uses more memory than a simple counter. In practice, most implementations use a sliding window log (storing timestamps and pruning old entries) or a sliding window counter (a weighted combination of the current and previous fixed windows). The sliding window counter approximation is what most cloud services actually use because it provides nearly the same burst protection with minimal overhead.
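The sliding window counter approximation can be sketched in a few lines of Python. It keeps only two integers per client (the current and previous window counts) and weights the previous window by how much of it still overlaps the rolling period -- again a simplified in-memory illustration, not a distributed implementation:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window by overlap."""

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.windows = {}  # client -> (window index, current count, previous count)

    def allow(self, client, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window_s)
        win, curr, prev = self.windows.get(client, (idx, 0, 0))
        if idx == win + 1:       # rolled into the next window: current becomes previous
            win, curr, prev = idx, 0, curr
        elif idx > win + 1:      # idle long enough that both windows have expired
            win, curr, prev = idx, 0, 0
        # Fraction of the previous window still inside the rolling one-window period
        overlap = 1.0 - (now % self.window_s) / self.window_s
        estimated = prev * overlap + curr
        if estimated >= self.limit:
            self.windows[client] = (win, curr, prev)
            return False
        self.windows[client] = (win, curr + 1, prev)
        return True
```

Note how this closes the boundary hole: 100 requests at 11:59:59 still count almost fully against a request arriving at 12:00:00, so the second burst is rejected.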

GCP Cloud Endpoints and Azure API Management's rate-limit-by-key policy both support sliding window behavior. If you are building a multi-tenant SaaS API where fair usage between tenants matters, sliding window is the right choice for per-tenant limits.

Token Bucket

Token bucket is the most flexible algorithm and the one I recommend for most production API rate limiting. Imagine a bucket that holds a maximum number of tokens (the burst capacity). Tokens are added to the bucket at a steady rate (the refill rate). Each request consumes one token. If the bucket is empty, the request is rejected or queued.

The elegance of token bucket is that it naturally handles two different concerns: sustained throughput (controlled by the refill rate) and burst capacity (controlled by the bucket size). A token bucket with a refill rate of 10 tokens per second and a bucket size of 100 allows a sustained rate of 10 requests per second but permits short bursts of up to 100 requests. This maps perfectly to real-world API traffic patterns where clients send requests in bursts but average out to a steady rate over time.
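A lazy-refill token bucket is only a few lines: rather than adding tokens on a timer, it tops up the bucket based on elapsed time whenever a request arrives. This sketch mirrors the example above (refill rate 10/s, bucket size 100), with an injectable clock for testing:

```python
import time

class TokenBucket:
    """Token bucket: `rate` tokens/second refill, `capacity` = burst size."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = None               # timestamp of the previous check

    def allow(self, now=None, cost=1.0):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # Lazy refill: credit tokens for the time elapsed since the last check
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Sustained rate 10 req/s, bursts up to 100 -- analogous to an API Gateway
# configuration of rateLimit=10, burstLimit=100.
bucket = TokenBucket(rate=10, capacity=100)
```

The `cost` parameter defaults to one token per request, but it is the hook that later enables cost-aware limiting, where expensive operations consume more tokens.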

AWS API Gateway uses token bucket for its REST API throttling. The rateLimit setting is the refill rate and the burstLimit is the bucket size. Understanding this model is essential for configuring API Gateway throttling correctly -- I have seen teams set a burst limit equal to their rate limit, effectively eliminating burst capacity and causing unnecessary throttling during normal traffic spikes.

Choosing the right algorithm

Use fixed window when simplicity matters and boundary bursts are acceptable (internal APIs, non-critical endpoints). Use sliding window when you need fair per-tenant or per-client limiting without burst edge cases (multi-tenant SaaS, public APIs with usage tiers). Use token bucket when you need to allow controlled bursts while enforcing a sustained rate (most production APIs, API gateways, load balancers).

Cloud API Gateway Rate Limiting in Practice

Each cloud provider implements rate limiting differently in their API gateway products. The configuration options, default behaviors, and failure modes vary significantly, and understanding these differences is critical if you operate across multiple clouds or are choosing a gateway for a new project.

AWS API Gateway

AWS API Gateway offers two flavors: REST APIs and HTTP APIs. REST APIs provide the most granular rate limiting with per-method, per-stage, and per-API key throttling. HTTP APIs are cheaper but offer only route-level throttling. Both use the token bucket algorithm.

The default account-level limit is 10,000 requests per second with a burst of 5,000. These are regional soft limits that you can increase via a support request, but the per-route and per-client limits you configure are what matter in practice. A common mistake is setting up usage plans and API keys for rate limiting but forgetting that the account-level throttle applies first. If you have 50 clients each allowed 500 requests per second but your account limit is 10,000, then once 20 clients burst simultaneously (20 × 500 = 10,000), the 21st client will be throttled by the account limit even though its individual limit has not been reached.

REST API throttling returns a 429 status code with a Retry-After header. HTTP APIs return 429 but without the Retry-After header, which means your clients need to implement their own backoff logic. This distinction matters more than you might think -- I have debugged production issues where a client library expected a Retry-After header, did not get one, and fell back to an aggressive fixed retry interval that made the throttling worse.


Azure API Management

Azure API Management (APIM) takes a policy-based approach to rate limiting. You define rate limiting rules as XML policies that can be applied at the product, API, or operation level. APIM offers two rate limiting policies: rate-limit (fixed count per time period) and rate-limit-by-key (dynamic limiting based on request attributes like IP, subscription key, or custom headers).

The rate-limit-by-key policy is particularly powerful for multi-tenant APIs. You can limit by subscription key for paying customers and by IP address for anonymous traffic, all in the same policy. The counter-key expression supports C# expressions, so you can build complex limiting logic like "100 requests per minute per tenant, but premium tenants get 1,000."

One gotcha with APIM rate limiting: in multi-region deployments, counters are local to each gateway instance by default. If you have APIM deployed in East US and West Europe, a client can get 2x the configured limit by splitting requests across regions. The rate-limit policy has no built-in distributed counter option. You need to use an external cache (Redis) with the rate-limit-by-key policy and a shared cache backend to get globally consistent rate limiting. This is documented but easy to miss.


GCP API Gateway and Cloud Endpoints

GCP offers API Gateway (managed, serverless) and Cloud Endpoints (runs on your infrastructure). Both support rate limiting through service control quotas defined in the OpenAPI specification. You set quotas per metric (requests, bytes, custom dimensions) and per consumer (API key or service account).

GCP's approach is more declarative than AWS or Azure. You define quota limits in your OpenAPI spec, and the gateway enforces them automatically. The advantage is that rate limits are version-controlled alongside your API definition. The disadvantage is that changes require redeployment of the API configuration -- you cannot adjust limits on the fly through the console like you can with AWS usage plans or Azure APIM policies.

Cloud Armor, GCP's WAF and DDoS protection service, adds another layer of rate limiting at the edge. Cloud Armor rate limiting operates on L7 attributes (IP, headers, path) and uses a token bucket algorithm. You can combine Cloud Armor rate limiting at the edge with API Gateway quotas at the application level for defense in depth.

WAF Rate Rules and DDoS Protection

API gateway rate limiting protects your application from legitimate but excessive traffic. WAF rate rules protect against malicious traffic -- credential stuffing, API scraping, application-layer DDoS attacks. These are different threats that require different rate limiting configurations, and confusing the two leads to either inadequate protection or excessive false positives.

AWS WAF Rate-Based Rules

AWS WAF rate-based rules evaluate request rates over a five-minute rolling window. The minimum threshold is 100 requests per five minutes (effectively 0.33 requests per second). When the threshold is exceeded, the source IP is blocked for the remainder of the five-minute window. You can scope rules to specific URI paths, headers, or query strings to create targeted protections -- for example, limiting login attempts to 20 per five minutes while allowing higher rates on read-only endpoints.

A pattern I use frequently: combine a WAF rate-based rule at a low threshold (say, 2,000 requests per five minutes from a single IP) as a baseline DDoS protection layer with API Gateway per-key throttling at higher thresholds for legitimate API consumers. The WAF rule catches attackers and scrapers. The API Gateway throttling enforces fair usage among authenticated clients. They serve different purposes and should be configured independently.

Azure WAF and Front Door Rate Limiting

Azure Front Door's rate limiting operates at the edge and can block traffic before it reaches your API Management instance or application. Rate limiting rules in Azure WAF can match on client IP, socket address, or geo-location. The evaluation window is either one minute or five minutes, and you can configure the action as block, log, or redirect.

The critical difference from AWS WAF is that Azure Front Door rate limiting does not support custom headers or URI path matching in the rate limiting rule itself -- you need to combine it with match conditions. This is less flexible but arguably simpler to reason about. For most API protection scenarios, rate limiting by IP with match conditions for the login or registration paths covers the most important attack vectors.

GCP Cloud Armor Rate Limiting

Cloud Armor rate limiting uses a token bucket algorithm and evaluates traffic per client IP by default. You can configure the rate limit threshold, the conform action (allow), the exceed action (deny with 429, or redirect), and the enforce-on-key (IP, cookie, header, or XFF IP). The token bucket configuration lets you set both the rate and the burst capacity, giving you more control than the fixed-window approaches in AWS and Azure WAF.

WAF rate limiting versus API gateway rate limiting

WAF rate rules are coarse-grained protections designed to catch bad actors. API gateway rate limiting is fine-grained control for managing legitimate consumer traffic. Do not try to use WAF rate rules for API consumer management (the granularity is wrong) and do not rely solely on API gateway throttling for DDoS protection (it operates too late in the request path). Use both layers.

Client-Side Retry Strategies That Actually Work

Rate limiting is only half the equation. The other half is what clients do when they get throttled. Bad retry behavior can turn rate limiting from a protective mechanism into a cascading failure amplifier. I have seen a single client with an aggressive retry policy generate 10x its normal traffic volume after getting throttled, which throttled other clients, which triggered their retries, and so on until the entire API was effectively down.

Exponential Backoff with Jitter

The gold standard for retry behavior is exponential backoff with full jitter. The formula is: sleep = random(0, min(cap, base * 2^attempt)). The exponential backoff ensures that retries spread out over time. The jitter ensures that multiple clients that got throttled at the same moment do not all retry at the same time, which would create a "thundering herd" that overwhelms the rate limiter again.
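The formula translates directly into code. This is a minimal sketch of the full-jitter variant; `base` and `cap` values here are illustrative, not any particular SDK's defaults:

```python
import random

def full_jitter_delay(attempt, base=0.1, cap=20.0):
    """sleep = random(0, min(cap, base * 2^attempt)) -- full jitter.

    attempt: zero-based retry count.
    base:    initial backoff ceiling in seconds (attempt 0 sleeps in [0, base)).
    cap:     maximum backoff ceiling in seconds.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Because the sleep is drawn uniformly from zero up to the exponential ceiling, two clients throttled at the same instant almost never retry at the same moment -- that randomness is what breaks up the thundering herd.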

AWS SDKs implement this by default with a base of 25ms and a cap of 20 seconds. Azure SDKs use a similar strategy. GCP client libraries implement exponential backoff with configurable parameters. If you are building a custom client, use the same pattern. The most common mistake is implementing exponential backoff without jitter, which creates synchronized retry waves.

Respecting Retry-After Headers

When an API returns a 429 response with a Retry-After header, that header tells the client exactly how long to wait before retrying. Always respect this header -- it means the server is giving you specific guidance based on its current load and your rate limit state. Ignoring it and using your own backoff schedule means you might retry too soon (wasting a request and extending your throttle) or too late (unnecessarily delaying your work).

Build your client retry logic to check for Retry-After first. If present, use it. If absent, fall back to exponential backoff with jitter. This two-tier approach handles both well-behaved APIs that provide retry guidance and APIs that just return a bare 429.
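That two-tier logic looks like this in a transport-agnostic sketch. The `send` callable and its `(status, headers, body)` return shape are assumptions for illustration, and the example only handles the delta-seconds form of Retry-After (the header may also carry an HTTP-date, which a real client should parse too):

```python
import random
import time

def retry_with_server_hints(send, max_attempts=5, base=0.5, cap=30.0):
    """Call `send()` until it succeeds or attempts run out.

    On a 429, honor the Retry-After header if present; otherwise fall back
    to exponential backoff with full jitter.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # server guidance wins (delta-seconds form)
        else:
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    return status, headers, body  # still throttled after max_attempts
```

The key property is the ordering: server guidance first, client heuristics second.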

Circuit Breaker Integration

For critical paths where retries are not acceptable -- real-time user interactions, payment processing, time-sensitive operations -- integrate rate limiting responses with a circuit breaker pattern. After a configurable number of 429 responses in a short period, the circuit breaker opens and the client stops making requests entirely for a cooldown period, returning a cached response or a graceful degradation to the user.

This prevents the pathological case where a rate-limited client keeps retrying, consuming its rate limit budget on retries instead of new requests. In a microservices architecture, a circuit breaker at the API gateway client layer prevents one overloaded downstream service from consuming the rate limit budget that other services need.
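A minimal 429-driven circuit breaker can be sketched as follows. The threshold, period, and cooldown values are illustrative; the caller is responsible for serving a cached or degraded response while the circuit is open:

```python
import time

class RateLimitCircuitBreaker:
    """Open after `threshold` 429s within `period` seconds; stay open for
    `cooldown` seconds, then allow traffic through again (half-open)."""

    def __init__(self, threshold=5, period=10.0, cooldown=30.0):
        self.threshold = threshold
        self.period = period
        self.cooldown = cooldown
        self.failures = []     # timestamps of recent 429 responses
        self.opened_at = None  # when the circuit opened, or None if closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                return False          # open: caller serves cached/degraded response
            self.opened_at = None     # cooldown over: close and try real traffic
            self.failures = []
        return True

    def record_throttle(self, now=None):
        """Call this each time the upstream returns 429."""
        now = time.monotonic() if now is None else now
        self.failures = [t for t in self.failures if now - t < self.period]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

While open, the client makes zero requests, so its rate limit budget recovers instead of being burned on doomed retries.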

Advanced Patterns for Production

Adaptive Rate Limiting

Static rate limits are a starting point, but production traffic is not static. Adaptive rate limiting adjusts thresholds based on current system health. When the backend is healthy, limits are relaxed to maximize throughput. When latency increases or error rates rise, limits tighten to shed load before the system degrades.

AWS Application Load Balancer does not support adaptive rate limiting natively, but you can approximate it by combining CloudWatch alarms with API Gateway stage-level throttle settings updated via the API. Azure APIM supports this through policies that reference external cache values -- a background process monitors health metrics and updates a rate limit value in Redis that the policy reads on each request. GCP Cloud Armor supports adaptive protection that automatically detects and mitigates L7 DDoS attacks.

Distributed Rate Limiting

In a multi-region or multi-instance deployment, rate limiting counters need to be shared across instances to be effective. Without a shared counter, a client can exceed their global limit by distributing requests across regions or instances.

The standard solution is a centralized or replicated counter store. Redis with its atomic increment operations and TTL-based key expiry is the most common choice. DynamoDB with conditional writes works for AWS-native stacks. For edge rate limiting, Cloudflare Workers KV and AWS CloudFront Functions with CloudFront KeyValueStore both support globally distributed counters with eventual consistency, which is acceptable for rate limiting where exact precision is less important than order-of-magnitude correctness.
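The Redis pattern reduces to an atomic INCR plus a TTL on first touch. The sketch below uses a tiny in-memory stand-in for Redis so it is self-contained and deterministic (it takes an explicit clock, which real Redis does not); with the real redis-py client, `incr` and `expire` behave the same way but against shared state visible to every instance:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for Redis INCR/EXPIRE, for illustration only."""

    def __init__(self):
        self.store = {}  # key -> [count, expiry timestamp]

    def incr(self, key, now):
        entry = self.store.get(key)
        if entry is None or entry[1] <= now:       # absent or expired: reset
            entry = [0, float("inf")]
            self.store[key] = entry
        entry[0] += 1
        return entry[0]                            # atomic in real Redis

    def expire(self, key, ttl, now):
        if key in self.store:
            self.store[key][1] = now + ttl

def allowed(redis, client, limit, window_s=60, now=None):
    """Fixed-window counter shared across instances via atomic INCR."""
    now = time.time() if now is None else now
    key = f"rl:{client}:{int(now // window_s)}"    # one key per client per window
    count = redis.incr(key, now)
    if count == 1:
        redis.expire(key, window_s * 2, now)       # old windows clean themselves up
    return count <= limit
```

Because INCR is atomic on the server, two gateway instances incrementing the same key can never both see a count under the limit when it has in fact been exceeded.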

Cost-Aware Rate Limiting

Not all API requests cost the same to serve. A request that triggers a complex database query is more expensive than one that hits a cache. Cost-aware rate limiting assigns different token costs to different operations. A list endpoint might cost 1 token per request, while a full-text search costs 10 tokens, and a report generation endpoint costs 100 tokens.

GitHub's GraphQL API uses this approach -- each query has a calculated cost based on the number of nodes requested, and your rate limit budget is measured in "points" rather than raw request count. If you are building a GraphQL API or any API where operation costs vary significantly, this pattern prevents a small number of expensive requests from degrading service for everyone while still allowing high throughput for cheap operations.
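Cost-aware limiting is a small extension of the token bucket: each operation consumes a different number of tokens. The operation names and costs below are hypothetical, chosen to match the 1/10/100 example above:

```python
import time

OPERATION_COSTS = {"list": 1, "search": 10, "report": 100}  # illustrative costs

class CostAwareBucket:
    """Token bucket where each operation type consumes a different token cost."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = None

    def allow(self, operation, now=None):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # Lazy refill, exactly as in a plain token bucket
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        cost = OPERATION_COSTS[operation]
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A single report generation drains the same budget as one hundred list calls, so one expensive consumer cannot starve everyone else while cheap reads still flow at high throughput.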

Rate limit observability

Whatever rate limiting approach you use, instrument it thoroughly. Track the number of requests allowed versus rejected per client, the distribution of retry attempts, the 99th percentile latency of rate limit checks, and the frequency of rate limit threshold changes. Without this data, you are tuning rate limits blind. Every cloud API gateway provides these metrics -- CloudWatch for API Gateway, Azure Monitor for APIM, Cloud Monitoring for GCP. Set up dashboards and alerts for rate limit rejection spikes, which often indicate either a legitimate traffic increase (time to raise limits) or an attack (time to investigate).

Implementation Checklist

Whether you are implementing rate limiting for the first time or auditing an existing configuration, work through this checklist:

  1. Measure before you limit. Collect at least two weeks of traffic data per endpoint before setting thresholds. Base limits on the 95th percentile of observed traffic with a 2x to 3x safety margin, not on guesses.
  2. Layer your defenses. WAF rate rules at the edge for DDoS and abuse. API gateway throttling for per-consumer fairness. Application-level rate limiting for expensive operations. Each layer serves a different purpose.
  3. Communicate limits to clients. Return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response, not just on 429s. Clients that can see their remaining budget can self-throttle before hitting the limit.
  4. Test failure modes. What happens when your rate limit store (Redis, DynamoDB) is unavailable? Fail open (allow all requests) or fail closed (reject all requests)? The right answer depends on the endpoint. Public APIs should generally fail open to avoid outages. Payment endpoints should fail closed to avoid unbounded processing.
  5. Document everything. Rate limits should be documented in your API specification, developer portal, and error responses. Undocumented rate limits create support tickets and frustrated developers.
  6. Plan for growth. Review and adjust rate limits quarterly. Traffic patterns change. A limit that was generous six months ago may be causing throttling for your fastest-growing customer today.
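Checklist item 3 is cheap to implement. A sketch of building the three headers from a fixed-window limiter's state (field names follow the common X-RateLimit-* convention; `X-RateLimit-Reset` here is the epoch second the window resets, one of several conventions in the wild):

```python
import time

def rate_limit_headers(limit, remaining, window_s=60, now=None):
    """Build advisory rate limit headers for every response, not just 429s."""
    now = time.time() if now is None else now
    reset = int(now // window_s + 1) * window_s  # epoch second the window resets
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset),
    }
```

Attach these to every response so well-behaved clients can self-throttle before ever seeing a 429.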


Written by CloudToolStack Team

Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.

Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.