Building SRE Incident Response Runbooks for Cloud Infrastructure
Runbook structure, alert correlation, escalation paths, and detailed runbooks for high CPU, disk full, cert expiry, DNS failure, and database connection exhaustion.
Why Most Runbooks Collect Dust
Every SRE team has runbooks. Most of them are useless. They were written after a major incident, stuffed into a Confluence page, and never updated. When the next incident hits, the on-call engineer opens the runbook, discovers it references a monitoring dashboard that no longer exists, describes a service architecture that was refactored six months ago, and includes SSH commands for servers that have been decommissioned. They close the runbook and troubleshoot from scratch.
I have been on-call for production cloud infrastructure for over a decade, and I can tell you that the difference between a 15-minute resolution and a 3-hour outage is almost never technical skill. It is whether the on-call engineer has a runbook that tells them exactly what to check, in what order, and what the expected output should be at each step. Good runbooks are not documentation. They are executable decision trees that turn a stressful, ambiguous incident into a structured, repeatable process.
This guide covers how to build runbooks that actually get used, with specific templates for the five most common cloud infrastructure incidents.
Runbook Structure That Works
After iterating on runbook formats across multiple organizations, I have settled on a structure that balances completeness with usability during the stress of an active incident.
Section 1: Alert Context (30 Seconds to Read)
The first section answers: "What just fired, and how bad is it?" Include the alert name, the metric that triggered it, the threshold, and the severity level. Most importantly, include the customer impact statement -- what users experience when this alert fires. "CPU above 90 percent" does not convey urgency. "API response times exceed 2 seconds, affecting all users of the checkout flow" does.
Section 2: Quick Triage (2 Minutes)
A checklist of 3 to 5 diagnostic steps that determine the root cause category. Each step should be a specific command or dashboard link with the expected healthy output. The engineer runs through these steps in order and branches to the appropriate remediation section based on what they find. Do not explain why each step matters during triage -- that is for the learning section at the end.
Section 3: Remediation Steps (Varies)
Specific, copy-pasteable commands or console steps for each root cause category identified in triage. Include rollback procedures for every remediation action. If a step requires approval or escalation, say so explicitly and include the escalation contact.
Section 4: Verification
How to confirm that the remediation worked. This is the set of metrics, log queries, or health checks that should return to normal after the fix. Include the expected time to recovery -- "CPU should drop below 70 percent within 5 minutes of scaling" or "Error rate should return to baseline within 2 minutes of the deployment rollback."
Section 5: Escalation Matrix
Who to contact if the runbook steps do not resolve the issue, organized by time of day and severity. Include both the communication channel (Slack channel, PagerDuty escalation, phone number) and the expected response time for each level.
The runbook test
A good runbook passes the "new hire test": could an engineer who joined the team last week follow this runbook during an incident at 3 AM and resolve the issue without calling someone else? If the answer is no, the runbook is missing context, commands, or decision criteria. Game-day exercises where a junior engineer follows the runbook while a senior engineer observes are the best way to find gaps.
Runbook 1: High CPU Utilization
Alert Context
Alert: CPU utilization exceeds 85 percent for 5 minutes on [service-name].
Severity: P2 (P1 if checkout or authentication service).
Customer impact: Increased API latency. At sustained 95 percent or above, requests begin timing out and returning 503 errors.
Quick Triage
- Check if it is a single instance or cluster-wide. Open the service dashboard and look at CPU across all instances. If one instance is hot and others are normal, the issue is likely a stuck process or uneven load balancing. If all instances are high, it is a traffic spike or a code regression.
- Check recent deployments. Was a new version deployed in the last 2 hours? Code regressions (infinite loops, inefficient queries, missing pagination) are the most common cause of sudden CPU increases. Check your deployment tool for recent releases.
- Check traffic volume. Compare current request rate to the same time yesterday and last week. A proportional increase in traffic and CPU suggests organic growth or a traffic spike. A disproportionate increase (CPU doubled but traffic is normal) suggests a code or dependency issue.
- Check for downstream dependency issues. If a downstream service is slow or failing, your service may be retrying requests, holding connections, or spinning in retry loops. Check the health of databases, caches, and external APIs.
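The first triage branch can be sketched as a tiny helper. This is an assumption-laden toy, not part of any monitoring product: you feed it per-instance CPU percentages (integers, e.g. exported from your dashboard) and it names the branch to follow.

```shell
# Hypothetical triage helper: classify a high-CPU alert from per-instance
# CPU percentages. The 85 matches the alert threshold in this runbook.
classify_cpu() {
  # $@ = per-instance CPU percentages as integers
  hot=0; total=0
  for cpu in "$@"; do
    total=$((total + 1))
    if [ "$cpu" -gt 85 ]; then hot=$((hot + 1)); fi
  done
  if [ "$hot" -eq 0 ]; then echo "healthy"
  elif [ "$hot" -eq "$total" ]; then echo "cluster-wide"
  elif [ "$hot" -eq 1 ]; then echo "single-instance"
  else echo "partial"
  fi
}

classify_cpu 92 45 40   # single-instance
```

One hot instance points at a stuck process or uneven load balancing; all instances hot points at a traffic spike or code regression, exactly as the triage steps describe.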
Remediation
If single instance: Restart the instance. On ECS, stop the task (a new one will launch automatically). On Kubernetes, delete the pod. On EC2, check top processes first, then reboot if necessary.
If code regression: Roll back to the previous deployment version. Do not debug in production -- roll back first, investigate later. Verify CPU returns to normal within 5 minutes.
If traffic spike: Scale out. Increase the desired count for ECS services or trigger a manual scale on the HPA. If autoscaling is configured, verify it is responding and check whether it has hit its maximum. Increase the maximum if needed.
If downstream dependency: Enable circuit breakers if available. Add rate limiting on the retry path. Escalate to the dependency team.
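For the single-instance case, confirm which instance is actually hot before restarting anything. A minimal sketch, assuming you can export "instance-id cpu-percent" pairs from your metrics tooling (the instance IDs below are placeholders):

```shell
# Hypothetical helper: given "instance-id cpu-percent" pairs on stdin,
# print the instance with the highest CPU so you restart the right one.
hottest_instance() {
  sort -k2 -rn | head -n 1 | awk '{print $1}'
}

printf 'i-0aaa 42\ni-0bbb 97\ni-0ccc 40\n' | hottest_instance   # i-0bbb
```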
Runbook 2: Disk Full
Alert Context
Alert: Disk utilization exceeds 90 percent on [instance/volume].
Severity: P1 (data loss risk if the volume fills completely).
Customer impact: Application may fail to write logs, temp files, or database transactions. Databases will crash or become read-only.
Quick Triage
- Identify the full filesystem. SSH into the instance and run df -h. Note which mount point is full: root volume, data volume, or temp directory.
- Find large files. Run du -sh /* 2>/dev/null | sort -rh | head -20 on the full filesystem. Common culprits: log files under /var/log, core dumps under /tmp, database WAL files, or unrotated application logs.
- Check log rotation. Verify logrotate is configured and running. Check /var/log/syslog or /var/log/messages for logrotate errors. Applications writing to custom log paths often bypass logrotate entirely.
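The first two triage steps can be collapsed into one check. This is a sketch that parses df -P output; the 90 matches this runbook's alert threshold:

```shell
# Flag any filesystem at or above a utilization threshold from `df -P`
# output (POSIX format keeps each filesystem on one line).
full_filesystems() {
  # reads `df -P` output on stdin; $1 = utilization threshold in percent
  awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use)
    if (use + 0 >= limit) print $6 " " use "%" }'
}

df -P | full_filesystems 90
```

Anything it prints is a mount point to run the du step against.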
Remediation
Immediate relief: Delete or compress old log files. Remove core dumps from /tmp. Clear package manager caches (apt clean, yum clean all). If the database WAL directory is full, check for replication lag or a hung replication slot.
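The "compress old log files" step can be scripted, but treat this as a hedged sketch: the directory, the *.log pattern, and the 7-day age are all assumptions you should check against your retention policy before running it in production.

```shell
# Gzip *.log files older than 7 days under a given directory.
# gzip replaces each file with file.log.gz in place.
compress_old_logs() {
  find "$1" -name '*.log' -type f -mtime +7 -exec gzip {} +
}
```

Usage would be something like compress_old_logs /var/log/myapp, where /var/log/myapp is a hypothetical application log directory.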
For EBS volumes on AWS: You can increase the volume size online without downtime. Modify the volume in the EC2 console, then run growpart and resize2fs (ext4) or xfs_growfs (XFS) to extend the filesystem. This takes effect within minutes.
For persistent volumes in Kubernetes: If the StorageClass supports volume expansion (most CSI drivers do), edit the PVC spec to increase the requested storage. The kubelet will resize the volume and filesystem automatically.
Prevention: Set up disk usage alerts at 70 percent (warning) and 85 percent (critical). Configure log rotation for all application logs. Use lifecycle policies on log storage (CloudWatch Logs retention, S3 lifecycle rules). Set up automated volume expansion scripts for production databases.
Runbook 3: Certificate Expiry
Alert Context
Alert: TLS certificate for [domain] expires in [N] days.
Severity: P1 if less than 7 days, P2 if 7 to 30 days.
Customer impact: When the certificate expires, all HTTPS connections fail with certificate errors. Browsers show security warnings. API clients reject the connection. This is a complete outage for the affected domain.
Quick Triage
- Verify the expiry date. Run echo | openssl s_client -connect domain.com:443 -servername domain.com 2>/dev/null | openssl x509 -noout -dates. This shows the actual certificate being served, which may differ from what your certificate manager reports.
- Check the certificate source. Is this an ACM certificate (auto-renewed by AWS), a Let's Encrypt certificate (auto-renewed by certbot or another ACME client), or a manually purchased certificate? ACM certificates auto-renew 60 days before expiry if DNS or email validation is still valid. Let's Encrypt certificates auto-renew 30 days before expiry if the renewal process is working.
- Check renewal logs. For certbot, check /var/log/letsencrypt/letsencrypt.log. For ACM, check the ACM console for renewal status. Common failures: the DNS validation record was deleted, the HTTP validation path is blocked by WAF rules, or the certbot cron job stopped running.
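The expiry check converts naturally into "days remaining", which maps onto the P1/P2 severity split above. A sketch using the same openssl tooling; it assumes GNU date, with a BSD fallback that is only lightly tested:

```shell
# Days until a PEM certificate expires.
days_until_expiry() {
  # $1 = path to a PEM certificate
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  # GNU date parses the openssl format directly; the -j form is a BSD fallback
  end_s=$(date -d "$end" +%s 2>/dev/null \
    || date -j -f '%b %e %T %Y %Z' "$end" +%s)
  echo $(( (end_s - $(date +%s)) / 86400 ))
}
```

Point it at a file saved from the s_client pipeline in the triage step (append | openssl x509 > cert.pem) and compare the result against the 7-day and 30-day thresholds.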
Remediation
ACM certificate not renewing: Check the validation status in the ACM console. If DNS validation, verify the CNAME record still exists in Route 53 or your DNS provider. If the record was deleted, recreate it from the ACM certificate details. ACM will retry validation and issue the renewed certificate within hours.
Let's Encrypt not renewing: Run certbot renew --dry-run to test the renewal process. Fix any errors, then run certbot renew to force renewal. Restart nginx/Apache to pick up the new certificate.
Emergency: Certificate already expired: If the certificate has already expired and you cannot renew it immediately, the fastest mitigation is to put CloudFront, Azure Front Door, or a Cloudflare proxy in front of the domain. These services can issue new certificates in minutes, restoring HTTPS while you fix the underlying renewal issue.
Runbook 4: DNS Resolution Failure
Alert Context
Alert: DNS resolution failing for [domain/service].
Severity: P1 (complete service outage for affected domain).
Customer impact: Users cannot reach the application. API clients fail with DNS resolution errors. All services depending on the affected domain are impacted.
Quick Triage
- Verify from multiple locations. Check DNS resolution from your workstation, a server in the affected region, and an external DNS checker (dig via Google DNS: dig @8.8.8.8 domain.com). If it fails everywhere, the authoritative DNS is broken. If it fails only from specific locations, it is a propagation or routing issue.
- Check the authoritative nameservers. Run dig NS domain.com to see the nameservers, then query them directly: dig @ns-xxx.awsdns-xx.com domain.com. If the authoritative servers respond correctly, the issue is with resolvers or caching. If they fail, the hosted zone configuration is broken.
- Check for recent DNS changes. Look at the Route 53, Azure DNS, or Cloud DNS change history. A deleted record, modified TTL, or changed nameserver delegation can cause immediate resolution failures.
- Check the domain registrar. Verify the domain has not expired and the nameserver delegation matches your DNS provider. Domain expiry is a surprisingly common cause of DNS outages for secondary domains and newly acquired domains.
Remediation
Deleted DNS record: Recreate the record in your DNS provider. If you do not know the correct value, check your IaC state file (Terraform state, CloudFormation stack) or recent DNS audit logs. Route 53 records propagate in 60 seconds for records with a 60-second TTL.
Nameserver mismatch: If the registrar's nameservers do not match the DNS provider's nameservers, update the registrar. This happens when someone recreates a hosted zone (which assigns new nameservers) without updating the registrar. Propagation of NS record changes takes 24 to 48 hours, so this is not a quick fix.
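The delegation check is just a set comparison: the nameservers the registrar lists (e.g. from whois) against the ones your hosted zone expects. A sketch, with placeholder nameserver names; it relies on word splitting, so pass each list as one whitespace-separated string:

```shell
# Compare two nameserver lists, ignoring order.
ns_mismatch() {
  # $1 = registrar NS list, $2 = hosted zone NS list (space-separated)
  a=$(printf '%s\n' $1 | sort)
  b=$(printf '%s\n' $2 | sort)
  [ "$a" = "$b" ] && echo "match" || echo "mismatch"
}

ns_mismatch "ns1.example.com ns2.example.com" "ns2.example.com ns1.example.com"   # match
```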
Domain expired: Renew the domain immediately at the registrar. Most registrars have a grace period after expiry. If the domain entered redemption, contact the registrar's support for expedited recovery.
DNS TTL during incidents
If you need to make a DNS change during an incident, remember that the old record may be cached by resolvers for the duration of its TTL. A record with a 3600-second (1-hour) TTL means some users will see the old value for up to an hour after you make the change. For critical records, proactively lower the TTL to 60 or 300 seconds before planned changes. During an unplanned incident, you cannot speed up cache expiry -- you just have to wait.
Runbook 5: Database Connection Exhaustion
Alert Context
Alert: Database connections exceed 80 percent of max_connections on [database instance].
Severity: P1 (new connections will fail when limit is reached, causing application errors).
Customer impact: Application returns 500 errors for all requests that require database access. Partial or complete service outage.
Quick Triage
- Check current connection count and sources. For PostgreSQL: SELECT client_addr, usename, datname, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY 4 DESC. For MySQL: SELECT user, host, db, COUNT(*) as connections FROM information_schema.processlist GROUP BY 1,2,3 ORDER BY 4 DESC. This tells you which application or service is consuming the most connections.
- Check for idle connections. For PostgreSQL: SELECT count(*) FROM pg_stat_activity WHERE state = 'idle'. A high number of idle connections suggests a connection pooling misconfiguration or a connection leak in the application.
- Check for long-running queries. For PostgreSQL: SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes' ORDER BY duration DESC. Long-running queries hold connections open and can cause connection pile-ups.
- Check recent deployments or scaling events. A deployment that doubled the number of application instances also doubles the number of database connections if each instance maintains its own connection pool. An autoscaling event can have the same effect.
Remediation
Immediate relief -- kill idle connections. For PostgreSQL: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes'. This frees connections that are idle and likely leaked. The application will reconnect on the next request.
Kill long-running queries. If a specific query is holding many connections: SELECT pg_cancel_backend(pid) (graceful) or SELECT pg_terminate_backend(pid) (forceful) for the offending processes.
Reduce application pool size. If the connection surge is from a scaling event, reduce the per-instance pool size so that the total connections (instances multiplied by pool_size) stays below 80 percent of max_connections. This is an application configuration change that typically requires a restart.
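The sizing rule above is simple arithmetic, and writing it down avoids mental math during an incident. A sketch, using integer division (so it rounds down, which is the safe direction):

```shell
# Largest per-instance pool size that keeps total connections under
# 80 percent of max_connections.
max_pool_size() {
  # $1 = max_connections on the database, $2 = number of app instances
  echo $(( $1 * 80 / 100 / $2 ))
}

max_pool_size 500 20   # 20 connections per instance
```

With max_connections at 500 and 20 instances, anything above a pool size of 20 risks exhaustion the next time autoscaling adds instances.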
Increase max_connections. As a temporary measure, increase max_connections on the database. For RDS, modify the parameter group; note that max_connections is a static parameter for RDS PostgreSQL (PostgreSQL itself requires a restart to change it), so the new value takes effect only after a reboot. Be aware that increasing max_connections increases memory usage -- each connection consumes roughly 5 to 10 MB.
Add connection pooling. If you do not have a connection pooler (RDS Proxy, PgBouncer, or the Azure built-in pooler), deploy one. This is the permanent fix for connection exhaustion in applications with many instances or serverless functions.
Alert Correlation: Connecting the Dots
Individual alerts tell you something is wrong. Correlated alerts tell you why. Build your alerting system to group related alerts and surface the probable root cause.
Example correlation: High CPU + increased request rate + normal error rate = organic traffic spike. High CPU + normal request rate + increased error rate = code regression or dependency failure. High CPU + decreased request rate + increased error rate = downstream dependency timeout causing retry storms.
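These three patterns can be encoded directly. This is a toy sketch of the decision table, not a real correlation engine; the inputs are the direction of each signal relative to baseline:

```shell
# Classify a high-CPU alert from correlated signals.
correlate_cpu() {
  # $1 = request rate: up|normal|down; $2 = error rate: up|normal
  case "$1:$2" in
    up:normal) echo "organic traffic spike" ;;
    normal:up) echo "code regression or dependency failure" ;;
    down:up)   echo "downstream timeout causing a retry storm" ;;
    *)         echo "no known pattern -- triage manually" ;;
  esac
}

correlate_cpu normal up   # code regression or dependency failure
```

Real platforms express the same idea as grouping and correlation rules; the point is that the mapping from signal combinations to probable cause is explicit, not tribal knowledge.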
Example correlation: Database connection exhaustion + application 500 errors + recent deployment = new deployment increased connection usage. Database connection exhaustion + application 500 errors + autoscaling event = pool size not adjusted for horizontal scaling.
Most monitoring platforms (Datadog, PagerDuty, Grafana OnCall) support alert grouping and correlation rules. Configure them to group alerts from the same service within a 5-minute window, and add correlation rules for known patterns. This reduces alert fatigue and helps the on-call engineer focus on the root cause rather than chasing symptoms.
Post-Incident Review Template
Every P1 and P2 incident should result in a post-incident review (often called a postmortem, though many teams prefer the less morbid term). The review serves two purposes: it identifies systemic improvements to prevent recurrence, and it updates the runbook with what the team learned during the incident.
Post-Incident Review Structure
- Summary: One paragraph describing what happened, the duration, and the customer impact. Include the severity level and any SLA implications.
- Timeline: A chronological list of events from the first alert to full resolution. Include timestamps, actions taken, and who took them. Be specific -- "10:23 UTC: On-call engineer received PagerDuty alert for high CPU on checkout-service" is better than "Alert fired."
- Root cause: What specifically caused the incident. Go deep -- "a code change caused a memory leak" is not enough. "PR #4523 introduced a goroutine that allocates a 10 MB buffer per request without releasing it, causing memory to grow by 10 MB per second under load" is a root cause.
- What went well: What worked during the response. Did the runbook help? Did alerting fire promptly? Did the team coordinate effectively?
- What went poorly: What did not work. Was the runbook outdated? Did alerting fire too late? Was the escalation path unclear? Be honest -- this section drives improvement.
- Action items: Specific, assigned, time-bound tasks to prevent recurrence. Each action item should have an owner and a due date. "Improve monitoring" is not an action item. "Add a memory usage alert at 80 percent threshold to the checkout-service dashboard by March 15 (owner: Jane)" is.
Blameless does not mean actionless
Blameless postmortems are about not punishing individuals for honest mistakes. They are not about avoiding accountability for systemic improvements. If the incident happened because a deployment bypassed the staging environment, the action item is to enforce the deployment pipeline, not to blame the engineer who deployed. But there must be an action item. A postmortem without action items is a waste of everyone's time.
Keeping Runbooks Alive
The hardest part of runbooks is maintenance. Here are practices that keep them current.
- Update after every incident. After each post-incident review, update the relevant runbook with any new diagnostic steps, corrected commands, or changed thresholds. Make this an explicit action item in every review.
- Monthly runbook review. Assign a rotating team member to review 2 to 3 runbooks each month. They should verify that dashboard links work, commands execute correctly, and the described architecture matches reality.
- Game days. Quarterly, simulate an incident and have the on-call engineer follow the runbook step by step. Time the resolution. Note where the runbook is unclear or outdated. Game days are the single most effective practice for maintaining runbook quality.
- Runbook as code. Store runbooks alongside your infrastructure code in version control. This makes them searchable, versionable, and reviewable. Some teams use Jupyter notebooks or Backstage TechDocs for runbooks, which allows embedding live queries and automated checks.
Runbooks are a form of institutional knowledge. When your most experienced engineer leaves, the knowledge in their head leaves too. Runbooks are how you capture that knowledge in a form that any team member can execute under pressure. The investment in creating and maintaining them pays back every time an incident happens at 2 AM and the person on call is not your most senior engineer.
Written by CloudToolStack Team
Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.
Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.