Disaster Recovery Strategies
Implement disaster recovery on AWS with backup and restore, pilot light, warm standby, and multi-site active/active strategies with RPO/RTO analysis.
Prerequisites
- Understanding of AWS core services (EC2, RDS, S3, VPC)
- Familiarity with multi-region architectures
- Understanding of the AWS Well-Architected Framework
- Experience with infrastructure as code
Understanding Disaster Recovery
Disaster recovery (DR) is the process of preparing for and recovering from events that render your primary infrastructure unavailable. These events range from hardware failures and software bugs to natural disasters and cyberattacks. On AWS, disaster recovery typically means maintaining the ability to recover your applications in a different Availability Zone or Region when the primary location is compromised.
DR is not the same as high availability (HA). High availability focuses on preventing downtime through redundancy within a region (multi-AZ deployments, load balancers, auto scaling). Disaster recovery focuses on recovering from events that affect an entire region or that HA measures cannot handle, such as a region-wide outage, data corruption, or ransomware that encrypts your production data. A well-architected system implements both HA (to handle routine failures) and DR (to handle catastrophic failures).
The right DR strategy is a business decision, not a technical one. It depends on how much downtime you can tolerate (Recovery Time Objective, RTO), how much data loss you can accept (Recovery Point Objective, RPO), and how much you are willing to spend on standby infrastructure. AWS offers four DR strategies along a cost-complexity spectrum, from simple backups ($) to active-active multi-region ($$$$).
Disasters Are Not Just Natural Events
When people think of disaster recovery, they often picture earthquakes or floods. But the most common "disasters" in cloud environments are: accidental data deletion by an engineer, ransomware or security breaches, software bugs that corrupt data, configuration errors that take down services, and (rarely) actual AWS region outages. Your DR strategy should address all of these scenarios, not just infrastructure failures.
RPO & RTO Fundamentals
Every DR strategy is defined by two key metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These are business requirements that drive your technical architecture and spending decisions.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means you can lose up to 1 hour of data; you need backups or replication that is at most 1 hour behind production. An RPO of zero means no data loss is acceptable, which requires synchronous replication.
Recovery Time Objective (RTO) is the maximum acceptable downtime: how long your application can be unavailable before the business impact becomes unacceptable. An RTO of 4 hours means you must restore full functionality within 4 hours of a disaster. An RTO of minutes requires hot standby infrastructure ready to take over immediately.
RPO/RTO by DR Strategy
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours (last backup) | Hours (restore + rebuild) | $ (lowest) | Low |
| Pilot Light | Minutes (continuous replication) | 30-60 minutes (scale up) | $$ | Medium |
| Warm Standby | Minutes (continuous replication) | Minutes (scale up) | $$$ | Medium-High |
| Multi-Site Active/Active | Near zero (synchronous) | Near zero (automatic failover) | $$$$ (highest) | High |
Define RPO/RTO Before Choosing a Strategy
A common mistake is choosing a DR strategy based on what feels right technically rather than what the business actually needs. An e-commerce site processing millions in daily revenue needs a different DR strategy than an internal HR tool used during business hours. Work with business stakeholders to define RPO and RTO for each application, then choose the strategy that meets those requirements at the lowest cost. Over-engineering DR wastes money on standby infrastructure that may never be used.
Backup & Restore Strategy
Backup and restore is the simplest and cheapest DR strategy. You take regular backups of your data and infrastructure configurations, store them in a separate region, and rebuild everything from scratch if a disaster occurs. There is no standby infrastructure running in the DR region; you only pay for backup storage until you need to recover.
This strategy has the highest RPO and RTO because recovery requires restoring data from backups and re-provisioning all infrastructure. RPO is limited by backup frequency (hourly backups mean up to 1 hour of data loss), and RTO can be hours depending on the amount of data to restore and infrastructure to provision.
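To make the backup-frequency limit on RPO concrete, a rough worst-case estimate can be computed from the backup schedule. The numbers below are illustrative assumptions, not measurements:

```bash
# Illustrative worst-case RPO for scheduled backups: data written just
# after a backup window opens is only captured by the NEXT backup, so the
# worst case is roughly the backup interval plus the time a backup takes
# to complete.
backup_interval_min=60       # hourly backups
completion_window_min=120    # a backup may take up to 2 hours to finish
worst_case_rpo_min=$((backup_interval_min + completion_window_min))
echo "Worst-case RPO: ${worst_case_rpo_min} minutes"
```

Note that "hourly backups mean up to 1 hour of data loss" describes the typical case; the worst case also includes the time the backup job itself needs to finish.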
Key AWS Services for Backup
| Service | Backup Mechanism | Cross-Region Support |
|---|---|---|
| RDS | Automated snapshots, manual snapshots | Cross-region snapshot copy |
| DynamoDB | On-demand backup, PITR (Point-in-Time Recovery) | Cross-region backup via AWS Backup |
| EBS | EBS snapshots (stored in S3) | Cross-region snapshot copy |
| S3 | Cross-Region Replication (CRR) | Native cross-region replication |
| EFS | AWS Backup integration | Cross-region replication |
| Aurora | Continuous backup, manual snapshots | Aurora Global Database, cross-region snapshot |
| Redshift | Automated snapshots | Cross-region snapshot copy |
```yaml
# CloudFormation - AWS Backup plan with cross-region copy
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: primary-region-vault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn

  DRBackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: dr-region-vault
      # This vault must be created in the DR region.
      # Use a stack set or a separate stack in the DR region.

  BackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: production-backup-plan
        BackupPlanRule:
          # Hourly backups with 24-hour retention
          - RuleName: HourlyBackups
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 * * * ? *)"
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            Lifecycle:
              DeleteAfterDays: 1
          # Daily backups with 30-day retention + cross-region copy
          - RuleName: DailyBackupsWithCrossRegion
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 3 * * ? *)"
            StartWindowMinutes: 60
            CompletionWindowMinutes: 180
            Lifecycle:
              DeleteAfterDays: 30
            CopyActions:
              - DestinationBackupVaultArn: !Sub "arn:aws:backup:us-west-2:${AWS::AccountId}:backup-vault:dr-region-vault"
                Lifecycle:
                  DeleteAfterDays: 30
          # Weekly backups with 1-year retention
          - RuleName: WeeklyBackups
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 3 ? * SUN *)"
            StartWindowMinutes: 60
            Lifecycle:
              DeleteAfterDays: 365

  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref BackupPlan
      BackupSelection:
        SelectionName: production-resources
        IamRoleArn: !GetAtt BackupRole.Arn
        # Back up all resources tagged for DR
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: backup
            ConditionValue: "true"
        Resources:
          - "arn:aws:rds:*:*:db:*"
          - "arn:aws:dynamodb:*:*:table/*"
          - "arn:aws:ec2:*:*:volume/*"
```
Use AWS Backup for Centralized Management
AWS Backup provides a single console and API to manage backups across RDS, DynamoDB, EBS, EFS, S3, Aurora, and more. Instead of configuring backups individually for each service, define a backup plan with rules and tag your resources for inclusion. AWS Backup handles cross-region copies, retention lifecycle, and compliance reporting. It also supports backup policies through AWS Organizations, ensuring consistent backup practices across all accounts.
Pilot Light Strategy
The pilot light strategy keeps a minimal version of your core infrastructure running in the DR region at all times. Like a furnace pilot light that can ignite the full burner quickly, this minimal footprint can be scaled up to full production capacity when needed. The "core" typically includes database replicas and essential data stores, the hardest and slowest components to rebuild.
In the DR region, you maintain: cross-region database replicas (RDS read replica, Aurora Global Database, DynamoDB Global Tables), core networking (VPC, subnets, security groups), and infrastructure-as-code templates ready to provision compute resources. Compute resources (EC2, ECS, Lambda) are NOT running; they are provisioned only when failover is triggered.
```yaml
# Pilot light infrastructure in DR region
Resources:
  # Core data layer - always running
  DRDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: !Sub "arn:aws:rds:us-east-1:${AWS::AccountId}:db:production-db"
      DBInstanceClass: db.r6g.large  # Can be smaller than production
      MultiAZ: false                 # Single AZ to save cost in standby
      StorageEncrypted: true

  # DynamoDB Global Table (automatically replicated)
  # Note: DynamoDB Global Tables are configured in the primary region
  # and automatically maintain replicas in specified regions

  # VPC and networking - always provisioned
  DRVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.1.0.0/16
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: dr-vpc

  # AMIs and launch templates - always ready
  DRLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: dr-web-server
      LaunchTemplateData:
        ImageId: !Ref LatestAMI
        InstanceType: t3.large
        SecurityGroupIds:
          - !Ref DRWebSecurityGroup
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            # Application bootstrap script
            aws s3 cp s3://deployment-artifacts/latest/app.tar.gz /opt/app/
            tar xzf /opt/app/app.tar.gz -C /opt/app/
            systemctl start myapp

  # Auto Scaling Group - configured but with 0 instances
  DRAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: dr-web-servers
      LaunchTemplate:
        LaunchTemplateId: !Ref DRLaunchTemplate
        Version: !GetAtt DRLaunchTemplate.LatestVersionNumber
      MinSize: 0          # No instances running in standby
      MaxSize: 20
      DesiredCapacity: 0  # Scale up during failover
      VPCZoneIdentifier:
        - !Ref DRPrivateSubnet1
        - !Ref DRPrivateSubnet2
      TargetGroupARNs:
        - !Ref DRTargetGroup
```
Failover Process
When a disaster is declared, the pilot light failover process is: (1) promote the RDS read replica to a standalone database, (2) scale up the Auto Scaling Group from 0 to the desired capacity, (3) update DNS (Route 53) to point to the DR region's load balancer, and (4) verify the application is healthy. This process can be automated with AWS Lambda and Step Functions, triggered by Route 53 health check failures or manual invocation.
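As a sanity check on the 30-60 minute RTO figure, the sequential steps above can simply be summed. The per-step durations below are assumptions for illustration, not measured values:

```bash
# Back-of-the-envelope pilot-light RTO: the failover steps run mostly
# sequentially, so the RTO is roughly the sum of the step durations.
promote_replica_min=10   # promote RDS read replica to standalone
scale_up_min=15          # ASG scale-out from zero + instance health checks
dns_cutover_min=5        # Route 53 change + record TTL expiry
verify_min=10            # smoke tests and verification
total_rto_min=$((promote_replica_min + scale_up_min + dns_cutover_min + verify_min))
echo "Estimated pilot-light RTO: ${total_rto_min} minutes"
```

Measuring each step during drills and replacing these assumptions with real numbers shows exactly where automation would buy the most RTO.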
Warm Standby Strategy
Warm standby extends the pilot light approach by running a scaled-down but fully functional copy of your production environment in the DR region. Unlike pilot light (where compute is at zero), warm standby runs a minimum number of instances that can handle a fraction of production traffic. This reduces RTO because the application is already running; failover only requires scaling up and switching DNS.
In a warm standby, you maintain: full database replicas, a reduced-scale compute fleet (e.g., 2 instances instead of 20), load balancers, and all supporting infrastructure. The warm standby can also serve read-only traffic or non-critical workloads during normal operations, partially offsetting its cost.
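One way to keep the standby both warm and verifiably working is to send it a small slice of live read traffic. A hedged sketch using Route 53 weighted routing; the 90/10 split, hostnames, and DNS names are illustrative, while the two AliasTarget zone IDs are the regional ALB hosted zones for us-east-1 and us-west-2:

```yaml
# Weighted routing: 90% of traffic to primary, 10% to the warm standby
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZone
    Name: app.example.com
    Type: A
    SetIdentifier: primary
    Weight: 90
    AliasTarget:
      HostedZoneId: Z35SXDOTRQ7X7K   # ALB hosted zone ID for us-east-1
      DNSName: primary-alb.example.com
      EvaluateTargetHealth: true
StandbyRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZone
    Name: app.example.com
    Type: A
    SetIdentifier: standby
    Weight: 10
    AliasTarget:
      HostedZoneId: Z1H1FL5HABSF5    # ALB hosted zone ID for us-west-2
      DNSName: dr-alb.example.com
      EvaluateTargetHealth: true
```

Setting the standby weight to 0 during normal operations and 100 during failover is an alternative to the UPSERT approach shown in the failover script below.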
```bash
#!/bin/bash
# Warm standby failover script
# Triggered manually or by automated health check failure
set -e

DR_REGION="us-west-2"
PRIMARY_REGION="us-east-1"
HOSTED_ZONE_ID="Z1234567890"
DR_ALB_DNS="dr-alb-123456.us-west-2.elb.amazonaws.com"
DR_ALB_HOSTED_ZONE="Z1H1FL5HABSF5"  # ALB hosted zone ID for us-west-2

echo "=== INITIATING DISASTER RECOVERY FAILOVER ==="
echo "Timestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')"

# Step 1: Promote RDS read replica to standalone primary
echo "Step 1: Promoting RDS read replica..."
aws rds promote-read-replica \
  --db-instance-identifier dr-production-db \
  --region "$DR_REGION"
aws rds wait db-instance-available \
  --db-instance-identifier dr-production-db \
  --region "$DR_REGION"
echo "RDS replica promoted successfully."

# Step 2: Scale up compute
echo "Step 2: Scaling up compute resources..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-web-servers \
  --min-size 4 \
  --max-size 20 \
  --desired-capacity 8 \
  --region "$DR_REGION"

# Wait for instances to be healthy
echo "Waiting for instances to register with target group..."
aws elbv2 wait target-in-service \
  --target-group-arn "arn:aws:elasticloadbalancing:$DR_REGION:123456789012:targetgroup/dr-tg/1234567890" \
  --region "$DR_REGION"
echo "Compute scaled and healthy."

# Step 3: Update DNS to point to DR region
echo "Step 3: Updating DNS..."
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "'$DR_ALB_HOSTED_ZONE'",
          "DNSName": "'$DR_ALB_DNS'",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
echo "DNS updated to DR region."

# Step 4: Validate
echo "Step 4: Running health checks..."
for i in {1..5}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://app.example.com/health)
  echo "Health check $i: HTTP $STATUS"
  sleep 10
done

echo "=== FAILOVER COMPLETE ==="
echo "DR region $DR_REGION is now serving production traffic."
```
Test Failover Regularly
An untested disaster recovery plan is not a plan; it is a hope. Schedule DR failover tests at least quarterly. A full test should include: promoting the database replica, scaling up compute, switching DNS, running smoke tests against the DR environment, and then failing back to the primary region. Document any issues found during testing and fix them before the next test. Many organizations discover during their first DR test that their runbook is outdated, IAM permissions are missing, or the DR environment has drifted from production.
Multi-Site Active/Active Strategy
Multi-site active/active is the most resilient (and most expensive) DR strategy. Your application runs at full capacity in two or more regions simultaneously, and traffic is distributed across all regions using Route 53 latency-based or weighted routing. If one region fails, traffic is automatically routed to the remaining healthy regions. There is no failover process; each region is independently capable of handling full production traffic.
Active/active requires your application and data layer to handle multi-region writes. This means using globally replicated databases (DynamoDB Global Tables, Aurora Global Database with write forwarding) and designing your application for eventual consistency. Conflict resolution becomes a design concern: what happens when two users update the same record in different regions simultaneously?
Active/Active Architecture Components
| Component | Multi-Region Approach | Consistency Model |
|---|---|---|
| DNS routing | Route 53 latency-based routing with health checks | N/A |
| CDN | CloudFront with origin groups per region | Cache TTL-based |
| Compute | Independent ECS/EKS/Lambda per region | Stateless |
| Relational DB | Aurora Global Database (write forwarding) | Async replication (typically < 1s lag) |
| NoSQL DB | DynamoDB Global Tables | Eventually consistent (typically < 1s) |
| Cache | Independent ElastiCache per region | Independent (no cross-region sync) |
| Object storage | S3 Cross-Region Replication | Eventually consistent (minutes) |
| Messaging | SQS/SNS per region, EventBridge cross-region | Independent |
```yaml
# Route 53 latency-based routing for active/active
Resources:
  # Health checks for each region
  USEast1HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: us-east-1.internal.example.com
        Port: 443
        ResourcePath: /health
        RequestInterval: 10
        FailureThreshold: 3
        EnableSNI: true

  USWest2HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: us-west-2.internal.example.com
        Port: 443
        ResourcePath: /health
        RequestInterval: 10
        FailureThreshold: 3
        EnableSNI: true

  # Latency-based routing records
  USEast1Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: app.example.com
      Type: A
      SetIdentifier: us-east-1
      Region: us-east-1
      HealthCheckId: !Ref USEast1HealthCheck
      AliasTarget:
        HostedZoneId: Z35SXDOTRQ7X7K  # ALB hosted zone ID for us-east-1
        DNSName: us-east-1-alb.example.com
        EvaluateTargetHealth: true

  USWest2Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: app.example.com
      Type: A
      SetIdentifier: us-west-2
      Region: us-west-2
      HealthCheckId: !Ref USWest2HealthCheck
      AliasTarget:
        HostedZoneId: Z1H1FL5HABSF5  # ALB hosted zone ID for us-west-2
        DNSName: us-west-2-alb.example.com
        EvaluateTargetHealth: true
```
DynamoDB Global Tables for Active/Active
DynamoDB Global Tables automatically replicate data across regions with sub-second latency. Writes to any region are propagated to all other regions. Conflict resolution uses "last writer wins" based on timestamps. For most use cases, this is sufficient. For applications requiring custom conflict resolution (like collaborative document editing), you may need application-level conflict detection using version vectors or CRDTs (Conflict-free Replicated Data Types).
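A toy illustration of the last-writer-wins rule: given two concurrent writes to the same item in different regions, every replica converges on the write carrying the later timestamp. The values and epoch timestamps below are made up for illustration:

```bash
# Toy model of last-writer-wins conflict resolution. Real systems
# compare per-item timestamps; here two regions wrote the same item.
us_east_ts=1700000005; us_east_val="shipping=express"
us_west_ts=1700000009; us_west_val="shipping=standard"
if [ "$us_west_ts" -gt "$us_east_ts" ]; then
  converged="$us_west_val"
else
  converged="$us_east_val"
fi
echo "All regions converge on: $converged"
```

The earlier write is silently discarded, which is exactly why applications that cannot tolerate lost updates need version checks or CRDTs on top of the replication layer.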
DR Strategy Comparison
Choosing the right DR strategy requires balancing cost, complexity, RPO, and RTO. The following comparison helps visualize the trade-offs:
| Factor | Backup & Restore | Pilot Light | Warm Standby | Active/Active |
|---|---|---|---|---|
| RPO | Hours | Minutes | Minutes | Seconds |
| RTO | Hours | 30-60 min | 5-15 min | Near zero |
| Standby cost | ~5% of production | ~15% of production | ~30-50% of production | ~100% of production |
| Failover automation | Mostly manual | Semi-automated | Automated | Automatic |
| Data layer in DR | Backups only | Live replicas | Live replicas | Active writes |
| Compute in DR | Nothing running | Nothing running | Reduced capacity | Full capacity |
| Best for | Non-critical apps, dev/test | Business apps, moderate SLAs | Important apps, strong SLAs | Mission-critical, zero downtime |
Cross-Region Replication Patterns
Regardless of which DR strategy you choose, cross-region data replication is the foundation. Each AWS data service has its own replication mechanism with different consistency guarantees, latency characteristics, and cost models.
Database Replication
```bash
# Create an RDS cross-region read replica
# (encryption for a cross-region replica is configured via --kms-key-id)
aws rds create-db-instance-read-replica \
  --db-instance-identifier dr-production-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:production-db \
  --db-instance-class db.r6g.large \
  --region us-west-2 \
  --kms-key-id "arn:aws:kms:us-west-2:123456789012:key/dr-key-id"

# Create an Aurora Global Database
aws rds create-global-cluster \
  --global-cluster-identifier my-global-cluster \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:production-aurora \
  --region us-east-1

# Add a secondary region to the Aurora Global Database
aws rds create-db-cluster \
  --db-cluster-identifier aurora-dr-cluster \
  --global-cluster-identifier my-global-cluster \
  --engine aurora-postgresql \
  --engine-version 15.4 \
  --region us-west-2
aws rds create-db-instance \
  --db-instance-identifier aurora-dr-instance-1 \
  --db-cluster-identifier aurora-dr-cluster \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql \
  --region us-west-2

# Enable DynamoDB Global Tables (add a replica region)
aws dynamodb update-table \
  --table-name Orders \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]' \
  --region us-east-1
```
S3 Cross-Region Replication
```yaml
Resources:
  SourceBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-app-data-us-east-1
      VersioningConfiguration:
        Status: Enabled  # Required for CRR
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1  # Required when a Filter is specified
            Destination:
              Bucket: !Sub "arn:aws:s3:::my-app-data-us-west-2"
              StorageClass: STANDARD_IA  # Use a cheaper storage class in DR
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15  # S3 Replication Time Control (RTC)
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
            Filter:
              Prefix: ""  # Replicate all objects
            DeleteMarkerReplication:
              Status: Enabled
```
Replication Is Not Backup
Cross-region replication protects against regional failures but does NOT protect against data corruption or accidental deletion. If an engineer accidentally deletes a DynamoDB table, the deletion is replicated to all regions within seconds. You still need point-in-time backups in addition to replication. Enable DynamoDB PITR, RDS automated snapshots, and S3 versioning to protect against data-level disasters.
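For example, DynamoDB point-in-time recovery can be enabled per table in CloudFormation. This is a minimal sketch; the table name and key schema are placeholders:

```yaml
OrdersTable:
  Type: AWS::DynamoDB::Table
  Properties:
    TableName: Orders
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: orderId
        AttributeType: S
    KeySchema:
      - AttributeName: orderId
        KeyType: HASH
    # PITR protects against data-level disasters (bad deploys, accidental
    # deletes) that cross-region replication would faithfully propagate.
    PointInTimeRecoverySpecification:
      PointInTimeRecoveryEnabled: true
```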
AWS Elastic Disaster Recovery
AWS Elastic Disaster Recovery (AWS DRS, formerly CloudEndure Disaster Recovery) is a managed service that provides continuous block-level replication of your source servers to AWS. It supports any server that can run the DRS agent, including on-premises physical servers, VMware VMs, and EC2 instances in another region.
DRS continuously replicates your server disks to a staging area in the target region. When you initiate a recovery, DRS launches EC2 instances from the replicated data, converting the source server configuration (network, instance type, security groups) to the target region. The RPO is typically measured in seconds (the replication lag), and the RTO is minutes (the time to launch EC2 instances from replicated snapshots).
DRS vs Traditional DR Approaches
| Aspect | AWS DRS | AMI-Based DR | Infrastructure-as-Code DR |
|---|---|---|---|
| RPO | Seconds (continuous replication) | Hours (snapshot frequency) | Depends on data replication |
| RTO | Minutes | 30-60 minutes | 30-60 minutes |
| What is replicated | Full server state (OS + data + config) | AMI snapshot (point-in-time) | Infrastructure definition only |
| Ongoing cost | $0.028/server/hour + staging storage | Snapshot storage only | Minimal (templates in S3) |
| Ideal for | Lift-and-shift, legacy apps | Stateless applications | Cloud-native applications |
```bash
# Install the DRS agent on a source server
wget -O aws-replication-installer-init https://aws-elastic-disaster-recovery-us-west-2.s3.us-west-2.amazonaws.com/latest/linux/aws-replication-installer-init
chmod +x aws-replication-installer-init
sudo ./aws-replication-installer-init \
  --region us-west-2 \
  --aws-access-key-id AKIA... \
  --aws-secret-access-key ... \
  --no-prompt

# Initiate a recovery drill (non-destructive test)
aws drs start-recovery \
  --source-servers '[{"sourceServerID": "s-1234567890abcdef0"}]' \
  --is-drill \
  --region us-west-2

# Monitor the recovery job
aws drs describe-jobs \
  --filters '{"jobIDs": ["job-1234567890abcdef0"]}' \
  --region us-west-2
```
DRS for Non-Cloud-Native Workloads
AWS DRS is particularly valuable for legacy applications that cannot be easily re-architected for cloud-native DR. If you have a monolithic application running on EC2 that stores state on local EBS volumes, DRS provides sub-minute RPO without modifying the application. For cloud-native applications (containerized, serverless, using managed databases), infrastructure-as-code with database replication is typically simpler and cheaper than DRS.
Testing & Runbooks
A disaster recovery plan is only as good as its last test. Regular DR testing validates that your failover procedures work, your team knows what to do, and your RTO/RPO objectives can actually be met. Without testing, you are likely to discover problems during a real disaster, exactly when you can least afford them.
DR Testing Types
| Test Type | What It Tests | Impact | Frequency |
|---|---|---|---|
| Tabletop exercise | Process and communication | None (discussion only) | Quarterly |
| Walkthrough drill | Runbook accuracy (execute steps in non-prod) | Low (non-production) | Quarterly |
| Parallel test | DR environment functionality (no traffic switch) | Medium (DR resources used) | Semi-annually |
| Full failover test | Complete failover and failback | High (production traffic moved) | Annually |
| Chaos engineering | Resilience under unexpected conditions | Variable | Ongoing |
DR Runbook Template
A runbook is a step-by-step procedure for executing a failover. Every DR plan should have a documented runbook that is version-controlled, reviewed regularly, and practiced by the on-call team. The runbook should be executable by any engineer on the team, not just the person who wrote it.
```markdown
# Disaster Recovery Runbook - Production Failover

## Prerequisites
- [ ] Access to AWS Console with DR admin role
- [ ] Access to DNS management (Route 53)
- [ ] Communication channels established (Slack #incident, PagerDuty)
- [ ] DR runbook reviewed within last 90 days

## Decision Criteria
- Primary region health check failing for > 5 minutes
- AWS Health Dashboard showing regional issue
- Data corruption detected in primary region
- Approved by: VP of Engineering or on-call incident commander

## Failover Steps
1. [ ] Declare incident in Slack #incident channel
2. [ ] Verify DR region health: `curl https://dr.internal.example.com/health`
3. [ ] Promote RDS read replica: `aws rds promote-read-replica ...`
4. [ ] Wait for DB promotion: `aws rds wait db-instance-available ...`
5. [ ] Scale up compute: `aws autoscaling update-auto-scaling-group ...`
6. [ ] Wait for targets healthy: `aws elbv2 wait target-in-service ...`
7. [ ] Update application config to point to promoted DB
8. [ ] Switch DNS: `aws route53 change-resource-record-sets ...`
9. [ ] Verify application health: run smoke test suite
10. [ ] Monitor error rates for 15 minutes
11. [ ] Communicate status to stakeholders

## Failback Steps (after primary region recovery)
1. [ ] Verify primary region is healthy
2. [ ] Re-establish replication from DR to primary
3. [ ] Wait for replication to catch up
4. [ ] Switch DNS back to primary region
5. [ ] Scale down DR compute
6. [ ] Re-establish primary-to-DR replication
7. [ ] Conduct post-incident review

## Contacts
- On-call: PagerDuty rotation "Platform Engineering"
- Escalation: VP Engineering (Jane Doe)
- AWS Support: Enterprise support case (Severity 1)
```
Automate Everything You Can
Manual failover steps introduce delay and human error during the most stressful moments. Automate as much of your failover procedure as possible using AWS Step Functions, Lambda, and Systems Manager Automation runbooks. A fully automated failover triggered by Route 53 health check failures can achieve an RTO of minutes, while a manual runbook typically takes 30-60 minutes assuming the on-call engineer is awake and available. Keep manual approval gates only for the initial "declare disaster" decision. Once declared, everything else should be automated.
Post-Incident Review
After every DR event (real or test), conduct a blameless post-incident review. Document what happened, what the timeline was, what worked well, what did not work, and what changes are needed. Update the runbook with lessons learned. Common findings include: runbook steps that are outdated, IAM permissions that are missing, configuration drift between primary and DR environments, and undocumented dependencies that broke during failover.
Track your actual RTO and RPO achieved during tests and compare them to your targets. If your target RTO is 15 minutes but your test took 45 minutes, you have a gap that needs addressing, either through automation, pre-provisioning, or resetting expectations with stakeholders.
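A simple way to track this is to log the drill's start and end timestamps and compute the achieved RTO against the target. The timestamps below are from a hypothetical drill; `date -d` as used here is GNU date:

```bash
# Compare achieved RTO from a DR drill against the target RTO.
failover_start="2024-05-01T10:00:00Z"   # disaster declared
failover_end="2024-05-01T10:42:00Z"     # smoke tests passed in DR region
target_rto_min=15

start_s=$(date -u -d "$failover_start" +%s)
end_s=$(date -u -d "$failover_end" +%s)
achieved_rto_min=$(( (end_s - start_s) / 60 ))

echo "Achieved RTO: ${achieved_rto_min} min (target: ${target_rto_min} min)"
if [ "$achieved_rto_min" -gt "$target_rto_min" ]; then
  echo "Gap: $((achieved_rto_min - target_rto_min)) min - automate, pre-provision, or re-scope"
fi
```

Tracking the same numbers across drills shows whether automation work is actually closing the gap.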
Key Takeaways
1. RPO (Recovery Point Objective) defines maximum acceptable data loss; RTO (Recovery Time Objective) defines maximum acceptable downtime.
2. Backup and restore is cheapest but has the highest RTO (hours); multi-site active/active is most expensive but has near-zero RTO.
3. Pilot light keeps core infrastructure running in the DR region at minimal cost.
4. Warm standby runs a scaled-down version of the production environment for faster recovery.
5. AWS Elastic Disaster Recovery (DRS) provides continuous replication with sub-second RPO.
6. Regular DR testing is essential because untested disaster recovery plans will fail when needed.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.