Disaster Recovery Strategies
Implement disaster recovery on AWS with backup and restore, pilot light, warm standby, and multi-site active/active strategies with RPO/RTO analysis.
Prerequisites
- Understanding of AWS core services (EC2, RDS, S3, VPC)
- Familiarity with multi-region architectures
- Understanding of the AWS Well-Architected Framework
- Experience with infrastructure as code
Understanding Disaster Recovery
Disaster recovery (DR) is the process of preparing for and recovering from events that render your primary infrastructure unavailable. These events range from hardware failures and software bugs to natural disasters and cyberattacks. On AWS, disaster recovery typically means maintaining the ability to recover your applications in a different Availability Zone or Region when the primary location is compromised.
DR is not the same as high availability (HA). High availability focuses on preventing downtime through redundancy within a region (multi-AZ deployments, load balancers, auto scaling). Disaster recovery focuses on recovering from events that affect an entire region or that HA measures cannot handle, such as a region-wide outage, data corruption, or ransomware that encrypts your production data. A well-architected system implements both HA (to handle routine failures) and DR (to handle catastrophic failures).
The right DR strategy is a business decision, not a technical one. It depends on how much downtime you can tolerate (Recovery Time Objective, RTO), how much data loss you can accept (Recovery Point Objective, RPO), and how much you are willing to spend on standby infrastructure. AWS offers four DR strategies along a cost-complexity spectrum, from simple backups ($) to active-active multi-region ($$$$).
Disasters Are Not Just Natural Events
When people think of disaster recovery, they often picture earthquakes or floods. But the most common "disasters" in cloud environments are: accidental data deletion by an engineer, ransomware or security breaches, software bugs that corrupt data, configuration errors that take down services, and (rarely) actual AWS region outages. Your DR strategy should address all of these scenarios, not just infrastructure failures.
RPO & RTO Fundamentals
Every DR strategy is defined by two key metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These are business requirements that drive your technical architecture and spending decisions.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means you can lose up to 1 hour of data; you need backups or replication that is at most 1 hour behind production. An RPO of zero means no data loss is acceptable, which requires synchronous replication.
Recovery Time Objective (RTO) is the maximum acceptable downtime: how long your application can be unavailable before the business impact becomes unacceptable. An RTO of 4 hours means you must restore full functionality within 4 hours of a disaster. An RTO of minutes requires hot standby infrastructure ready to take over immediately.
RPO/RTO by DR Strategy
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours (last backup) | Hours (restore + rebuild) | $ (lowest) | Low |
| Pilot Light | Minutes (continuous replication) | 30-60 minutes (scale up) | $$ | Medium |
| Warm Standby | Minutes (continuous replication) | Minutes (scale up) | $$$ | Medium-High |
| Multi-Site Active/Active | Near zero (synchronous) | Near zero (automatic failover) | $$$$ (highest) | High |
Define RPO/RTO Before Choosing a Strategy
A common mistake is choosing a DR strategy based on what feels right technically rather than what the business actually needs. An e-commerce site processing millions in daily revenue needs a different DR strategy than an internal HR tool used during business hours. Work with business stakeholders to define RPO and RTO for each application, then choose the strategy that meets those requirements at the lowest cost. Over-engineering DR wastes money on standby infrastructure that may never be used.
Backup & Restore Strategy
Backup and restore is the simplest and cheapest DR strategy. You take regular backups of your data and infrastructure configurations, store them in a separate region, and rebuild everything from scratch if a disaster occurs. There is no standby infrastructure running in the DR region; you only pay for backup storage until you need to recover.
This strategy has the highest RPO and RTO because recovery requires restoring data from backups and re-provisioning all infrastructure. RPO is limited by backup frequency (hourly backups mean up to 1 hour of data loss), and RTO can be hours depending on the amount of data to restore and infrastructure to provision.
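To make the backup-frequency limit on RPO concrete, a rough worst-case estimate can be computed from the backup schedule. The numbers below are illustrative assumptions, not measurements:

```bash
# Illustrative worst-case RPO for scheduled backups: data written just
# after a backup window opens is only captured by the NEXT backup, so the
# worst case is roughly the backup interval plus the time a backup takes
# to complete.
backup_interval_min=60       # hourly backups
completion_window_min=120    # a backup may take up to 2 hours to finish
worst_case_rpo_min=$((backup_interval_min + completion_window_min))
echo "Worst-case RPO: ${worst_case_rpo_min} minutes"
```

Note that "hourly backups mean up to 1 hour of data loss" describes the typical case; the worst case also includes the time the backup job itself needs to finish.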
Key AWS Services for Backup
| Service | Backup Mechanism | Cross-Region Support |
|---|---|---|
| RDS | Automated snapshots, manual snapshots | Cross-region snapshot copy |
| DynamoDB | On-demand backup, PITR (Point-in-Time Recovery) | Cross-region backup via AWS Backup |
| EBS | EBS snapshots (stored in S3) | Cross-region snapshot copy |
| S3 | Cross-Region Replication (CRR) | Native cross-region replication |
| EFS | AWS Backup integration | Cross-region replication |
| Aurora | Continuous backup, manual snapshots | Aurora Global Database, cross-region snapshot |
| Redshift | Automated snapshots | Cross-region snapshot copy |
```yaml
# CloudFormation - AWS Backup plan with cross-region copy
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: primary-region-vault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn

  DRBackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: dr-region-vault
      # This vault must be created in the DR region.
      # Use a stack set or a separate stack in the DR region.

  BackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: production-backup-plan
        BackupPlanRule:
          # Hourly backups with 24-hour retention
          - RuleName: HourlyBackups
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 * * * ? *)"
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            Lifecycle:
              DeleteAfterDays: 1
          # Daily backups with 30-day retention + cross-region copy
          - RuleName: DailyBackupsWithCrossRegion
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 3 * * ? *)"
            StartWindowMinutes: 60
            CompletionWindowMinutes: 180
            Lifecycle:
              DeleteAfterDays: 30
            CopyActions:
              - DestinationBackupVaultArn: !Sub "arn:aws:backup:us-west-2:${AWS::AccountId}:backup-vault:dr-region-vault"
                Lifecycle:
                  DeleteAfterDays: 30
          # Weekly backups with 1-year retention
          - RuleName: WeeklyBackups
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 3 ? * SUN *)"
            StartWindowMinutes: 60
            Lifecycle:
              DeleteAfterDays: 365

  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref BackupPlan
      BackupSelection:
        SelectionName: production-resources
        IamRoleArn: !GetAtt BackupRole.Arn
        # Back up all resources tagged for DR
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: backup
            ConditionValue: "true"
        Resources:
          - "arn:aws:rds:*:*:db:*"
          - "arn:aws:dynamodb:*:*:table/*"
          - "arn:aws:ec2:*:*:volume/*"
```
Use AWS Backup for Centralized Management
AWS Backup provides a single console and API to manage backups across RDS, DynamoDB, EBS, EFS, S3, Aurora, and more. Instead of configuring backups individually for each service, define a backup plan with rules and tag your resources for inclusion. AWS Backup handles cross-region copies, retention lifecycle, and compliance reporting. It also supports backup policies through AWS Organizations, ensuring consistent backup practices across all accounts.
Pilot Light Strategy
The pilot light strategy keeps a minimal version of your core infrastructure running in the DR region at all times. Like a furnace pilot light that can ignite the full burner quickly, this minimal footprint can be scaled up to full production capacity when needed. The "core" typically includes database replicas and essential data stores, the hardest and slowest components to rebuild.
In the DR region, you maintain: cross-region database replicas (RDS read replica, Aurora Global Database, DynamoDB Global Tables), core networking (VPC, subnets, security groups), and infrastructure-as-code templates ready to provision compute resources. Compute resources (EC2, ECS, Lambda) are NOT running; they are provisioned only when failover is triggered.
```yaml
# Pilot light infrastructure in DR region
Resources:
  # Core data layer - always running
  DRDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: !Sub "arn:aws:rds:us-east-1:${AWS::AccountId}:db:production-db"
      DBInstanceClass: db.r6g.large  # Can be smaller than production
      MultiAZ: false                 # Single AZ to save cost in standby
      StorageEncrypted: true

  # DynamoDB Global Table (automatically replicated)
  # Note: DynamoDB Global Tables are configured in the primary region
  # and automatically maintain replicas in specified regions

  # VPC and networking - always provisioned
  DRVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.1.0.0/16
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: dr-vpc

  # AMIs and launch templates - always ready
  DRLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: dr-web-server
      LaunchTemplateData:
        ImageId: !Ref LatestAMI
        InstanceType: t3.large
        SecurityGroupIds:
          - !Ref DRWebSecurityGroup
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            # Application bootstrap script
            aws s3 cp s3://deployment-artifacts/latest/app.tar.gz /opt/app/
            tar xzf /opt/app/app.tar.gz -C /opt/app/
            systemctl start myapp

  # Auto Scaling Group - configured but with 0 instances
  DRAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: dr-web-servers
      LaunchTemplate:
        LaunchTemplateId: !Ref DRLaunchTemplate
        Version: !GetAtt DRLaunchTemplate.LatestVersionNumber
      MinSize: 0          # No instances running in standby
      MaxSize: 20
      DesiredCapacity: 0  # Scale up during failover
      VPCZoneIdentifier:
        - !Ref DRPrivateSubnet1
        - !Ref DRPrivateSubnet2
      TargetGroupARNs:
        - !Ref DRTargetGroup
```
Failover Process
When a disaster is declared, the pilot light failover process is: (1) promote the RDS read replica to a standalone database, (2) scale up the Auto Scaling Group from 0 to the desired capacity, (3) update DNS (Route 53) to point to the DR region's load balancer, and (4) verify the application is healthy. This process can be automated with AWS Lambda and Step Functions, triggered by Route 53 health check failures or manual invocation.
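As a sanity check on the 30-60 minute RTO figure, the sequential steps above can simply be summed. The per-step durations below are assumptions for illustration, not measured values:

```bash
# Back-of-the-envelope pilot-light RTO: the failover steps run mostly
# sequentially, so the RTO is roughly the sum of the step durations.
promote_replica_min=10   # promote RDS read replica to standalone
scale_up_min=15          # ASG scale-out from zero + instance health checks
dns_cutover_min=5        # Route 53 change + record TTL expiry
verify_min=10            # smoke tests and verification
total_rto_min=$((promote_replica_min + scale_up_min + dns_cutover_min + verify_min))
echo "Estimated pilot-light RTO: ${total_rto_min} minutes"
```

Measuring each step during drills and replacing these assumptions with real numbers shows exactly where automation would buy the most RTO.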
Warm Standby Strategy
Warm standby extends the pilot light approach by running a scaled-down but fully functional copy of your production environment in the DR region. Unlike pilot light (where compute is at zero), warm standby runs a minimum number of instances that can handle a fraction of production traffic. This reduces RTO because the application is already running; failover only requires scaling up and switching DNS.
In a warm standby, you maintain: full database replicas, a reduced-scale compute fleet (e.g., 2 instances instead of 20), load balancers, and all supporting infrastructure. The warm standby can also serve read-only traffic or non-critical workloads during normal operations, partially offsetting its cost.
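One way to keep the standby both warm and verifiably working is to send it a small slice of live read traffic. A hedged sketch using Route 53 weighted routing; the 90/10 split, hostnames, and DNS names are illustrative, while the two AliasTarget zone IDs are the regional ALB hosted zones for us-east-1 and us-west-2:

```yaml
# Weighted routing: 90% of traffic to primary, 10% to the warm standby
PrimaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZone
    Name: app.example.com
    Type: A
    SetIdentifier: primary
    Weight: 90
    AliasTarget:
      HostedZoneId: Z35SXDOTRQ7X7K   # ALB hosted zone ID for us-east-1
      DNSName: primary-alb.example.com
      EvaluateTargetHealth: true
StandbyRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneId: !Ref HostedZone
    Name: app.example.com
    Type: A
    SetIdentifier: standby
    Weight: 10
    AliasTarget:
      HostedZoneId: Z1H1FL5HABSF5    # ALB hosted zone ID for us-west-2
      DNSName: dr-alb.example.com
      EvaluateTargetHealth: true
```

Setting the standby weight to 0 during normal operations and 100 during failover is an alternative to the UPSERT approach shown in the failover script below.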
```bash
#!/bin/bash
# Warm standby failover script
# Triggered manually or by automated health check failure
set -e

DR_REGION="us-west-2"
PRIMARY_REGION="us-east-1"
HOSTED_ZONE_ID="Z1234567890"
DR_ALB_DNS="dr-alb-123456.us-west-2.elb.amazonaws.com"
DR_ALB_HOSTED_ZONE="Z1H1FL5HABSF5"  # ALB hosted zone ID for us-west-2

echo "=== INITIATING DISASTER RECOVERY FAILOVER ==="
echo "Timestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')"

# Step 1: Promote RDS read replica to standalone primary
echo "Step 1: Promoting RDS read replica..."
aws rds promote-read-replica \
  --db-instance-identifier dr-production-db \
  --region "$DR_REGION"
aws rds wait db-instance-available \
  --db-instance-identifier dr-production-db \
  --region "$DR_REGION"
echo "RDS replica promoted successfully."

# Step 2: Scale up compute
echo "Step 2: Scaling up compute resources..."
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-web-servers \
  --min-size 4 \
  --max-size 20 \
  --desired-capacity 8 \
  --region "$DR_REGION"

# Wait for instances to be healthy
echo "Waiting for instances to register with target group..."
aws elbv2 wait target-in-service \
  --target-group-arn "arn:aws:elasticloadbalancing:$DR_REGION:123456789012:targetgroup/dr-tg/1234567890" \
  --region "$DR_REGION"
echo "Compute scaled and healthy."

# Step 3: Update DNS to point to DR region
echo "Step 3: Updating DNS..."
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "'$DR_ALB_HOSTED_ZONE'",
          "DNSName": "'$DR_ALB_DNS'",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
echo "DNS updated to DR region."

# Step 4: Validate
echo "Step 4: Running health checks..."
for i in {1..5}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://app.example.com/health)
  echo "Health check $i: HTTP $STATUS"
  sleep 10
done

echo "=== FAILOVER COMPLETE ==="
echo "DR region $DR_REGION is now serving production traffic."
```
Test Failover Regularly
An untested disaster recovery plan is not a plan; it is a hope. Schedule DR failover tests at least quarterly. A full test should include: promoting the database replica, scaling up compute, switching DNS, running smoke tests against the DR environment, and then failing back to the primary region. Document any issues found during testing and fix them before the next test. Many organizations discover during their first DR test that their runbook is outdated, IAM permissions are missing, or the DR environment has drifted from production.
Multi-Site Active/Active Strategy
Multi-site active/active is the most resilient (and most expensive) DR strategy. Your application runs at full capacity in two or more regions simultaneously, and traffic is distributed across all regions using Route 53 latency-based or weighted routing. If one region fails, traffic is automatically routed to the remaining healthy regions. There is no failover process; each region is independently capable of handling full production traffic.
Active/active requires your application and data layer to handle multi-region writes. This means using globally replicated databases (DynamoDB Global Tables, Aurora Global Database with write forwarding) and designing your application for eventual consistency. Conflict resolution becomes a design concern: what happens when two users update the same record in different regions simultaneously?
Active/Active Architecture Components
| Component | Multi-Region Approach | Consistency Model |
|---|---|---|
| DNS routing | Route 53 latency-based routing with health checks | N/A |
| CDN | CloudFront with origin groups per region | Cache TTL-based |
| Compute | Independent ECS/EKS/Lambda per region | Stateless |
| Relational DB | Aurora Global Database (write forwarding) | Async replication (typically < 1s lag) |
| NoSQL DB | DynamoDB Global Tables | Eventually consistent (typically < 1s) |
| Cache | Independent ElastiCache per region | Independent (no cross-region sync) |
| Object storage | S3 Cross-Region Replication | Eventually consistent (minutes) |
| Messaging | SQS/SNS per region, EventBridge cross-region | Independent |
```yaml
# Route 53 latency-based routing for active/active
Resources:
  # Health checks for each region
  USEast1HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: us-east-1.internal.example.com
        Port: 443
        ResourcePath: /health
        RequestInterval: 10
        FailureThreshold: 3
        EnableSNI: true

  USWest2HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: us-west-2.internal.example.com
        Port: 443
        ResourcePath: /health
        RequestInterval: 10
        FailureThreshold: 3
        EnableSNI: true

  # Latency-based routing records
  USEast1Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: app.example.com
      Type: A
      SetIdentifier: us-east-1
      Region: us-east-1
      HealthCheckId: !Ref USEast1HealthCheck
      AliasTarget:
        HostedZoneId: Z35SXDOTRQ7X7K  # ALB hosted zone ID for us-east-1
        DNSName: us-east-1-alb.example.com
        EvaluateTargetHealth: true

  USWest2Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: app.example.com
      Type: A
      SetIdentifier: us-west-2
      Region: us-west-2
      HealthCheckId: !Ref USWest2HealthCheck
      AliasTarget:
        HostedZoneId: Z1H1FL5HABSF5  # ALB hosted zone ID for us-west-2
        DNSName: us-west-2-alb.example.com
        EvaluateTargetHealth: true
```
DynamoDB Global Tables for Active/Active
DynamoDB Global Tables automatically replicate data across regions with sub-second latency. Writes to any region are propagated to all other regions. Conflict resolution uses "last writer wins" based on timestamps. For most use cases, this is sufficient. For applications requiring custom conflict resolution (like collaborative document editing), you may need application-level conflict detection using version vectors or CRDTs (Conflict-free Replicated Data Types).
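A toy illustration of the last-writer-wins rule: given two concurrent writes to the same item in different regions, every replica converges on the write carrying the later timestamp. The values and epoch timestamps below are made up for illustration:

```bash
# Toy model of last-writer-wins conflict resolution. Real systems
# compare per-item timestamps; here two regions wrote the same item.
us_east_ts=1700000005; us_east_val="shipping=express"
us_west_ts=1700000009; us_west_val="shipping=standard"
if [ "$us_west_ts" -gt "$us_east_ts" ]; then
  converged="$us_west_val"
else
  converged="$us_east_val"
fi
echo "All regions converge on: $converged"
```

The earlier write is silently discarded, which is exactly why applications that cannot tolerate lost updates need version checks or CRDTs on top of the replication layer.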
DR Strategy Comparison
Choosing the right DR strategy requires balancing cost, complexity, RPO, and RTO. The following comparison helps visualize the trade-offs:
| Factor | Backup & Restore | Pilot Light | Warm Standby | Active/Active |
|---|---|---|---|---|
| RPO | Hours | Minutes | Minutes | Seconds |
| RTO | Hours | 30-60 min | 5-15 min | Near zero |
| Standby cost | ~5% of production | ~15% of production | ~30-50% of production | ~100% of production |
| Failover automation | Mostly manual | Semi-automated | Automated | Automatic |
| Data layer in DR | Backups only | Live replicas | Live replicas | Active writes |
| Compute in DR | Nothing running | Nothing running | Reduced capacity | Full capacity |
| Best for | Non-critical apps, dev/test | Business apps, moderate SLAs | Important apps, strong SLAs | Mission-critical, zero downtime |
Cross-Region Replication Patterns
Regardless of which DR strategy you choose, cross-region data replication is the foundation. Each AWS data service has its own replication mechanism with different consistency guarantees, latency characteristics, and cost models.
Database Replication
```bash
# Create an RDS cross-region read replica
# (encryption for a cross-region replica is configured via --kms-key-id)
aws rds create-db-instance-read-replica \
  --db-instance-identifier dr-production-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:production-db \
  --db-instance-class db.r6g.large \
  --region us-west-2 \
  --kms-key-id "arn:aws:kms:us-west-2:123456789012:key/dr-key-id"

# Create an Aurora Global Database
aws rds create-global-cluster \
  --global-cluster-identifier my-global-cluster \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:production-aurora \
  --region us-east-1

# Add a secondary region to the Aurora Global Database
aws rds create-db-cluster \
  --db-cluster-identifier aurora-dr-cluster \
  --global-cluster-identifier my-global-cluster \
  --engine aurora-postgresql \
  --engine-version 15.4 \
  --region us-west-2
aws rds create-db-instance \
  --db-instance-identifier aurora-dr-instance-1 \
  --db-cluster-identifier aurora-dr-cluster \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql \
  --region us-west-2

# Enable DynamoDB Global Tables (add a replica region)
aws dynamodb update-table \
  --table-name Orders \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]' \
  --region us-east-1
```
S3 Cross-Region Replication
```yaml
Resources:
  SourceBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-app-data-us-east-1
      VersioningConfiguration:
        Status: Enabled  # Required for CRR
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1  # Required when a Filter is specified
            Destination:
              Bucket: !Sub "arn:aws:s3:::my-app-data-us-west-2"
              StorageClass: STANDARD_IA  # Use a cheaper storage class in DR
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15  # S3 Replication Time Control (RTC)
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
            Filter:
              Prefix: ""  # Replicate all objects
            DeleteMarkerReplication:
              Status: Enabled
```
Replication Is Not Backup
Cross-region replication protects against regional failures but does NOT protect against data corruption or accidental deletion. If an engineer accidentally deletes a DynamoDB table, the deletion is replicated to all regions within seconds. You still need point-in-time backups in addition to replication. Enable DynamoDB PITR, RDS automated snapshots, and S3 versioning to protect against data-level disasters.
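For example, DynamoDB point-in-time recovery can be enabled per table in CloudFormation. This is a minimal sketch; the table name and key schema are placeholders:

```yaml
OrdersTable:
  Type: AWS::DynamoDB::Table
  Properties:
    TableName: Orders
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: orderId
        AttributeType: S
    KeySchema:
      - AttributeName: orderId
        KeyType: HASH
    # PITR protects against data-level disasters (bad deploys, accidental
    # deletes) that cross-region replication would faithfully propagate.
    PointInTimeRecoverySpecification:
      PointInTimeRecoveryEnabled: true
```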
AWS Elastic Disaster Recovery
AWS Elastic Disaster Recovery (AWS DRS, formerly CloudEndure Disaster Recovery) is a managed service that provides continuous block-level replication of your source servers to AWS. It supports any server that can run the DRS agent, including on-premises physical servers, VMware VMs, and EC2 instances in another region.
DRS continuously replicates your server disks to a staging area in the target region. When you initiate a recovery, DRS launches EC2 instances from the replicated data, converting the source server configuration (network, instance type, security groups) to the target region. The RPO is typically measured in seconds (the replication lag), and the RTO is minutes (the time to launch EC2 instances from replicated snapshots).
DRS vs Traditional DR Approaches
| Aspect | AWS DRS | AMI-Based DR | Infrastructure-as-Code DR |
|---|---|---|---|
| RPO | Seconds (continuous replication) | Hours (snapshot frequency) | Depends on data replication |
| RTO | Minutes | 30-60 minutes | 30-60 minutes |
| What is replicated | Full server state (OS + data + config) | AMI snapshot (point-in-time) | Infrastructure definition only |
| Ongoing cost | $0.028/server/hour + staging storage | Snapshot storage only | Minimal (templates in S3) |
| Ideal for | Lift-and-shift, legacy apps | Stateless applications | Cloud-native applications |
```bash
# Install the DRS agent on a source server
wget -O aws-replication-installer-init https://aws-elastic-disaster-recovery-us-west-2.s3.us-west-2.amazonaws.com/latest/linux/aws-replication-installer-init
chmod +x aws-replication-installer-init
sudo ./aws-replication-installer-init \
  --region us-west-2 \
  --aws-access-key-id AKIA... \
  --aws-secret-access-key ... \
  --no-prompt

# Initiate a recovery drill (non-destructive test)
aws drs start-recovery \
  --source-servers '[{"sourceServerID": "s-1234567890abcdef0"}]' \
  --is-drill \
  --region us-west-2

# Monitor the recovery job
aws drs describe-jobs \
  --filters '{"jobIDs": ["job-1234567890abcdef0"]}' \
  --region us-west-2
```
DRS for Non-Cloud-Native Workloads
AWS DRS is particularly valuable for legacy applications that cannot be easily re-architected for cloud-native DR. If you have a monolithic application running on EC2 that stores state on local EBS volumes, DRS provides sub-minute RPO without modifying the application. For cloud-native applications (containerized, serverless, using managed databases), infrastructure-as-code with database replication is typically simpler and cheaper than DRS.
Testing & Runbooks
A disaster recovery plan is only as good as its last test. Regular DR testing validates that your failover procedures work, your team knows what to do, and your RTO/RPO objectives can actually be met. Without testing, you are likely to discover problems during a real disaster, exactly when you can least afford them.
DR Testing Types
| Test Type | What It Tests | Impact | Frequency |
|---|---|---|---|
| Tabletop exercise | Process and communication | None (discussion only) | Quarterly |
| Walkthrough drill | Runbook accuracy (execute steps in non-prod) | Low (non-production) | Quarterly |
| Parallel test | DR environment functionality (no traffic switch) | Medium (DR resources used) | Semi-annually |
| Full failover test | Complete failover and failback | High (production traffic moved) | Annually |
| Chaos engineering | Resilience under unexpected conditions | Variable | Ongoing |
DR Runbook Template
A runbook is a step-by-step procedure for executing a failover. Every DR plan should have a documented runbook that is version-controlled, reviewed regularly, and practiced by the on-call team. The runbook should be executable by any engineer on the team, not just the person who wrote it.
```markdown
# Disaster Recovery Runbook - Production Failover

## Prerequisites
- [ ] Access to AWS Console with DR admin role
- [ ] Access to DNS management (Route 53)
- [ ] Communication channels established (Slack #incident, PagerDuty)
- [ ] DR runbook reviewed within last 90 days

## Decision Criteria
- Primary region health check failing for > 5 minutes
- AWS Health Dashboard showing regional issue
- Data corruption detected in primary region
- Approved by: VP of Engineering or on-call incident commander

## Failover Steps
1. [ ] Declare incident in Slack #incident channel
2. [ ] Verify DR region health: `curl https://dr.internal.example.com/health`
3. [ ] Promote RDS read replica: `aws rds promote-read-replica ...`
4. [ ] Wait for DB promotion: `aws rds wait db-instance-available ...`
5. [ ] Scale up compute: `aws autoscaling update-auto-scaling-group ...`
6. [ ] Wait for targets healthy: `aws elbv2 wait target-in-service ...`
7. [ ] Update application config to point to promoted DB
8. [ ] Switch DNS: `aws route53 change-resource-record-sets ...`
9. [ ] Verify application health: run smoke test suite
10. [ ] Monitor error rates for 15 minutes
11. [ ] Communicate status to stakeholders

## Failback Steps (after primary region recovery)
1. [ ] Verify primary region is healthy
2. [ ] Re-establish replication from DR to primary
3. [ ] Wait for replication to catch up
4. [ ] Switch DNS back to primary region
5. [ ] Scale down DR compute
6. [ ] Re-establish primary-to-DR replication
7. [ ] Conduct post-incident review

## Contacts
- On-call: PagerDuty rotation "Platform Engineering"
- Escalation: VP Engineering (Jane Doe)
- AWS Support: Enterprise support case (Severity 1)
```
Automate Everything You Can
Manual failover steps introduce delay and human error during the most stressful moments. Automate as much of your failover procedure as possible using AWS Step Functions, Lambda, and Systems Manager Automation runbooks. A fully automated failover triggered by Route 53 health check failures can achieve an RTO of minutes, while a manual runbook typically takes 30-60 minutes assuming the on-call engineer is awake and available. Keep manual approval gates only for the initial "declare disaster" decision. Once declared, everything else should be automated.
Post-Incident Review
After every DR event (real or test), conduct a blameless post-incident review. Document what happened, what the timeline was, what worked well, what did not work, and what changes are needed. Update the runbook with lessons learned. Common findings include: runbook steps that are outdated, IAM permissions that are missing, configuration drift between primary and DR environments, and undocumented dependencies that broke during failover.
Track your actual RTO and RPO achieved during tests and compare them to your targets. If your target RTO is 15 minutes but your test took 45 minutes, you have a gap that needs addressing, either through automation, pre-provisioning, or resetting expectations with stakeholders.
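A simple way to track this is to log the drill's start and end timestamps and compute the achieved RTO against the target. The timestamps below are from a hypothetical drill; `date -d` as used here is GNU date:

```bash
# Compare achieved RTO from a DR drill against the target RTO.
failover_start="2024-05-01T10:00:00Z"   # disaster declared
failover_end="2024-05-01T10:42:00Z"     # smoke tests passed in DR region
target_rto_min=15

start_s=$(date -u -d "$failover_start" +%s)
end_s=$(date -u -d "$failover_end" +%s)
achieved_rto_min=$(( (end_s - start_s) / 60 ))

echo "Achieved RTO: ${achieved_rto_min} min (target: ${target_rto_min} min)"
if [ "$achieved_rto_min" -gt "$target_rto_min" ]; then
  echo "Gap: $((achieved_rto_min - target_rto_min)) min - automate, pre-provision, or re-scope"
fi
```

Tracking the same numbers across drills shows whether automation work is actually closing the gap.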
Key Takeaways
1. RPO (Recovery Point Objective) defines maximum acceptable data loss; RTO (Recovery Time Objective) defines maximum acceptable downtime.
2. Backup and restore is cheapest but has the highest RTO (hours); multi-site active/active is most expensive but has near-zero RTO.
3. Pilot light keeps core infrastructure running in the DR region at minimal cost.
4. Warm standby runs a scaled-down version of the production environment for faster recovery.
5. AWS Elastic Disaster Recovery (DRS) provides continuous replication with sub-second RPO.
6. Regular DR testing is essential because untested disaster recovery plans will fail when needed.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.