
Disaster Recovery with Site Recovery

Implement disaster recovery on Azure with Site Recovery, geo-replication, paired regions, Traffic Manager failover, and DR testing.

CloudToolStack Team | 26 min read | Published Feb 22, 2026

Disaster Recovery on Azure

Disaster recovery (DR) is the set of policies, tools, and procedures designed to enable the recovery of critical technology infrastructure and systems following a natural or human-induced disaster. In cloud computing, DR goes beyond simply backing up data. It encompasses the ability to restore entire application stacks, including compute resources, networking, data, and configurations, in a secondary location within acceptable time and data loss parameters.

Azure provides a comprehensive set of services for building disaster recovery solutions, ranging from simple cross-region data replication to fully automated failover of entire application environments. The right DR strategy depends on your application's criticality, your organization's tolerance for downtime and data loss, and the budget available for DR infrastructure.

This guide covers the fundamental concepts of disaster recovery, Azure's DR services (with a deep focus on Azure Site Recovery), and practical implementation patterns for building resilient architectures. We will cover RPO and RTO planning, replication strategies, geo-replication for databases, Traffic Manager integration, DR testing, and cost optimization for DR deployments.

Business Continuity vs Disaster Recovery

Business Continuity (BC) and Disaster Recovery (DR) are related but distinct concepts. Business continuity is the broader practice of ensuring critical business functions continue during and after a disruption. Disaster recovery is a subset of business continuity focused specifically on restoring IT systems and data. An effective DR plan is just one component of a comprehensive business continuity strategy that also includes people, processes, communication plans, and organizational governance.

RPO & RTO Fundamentals

Two metrics define the core requirements of any disaster recovery plan: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Understanding these metrics is essential because they drive every architectural decision in your DR strategy, from how frequently data is replicated to how quickly failover must complete.

| Metric | Definition | Answers the Question | Drives Decisions About |
| --- | --- | --- | --- |
| RPO (Recovery Point Objective) | Maximum acceptable amount of data loss measured in time | "How much data can we afford to lose?" | Replication frequency, backup schedules, data consistency |
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service after a disaster | "How long can we be down?" | Failover automation, warm vs cold standby, DNS TTL |
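
To make the RPO row concrete, here is a minimal sketch that estimates worst-case data loss for a periodic backup or replication schedule and compares it with a target RPO. The function names and the sample intervals are illustrative assumptions, not Azure defaults.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta,
                         transfer_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the disaster hits just before the next copy would have landed."""
    return interval + transfer_lag


def meets_rpo(interval: timedelta, rpo: timedelta,
              transfer_lag: timedelta = timedelta(0)) -> bool:
    """True when the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(interval, transfer_lag) <= rpo


# Hypothetical workload with a 4-hour RPO target
rpo_target = timedelta(hours=4)

# Daily backups cannot satisfy a 4-hour RPO
print(meets_rpo(timedelta(hours=24), rpo_target))

# 15-minute log backups with 5 minutes of transfer lag can
print(meets_rpo(timedelta(minutes=15), rpo_target,
                transfer_lag=timedelta(minutes=5)))
```

The same comparison works for RTO if you substitute the measured duration of a test failover for the replication interval.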

DR Tiers

DR strategies can be categorized into tiers based on their RPO/RTO targets and associated costs. Lower RPO and RTO values require more infrastructure investment but provide better protection.

| Tier | Strategy | RPO | RTO | Relative Cost |
| --- | --- | --- | --- | --- |
| Tier 1 | Backup & Restore | Hours to days | Hours to days | $ (lowest) |
| Tier 2 | Pilot Light | Minutes to hours | Hours | $$ |
| Tier 3 | Warm Standby | Minutes | Minutes to hours | $$$ |
| Tier 4 | Hot Standby / Active-Active | Near zero | Seconds to minutes | $$$$ (highest) |

Cost vs Risk Trade-off

DR is fundamentally a business decision, not just a technical one. The cost of DR infrastructure must be weighed against the cost of downtime. A business that loses $100,000 per hour of downtime can justify spending significantly more on DR than one that loses $1,000 per hour. Work with business stakeholders to establish RPO/RTO requirements based on actual business impact analysis before designing your DR architecture.
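
The trade-off can be put into rough numbers. The sketch below computes the annual downtime cost that a DR capability avoids, which acts as a ceiling on what is worth spending on it. The outage hours (8 hours per year without DR, 0.5 hours with DR) are purely hypothetical inputs; in practice they come from your business impact analysis.

```python
def downtime_cost_avoided(cost_per_hour: float,
                          outage_hours_without_dr: float,
                          outage_hours_with_dr: float) -> float:
    """Annual downtime cost avoided by DR; spending more than this is hard to justify."""
    return cost_per_hour * (outage_hours_without_dr - outage_hours_with_dr)


# The two businesses from the paragraph above, with assumed outage hours
for hourly_loss in (100_000, 1_000):
    ceiling = downtime_cost_avoided(hourly_loss, 8, 0.5)
    print(f"${hourly_loss:,}/hour -> DR budget ceiling ${ceiling:,.0f}/year")
```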

Azure Site Recovery Overview

Azure Site Recovery (ASR) is Azure's primary disaster recovery service. It provides automated replication, failover, and recovery for Azure VMs, on-premises VMware and Hyper-V VMs, and physical servers. ASR continuously replicates VM disks from a primary region to a secondary region; replica VMs are created from those disks at failover time and can be running within minutes when a disaster occurs.

What ASR Replicates

| Source | Target | Replication Method |
| --- | --- | --- |
| Azure VMs | Another Azure region | Continuous replication via ASR agent |
| VMware VMs | Azure | Via ASR process server on-premises |
| Hyper-V VMs | Azure | Via Hyper-V Replica + ASR provider |
| Physical Servers | Azure | Via ASR mobility agent |
Terminal: Set up Azure Site Recovery for Azure VMs
# Create a Recovery Services vault in the DR region
az backup vault create \
  --resource-group rg-dr-westus \
  --name rsv-dr-westus \
  --location westus2

# Note: ASR configuration is primarily done through the portal or PowerShell
# because the Azure CLI has limited ASR support.
# The following uses Azure PowerShell for full ASR configuration:

# PowerShell: Set up ASR for Azure-to-Azure VM replication
# Connect-AzAccount

# Set the vault context
$vault = Get-AzRecoveryServicesVault -Name "rsv-dr-westus" -ResourceGroupName "rg-dr-westus"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

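# ASR New-* cmdlets return a job, not the created object. Wait for each job,
# then fetch the object with the matching Get-* cmdlet.
function Wait-AsrJob($job) {
  while ($job.State -in "InProgress", "NotStarted") {
    Start-Sleep -Seconds 10
    $job = Get-AzRecoveryServicesAsrJob -Job $job
  }
}

# Create ASR fabric for source and target regions
Wait-AsrJob (New-AzRecoveryServicesAsrFabric -Azure -Name "asr-fabric-eastus" -Location "eastus")
$sourceFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-eastus"

Wait-AsrJob (New-AzRecoveryServicesAsrFabric -Azure -Name "asr-fabric-westus2" -Location "westus2")
$targetFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-westus2"

# Create protection containers
Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainer -Name "asr-container-eastus" -InputObject $sourceFabric)
$sourceContainer = Get-AzRecoveryServicesAsrProtectionContainer -Name "asr-container-eastus" -Fabric $sourceFabric

Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainer -Name "asr-container-westus2" -InputObject $targetFabric)
$targetContainer = Get-AzRecoveryServicesAsrProtectionContainer -Name "asr-container-westus2" -Fabric $targetFabric

# Create replication policy (-AzureToAzure selects the Azure-to-Azure scenario)
Wait-AsrJob (New-AzRecoveryServicesAsrPolicy `
  -AzureToAzure `
  -Name "24-hour-retention" `
  -RecoveryPointRetentionInHours 24 `
  -ApplicationConsistentSnapshotFrequencyInHours 4 `
  -MultiVmSyncStatus Enable)
$policy = Get-AzRecoveryServicesAsrPolicy -Name "24-hour-retention"

# Create container mapping (source -> target)
Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainerMapping `
  -Name "eastus-to-westus2" `
  -Policy $policy `
  -PrimaryProtectionContainer $sourceContainer `
  -RecoveryProtectionContainer $targetContainer)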

Enabling Replication for a VM

PowerShell: Enable VM replication
# Get the source VM
$vm = Get-AzVM -ResourceGroupName "rg-app-prod" -Name "vm-webserver-01"

# Get the container mapping created during ASR setup
$containerMapping = Get-AzRecoveryServicesAsrProtectionContainerMapping `
  -ProtectionContainer $sourceContainer `
  -Name "eastus-to-westus2"

# Get the disk details
# Note: the cache (log) storage account must be located in the source region
$osDisk = New-AzRecoveryServicesAsrAzureToAzureDiskReplicationConfig `
  -ManagedDisk `
  -LogStorageAccountId "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.Storage/storageAccounts/stasrcache" `
  -DiskId $vm.StorageProfile.OsDisk.ManagedDisk.Id `
  -RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
  -RecoveryReplicaDiskAccountType "Premium_LRS" `
  -RecoveryTargetDiskAccountType "Premium_LRS"

# Enable replication (add one disk config per data disk if the VM has any)
New-AzRecoveryServicesAsrReplicationProtectedItem `
  -AzureToAzure `
  -AzureVmId $vm.Id `
  -Name "vm-webserver-01-asr" `
  -ProtectionContainerMapping $containerMapping `
  -AzureToAzureDiskReplicationConfiguration @($osDisk) `
  -RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
  -RecoveryAvailabilityZone "1" `
  -RecoveryAzureNetworkId "/subscriptions/<sub-id>/resourceGroups/rg-network-dr/providers/Microsoft.Network/virtualNetworks/vnet-dr" `
  -RecoveryAzureSubnetName "snet-app"

# Monitor replication health
Get-AzRecoveryServicesAsrReplicationProtectedItem `
  -ProtectionContainer $sourceContainer | `
  Select-Object FriendlyName, ProtectionState, ReplicationHealth, `
  TestFailoverState, ActiveLocation

Replication & Recovery Plans

Recovery Plans in Azure Site Recovery define the sequence of steps for failing over an entire application stack. Instead of failing over individual VMs one at a time, a recovery plan groups VMs into ordered groups with dependencies, scripts, and manual actions, ensuring your application recovers in the correct order.

Recovery Plan Design

A well-designed recovery plan reflects your application's startup dependencies. For example, a typical three-tier application should start databases first (Group 1), then application servers (Group 2), and finally web frontends (Group 3). Between groups, you can add automation scripts that perform tasks like updating connection strings, warming caches, or running database migrations.
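
One way to derive the group order from a dependency map is a topological sort. The sketch below uses Python's standard `graphlib` module; the VM names and the dependency map are hypothetical, and the output is only a planning aid for deciding which VMs go into which recovery plan group.

```python
from graphlib import TopologicalSorter


def boot_groups(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group VMs so that every VM only depends on VMs in earlier groups."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        groups.append(ready)
        sorter.done(*ready)
    return groups


# Hypothetical three-tier application: each VM maps to the VMs it needs first
dependencies = {
    "db-01": set(),
    "app-01": {"db-01"},
    "app-02": {"db-01"},
    "web-01": {"app-01", "app-02"},
    "web-02": {"app-01", "app-02"},
}

for number, vms in enumerate(boot_groups(dependencies), start=1):
    print(f"Group {number}: {', '.join(vms)}")
```

For this input the result is the database in Group 1, the application servers in Group 2, and the web frontends in Group 3, matching the order described above.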

PowerShell: Create a recovery plan
# Create a recovery plan from the replicated items
# ($dbVm, $appVm1, ... are replication protected items returned by
# Get-AzRecoveryServicesAsrReplicationProtectedItem)
New-AzRecoveryServicesAsrRecoveryPlan `
  -Name "rp-myapp-full" `
  -PrimaryFabric $sourceFabric `
  -RecoveryFabric $targetFabric `
  -ReplicationProtectedItem @($dbVm, $appVm1, $appVm2, $webVm1, $webVm2)

# The cmdlet returns a job, so fetch the plan object once the job has finished
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"

# Edit the plan to define startup order. A new plan has a single start group,
# so append two more:
# Group 1: Database servers (start first)
# Group 2: Application servers (start after DB)
# Group 3: Web servers (start last)
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup

# Move VMs between groups with -Group plus -RemoveProtectedItems / -AddProtectedItems,
# then save the edited plan back to the vault
Update-AzRecoveryServicesAsrRecoveryPlan -InputObject $plan

# Pre-action for Group 2: run the Validate-DatabaseConnection runbook before app servers start
# Post-action for Group 3: run the Run-HealthCheck runbook after web servers start
# Runbook actions are attached in the portal (recovery plan > Customize) or by
# exporting the plan definition to JSON, editing it, and importing it again.
# Check the current Az.RecoveryServices documentation for the JSON schema.
Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full" -Path ".\rp-myapp-full.json"
# ...add the runbook actions to the exported JSON, then:
Update-AzRecoveryServicesAsrRecoveryPlan -Path ".\rp-myapp-full.json"

# View recovery plan details
Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full" | `
  Select-Object Name, @{N="Groups"; E={$_.Groups.Count}}, `
  @{N="VMs"; E={($_.Groups | ForEach-Object { $_.ReplicationProtectedItems }).Count}}

Automation Runbooks in Recovery Plans

Azure Automation Runbooks integrated into recovery plans are the key to achieving fully automated failover. Common automation tasks include: updating DNS records, modifying connection strings in App Settings, disabling scheduled jobs in the primary region, notifying operations teams via webhook, running database failover commands, and executing health checks after each group completes. Invest time in building and testing these automations; they are the difference between a 5-minute automated failover and a multi-hour manual recovery process.

Availability Zones & Paired Regions

Azure's physical infrastructure is organized into regions, each containing multiple availability zones. Understanding this geography is fundamental to DR planning because it determines which failure scenarios your architecture can survive and what replication options are available.

Availability Zones

Each Azure region with availability zone support has at least three physically separate zones, each made up of one or more data centers with independent power, cooling, and networking. Deploying across zones protects against data center failures within a single region. Running two or more VM instances across two or more zones qualifies for a 99.99% VM connectivity SLA, and zone-redundant deployments are the foundation of high availability within a region.

Paired Regions

Azure pairs most regions with another region in the same geography (typically 300+ miles apart). Paired regions receive prioritized recovery during widespread outages, sequential platform updates (to avoid simultaneous failures), and physical isolation to minimize the chance of a single event affecting both regions.

| Primary Region | Paired Region | Geography |
| --- | --- | --- |
| East US | West US | United States |
| East US 2 | Central US | United States |
| North Europe | West Europe | Europe |
| UK South | UK West | United Kingdom |
| Southeast Asia | East Asia | Asia Pacific |
| Australia East | Australia Southeast | Australia |
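Bicep: Zone-redundant deployment
// Deploy VMs across availability zones for intra-region HA
// Note: the `nics` array and the `publicIP` resource referenced below are
// assumed to be declared elsewhere in the same template.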
param location string = 'eastus'
param vmCount int = 3

resource availabilityZoneVMs 'Microsoft.Compute/virtualMachines@2023-09-01' = [for i in range(0, vmCount): {
  name: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
  location: location
  zones: [string((i % 3) + 1)] // Distribute across zones 1, 2, 3
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_D4s_v5'
    }
    storageProfile: {
      osDisk: {
        createOption: 'FromImage'
        managedDisk: {
          storageAccountType: 'Premium_ZRS' // Zone-redundant storage
        }
      }
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts-gen2'
        version: 'latest'
      }
    }
    osProfile: {
      computerName: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
      adminUsername: 'azureuser'
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: {
          publicKeys: [
            {
              path: '/home/azureuser/.ssh/authorized_keys'
              keyData: loadTextContent('id_rsa.pub')
            }
          ]
        }
      }
    }
    networkProfile: {
      networkInterfaces: [
        {
          id: nics[i].id
        }
      ]
    }
  }
}]

// Zone-redundant load balancer
resource loadBalancer 'Microsoft.Network/loadBalancers@2023-06-01' = {
  name: 'lb-web-prod'
  location: location
  sku: {
    name: 'Standard'  // Standard SKU required for zone redundancy
    tier: 'Regional'
  }
  properties: {
    frontendIPConfigurations: [
      {
        name: 'frontend'
        // A public frontend inherits zone redundancy from its public IP, so set
        // zones: ['1', '2', '3'] on the publicIP resource rather than here
        properties: {
          publicIPAddress: {
            id: publicIP.id
          }
        }
      }
    ]
  }
}

Azure Backup Integration

Azure Backup provides simple, secure, and cost-effective solutions for backing up data and recovering it from Azure. While Azure Site Recovery handles live replication for rapid failover, Azure Backup handles point-in-time snapshots for data recovery from accidental deletion, corruption, or ransomware. A complete DR strategy typically uses both services: ASR for infrastructure failover and Azure Backup for data protection.

Backup Scope

| Resource Type | Backup Method | Retention |
| --- | --- | --- |
| Azure VMs | Snapshot-based (agentless) | Up to 9999 days |
| Azure SQL Database | Automated (built-in PITR) | 7–35 days (PITR), long-term retention available |
| Azure Files | Share snapshots via Backup | Configurable up to 9999 days |
| Azure Blobs | Operational/vaulted backup | Configurable, cross-region with vault |
| SQL Server in VM | Workload-aware backup (agent) | Configurable, log backups every 15 min |
| SAP HANA in VM | Backint integration | Configurable |
| Azure Kubernetes Service | Extension-based backup | Configurable |
Terminal: Configure Azure Backup for VMs
# Create a Recovery Services vault for backups
az backup vault create \
  --resource-group rg-backup \
  --name rsv-backup-prod \
  --location eastus

# Create a backup policy
# The CLI expects the same shape that `az backup policy show` returns, with the
# schedule and retention settings nested under "properties". A safe starting
# point is the output of `az backup policy get-default-for-vm`.
az backup policy create \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --name policy-vm-daily \
  --backup-management-type AzureIaasVM \
  --policy '{
    "properties": {
      "backupManagementType": "AzureIaasVM",
      "timeZone": "UTC",
      "schedulePolicy": {
        "schedulePolicyType": "SimpleSchedulePolicy",
        "scheduleRunFrequency": "Daily",
        "scheduleRunTimes": ["2024-01-01T02:00:00Z"]
      },
      "retentionPolicy": {
        "retentionPolicyType": "LongTermRetentionPolicy",
        "dailySchedule": {
          "retentionTimes": ["2024-01-01T02:00:00Z"],
          "retentionDuration": { "count": 30, "durationType": "Days" }
        },
        "weeklySchedule": {
          "daysOfTheWeek": ["Sunday"],
          "retentionTimes": ["2024-01-01T02:00:00Z"],
          "retentionDuration": { "count": 12, "durationType": "Weeks" }
        },
        "monthlySchedule": {
          "retentionScheduleFormatType": "Weekly",
          "retentionScheduleWeekly": {
            "daysOfTheWeek": ["Sunday"],
            "weeksOfTheMonth": ["First"]
          },
          "retentionTimes": ["2024-01-01T02:00:00Z"],
          "retentionDuration": { "count": 12, "durationType": "Months" }
        }
      }
    }
  }'

# Enable backup for a VM
az backup protection enable-for-vm \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --vm /subscriptions/<sub-id>/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-webserver-01 \
  --policy-name policy-vm-daily

# Trigger an on-demand backup
# --retain-until takes a future UTC date in dd-mm-yyyy format
az backup protection backup-now \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --backup-management-type AzureIaasVM \
  --container-name vm-webserver-01 \
  --item-name vm-webserver-01 \
  --retain-until "<dd-mm-yyyy>"

Traffic Manager for DR

Azure Traffic Manager is a DNS-based global traffic distribution service that plays a critical role in disaster recovery architectures. By routing users to the healthy regional deployment based on health probes, Traffic Manager enables automatic failover at the DNS level. When the primary region becomes unhealthy, Traffic Manager stops resolving to that region's endpoints and directs all traffic to the secondary region.

Traffic Manager Routing Methods for DR

| Method | How It Works | DR Use Case |
| --- | --- | --- |
| Priority | Routes to highest priority endpoint; failover to next on failure | Active/passive DR with clear primary |
| Performance | Routes to the endpoint with lowest network latency | Active/active with automatic region selection |
| Geographic | Routes based on the user's geographic location | Data residency compliance + DR |
| Weighted | Distributes traffic proportionally across endpoints | Gradual failover / canary DR testing |
Terminal: Configure Traffic Manager for DR failover
# Create a Traffic Manager profile with priority routing
az network traffic-manager profile create \
  --resource-group rg-dr \
  --name tm-myapp-dr \
  --routing-method Priority \
  --unique-dns-name myapp-global \
  --ttl 60 \
  --protocol HTTPS \
  --port 443 \
  --path "/health" \
  --interval 10 \
  --timeout 5 \
  --max-failures 3

# Add primary endpoint (East US)
az network traffic-manager endpoint create \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-eastus/providers/Microsoft.Web/sites/myapp-eastus \
  --priority 1 \
  --endpoint-status Enabled

# Add secondary endpoint (West US - DR site)
az network traffic-manager endpoint create \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-westus-secondary \
  --type azureEndpoints \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-westus/providers/Microsoft.Web/sites/myapp-westus \
  --priority 2 \
  --endpoint-status Enabled

# Check Traffic Manager endpoint health
az network traffic-manager endpoint show \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --query '{Name:name, Status:endpointStatus, MonitorStatus:endpointMonitorStatus, Priority:priority}' \
  --output table

# Simulate failover: disable the primary endpoint
az network traffic-manager endpoint update \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --endpoint-status Disabled

DNS TTL and Failover Time

Traffic Manager failover time is affected by the DNS TTL and the health probe settings. Traffic Manager marks an endpoint as degraded only once consecutive failures exceed the tolerated count, so with a 60-second TTL, 10-second probe interval, 5-second probe timeout, and 3 tolerated failures, the worst-case failover time is roughly 105 seconds: about 45 seconds for four failed probes (including the last probe's timeout) plus 60 seconds for DNS cache expiry. For faster failover, reduce the TTL (minimum 0 seconds, though most resolvers enforce a floor), the probe interval, or the tolerated failure count. Note that lower TTLs increase DNS query volume, which may slightly increase costs.
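
The arithmetic can be captured in a small helper for comparing probe and TTL settings. This is a rough model, not an official formula: it counts tolerated failures + 1 probes plus the final probe's timeout, which is the conservative reading of Traffic Manager's monitoring behavior, and it ignores resolvers that cache longer than the TTL. The function and parameter names are illustrative.

```python
def worst_case_failover_seconds(ttl: int, probe_interval: int,
                                tolerated_failures: int,
                                probe_timeout: int = 0) -> int:
    """Upper-bound estimate: failure detection time plus DNS cache expiry."""
    # The endpoint is marked degraded once failures exceed the tolerated count
    failed_probes = tolerated_failures + 1
    detection = failed_probes * probe_interval + probe_timeout
    return detection + ttl


# Settings from the profile created above
print(worst_case_failover_seconds(ttl=60, probe_interval=10,
                                  tolerated_failures=3, probe_timeout=5))

# A more aggressive, hypothetical configuration
print(worst_case_failover_seconds(ttl=10, probe_interval=10,
                                  tolerated_failures=1, probe_timeout=5))
```

Comparing the result with your RTO shows quickly whether DNS-level failover alone can meet it.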

Database Geo-Replication

Database geo-replication is often the most complex component of a DR strategy because databases contain the application's state and must be consistent after failover. Azure provides built-in geo-replication for most managed database services, but each service has different replication characteristics and failover procedures.

| Service | Geo-Replication Type | Failover | Data Loss Risk |
| --- | --- | --- | --- |
| Azure SQL Database | Active geo-replication or auto-failover groups | Automatic (failover groups) or manual | Minimal (async, typically < 5 seconds lag) |
| Azure Cosmos DB | Multi-region writes or single-write with read replicas | Automatic (service-managed) or manual | Configurable via consistency level |
| Azure Database for PostgreSQL | Read replicas across regions | Manual (promote replica) | Depends on replication lag |
| Azure Cache for Redis | Passive (Premium) or active (Enterprise) | Manual (Premium) or automatic (Enterprise) | Depends on replication lag |
| Azure Storage | GRS/GZRS (async cross-region replication) | Manual (initiate account failover) | Up to 15 minutes RPO |
Terminal: Configure Azure SQL auto-failover group
# Create a primary SQL server (East US)
az sql server create \
  --resource-group rg-data-eastus \
  --name sql-myapp-eastus \
  --location eastus \
  --admin-user sqladmin \
  --admin-password "<strong-password>"

# Create a secondary SQL server (West US) for DR
az sql server create \
  --resource-group rg-data-westus \
  --name sql-myapp-westus \
  --location westus \
  --admin-user sqladmin \
  --admin-password "<strong-password>"

# Create a database on the primary server
az sql db create \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --name db-myapp \
  --service-objective S3 \
  --backup-storage-redundancy Geo

# Create an auto-failover group
az sql failover-group create \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --partner-server sql-myapp-westus \
  --partner-resource-group rg-data-westus \
  --name fog-myapp \
  --add-db db-myapp \
  --failover-policy Automatic \
  --grace-period 1

# The failover group creates a listener endpoint:
# fog-myapp.database.windows.net (always points to primary)
# fog-myapp.secondary.database.windows.net (always points to secondary)
# Your application should connect to the listener endpoint

# Test failover (manual trigger)
az sql failover-group set-primary \
  --resource-group rg-data-westus \
  --server sql-myapp-westus \
  --name fog-myapp

# Fail back to original primary
az sql failover-group set-primary \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --name fog-myapp
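
On the application side, pointing at the listener rather than an individual server is what makes failover transparent. The following sketch builds an ODBC-style connection string for the failover group created above. The helper name is hypothetical, the driver name assumes ODBC Driver 18 is installed, and authentication settings are deliberately left out.

```python
def sql_connection_string(failover_group: str, database: str,
                          read_only: bool = False) -> str:
    """Build a connection string that targets a failover group listener."""
    if read_only:
        # Read-only listener: always resolves to the current secondary
        host = f"{failover_group}.secondary.database.windows.net"
    else:
        # Read-write listener: always resolves to the current primary
        host = f"{failover_group}.database.windows.net"
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


print(sql_connection_string("fog-myapp", "db-myapp"))
print(sql_connection_string("fog-myapp", "db-myapp", read_only=True))
```

Because the host name never references sql-myapp-eastus or sql-myapp-westus directly, no configuration change is needed after `set-primary` runs.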

DR Testing & Drills

A disaster recovery plan that has never been tested is not a plan; it is a hope. Regular DR testing validates that your recovery procedures work, identifies gaps in automation, trains your team on the failover process, and provides evidence for compliance auditors. Azure Site Recovery supports test failovers that create isolated recovery VMs without impacting production or replication.

Types of DR Tests

| Test Type | Impact | Frequency | Scope |
| --- | --- | --- | --- |
| Tabletop Exercise | None (discussion only) | Quarterly | Walk through scenarios, review runbooks |
| Test Failover (ASR) | None (isolated network) | Monthly | Validate replication, test recovery plans |
| Planned Failover | Controlled downtime | Twice a year | Full end-to-end failover with DNS switch |
| Unannounced Drill | Real failover | Annually | Chaos engineering; test team readiness |
PowerShell: Execute a test failover
# Create an isolated virtual network for the test
$testVnet = New-AzVirtualNetwork `
  -Name "vnet-dr-test" `
  -ResourceGroupName "rg-dr-testing" `
  -Location "westus2" `
  -AddressPrefix "10.99.0.0/16"

Add-AzVirtualNetworkSubnetConfig `
  -Name "snet-test" `
  -VirtualNetwork $testVnet `
  -AddressPrefix "10.99.1.0/24"

$testVnet | Set-AzVirtualNetwork

# Execute test failover using the recovery plan
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"

Start-AzRecoveryServicesAsrTestFailoverJob `
  -RecoveryPlan $plan `
  -Direction PrimaryToRecovery `
  -AzureVMNetworkId $testVnet.Id

# Monitor test failover progress
Get-AzRecoveryServicesAsrJob | `
  Where-Object { $_.State -eq "InProgress" } | `
  Select-Object Name, State, StartTime, `
  @{N="Duration"; E={(Get-Date) - $_.StartTime}}

# After validation: Clean up test failover resources
Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
  -RecoveryPlan $plan `
  -Comment "DR test completed successfully. All services validated."

DR Test Automation

Automate your DR tests using Azure DevOps Pipelines or Azure Automation. Create a scheduled pipeline that performs a test failover monthly, runs automated validation checks (HTTP health endpoints, database connectivity, data integrity), captures screenshots of the running application, and sends a report to the DR team. This ensures DR is tested consistently without relying on manual scheduling and execution.
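
The validation step of such a pipeline might look like the sketch below. It is a minimal outline using only the standard library: each check is a callable returning True or False, and a check that raises is recorded as a failure so one broken probe does not abort the whole report. The check names and the URL in the comment are hypothetical.

```python
from typing import Callable
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):  # connection errors, timeouts, bad URLs
        return False


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run every check; an exception counts as a failed check."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failed_checks(results: dict[str, bool]) -> list[str]:
    """Names of the checks that did not pass, in execution order."""
    return [name for name, passed in results.items() if not passed]


# In a real drill a check would look like:
#   "web-health": lambda: http_ok("https://<test-failover-host>/health")
# Stubbed checks keep this example runnable without network access.
results = run_checks({
    "web-health": lambda: True,
    "db-connectivity": lambda: False,
})
print(failed_checks(results))
```

A pipeline can fail the run when `failed_checks` returns a non-empty list and attach the full results dictionary to the DR report.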

Best Practices & Cost Optimization

Building and maintaining a disaster recovery capability requires ongoing investment in infrastructure, automation, and testing. The following best practices help you build effective DR while managing costs.

Architecture Best Practices

  • Use PaaS services where possible: Platform services like Azure SQL Database and Cosmos DB have built-in geo-replication and failover capabilities, and App Service apps can be deployed to multiple regions behind Traffic Manager. Both approaches are simpler and more reliable than managing DR for IaaS VMs.
  • Automate everything: Manual DR procedures are slow, error-prone, and stressful during an actual disaster. Invest in automation through recovery plans, runbooks, and infrastructure-as-code so that failover is a button press, not a multi-page runbook.
  • Design for failback: Failing over to the DR region is only half the challenge. Plan and test the failback process (returning to the primary region) before you need it. Failback is often more complex because data has been modified in the DR region during the outage.
  • Document everything: Maintain up-to-date DR documentation including architecture diagrams, runbooks, contact lists, escalation procedures, and configuration details. Store documentation in a location accessible during an outage (not solely in the primary region).
  • Test regularly: Conduct DR tests at least quarterly using a mix of tabletop exercises, test failovers, and planned failovers. Each test should produce a report documenting what worked, what did not, and action items for improvement.

Cost Optimization

  • Use Azure Site Recovery for VM replication: ASR costs approximately $25/month per replicated VM, which is far cheaper than running idle VMs in a warm standby DR region.
  • Right-size DR resources: The DR region does not need to match the production region's capacity. Use smaller VM sizes in the DR region and scale up as part of the failover process (at the cost of a slightly longer RTO).
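  • Leverage reservations strategically: Reserved Instances are scoped to a region, so a reservation purchased for the primary region does not automatically cover VMs that start in the DR region after failover (instance size flexibility applies only within the same region and VM series). For DR compute, consider an Azure savings plan for compute, which applies across regions, or exchange the reservation if you expect to run in the DR region for an extended period.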
  • Use Azure Hybrid Benefit: Apply existing Windows Server and SQL Server licenses to DR VMs to reduce compute costs.
  • Monitor DR costs separately: Tag all DR resources with a consistent tag (e.g., purpose: disaster-recovery) to track DR costs independently and ensure they remain proportional to the risk they mitigate.
  • Review DR scope regularly: As your application evolves, some components may become less critical while new critical services are added. Review your DR scope quarterly to ensure you are protecting the right things.
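
To compare the first two options in rough terms, the sketch below estimates monthly cost for ASR replication versus an always-on warm standby. The $25 per VM figure comes from this guide; the storage price per GB and the standby VM price are placeholder assumptions, so substitute current figures from the Azure pricing calculator.

```python
ASR_LICENSE_PER_VM = 25.0  # USD per month, figure quoted in this guide


def asr_monthly_cost(vm_count: int, replicated_gb: float,
                     storage_price_per_gb: float) -> float:
    """ASR licensing plus storage for the replicated disks."""
    return vm_count * ASR_LICENSE_PER_VM + replicated_gb * storage_price_per_gb


def warm_standby_monthly_cost(vm_count: int, vm_price_per_month: float) -> float:
    """Compute cost of keeping standby VMs running all month."""
    return vm_count * vm_price_per_month


# Hypothetical estate: 10 VMs, 2 TB of replicated disks
asr = asr_monthly_cost(vm_count=10, replicated_gb=2000, storage_price_per_gb=0.05)
standby = warm_standby_monthly_cost(vm_count=10, vm_price_per_month=140.0)

print(f"ASR replication: ${asr:,.2f}/month")
print(f"Warm standby:    ${standby:,.2f}/month")
```

The model leaves out cross-region data transfer and the compute you pay for during an actual failover, both of which belong in a fuller estimate.
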
Terminal: Monitor DR health and costs
# List Recovery Services vaults (per-VM replication health comes from
# Get-AzRecoveryServicesAsrReplicationProtectedItem, shown earlier)
az resource list \
  --resource-type "Microsoft.RecoveryServices/vaults" \
  --query "[].{Name:name, ResourceGroup:resourceGroup, Location:location}" \
  --output table

# View backup jobs status
az backup job list \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --query "[?status=='InProgress' || status=='Failed'].{Job:name, Status:status, StartTime:startTime}" \
  --output table

# Check Traffic Manager endpoint health
az network traffic-manager profile show \
  --resource-group rg-dr \
  --name tm-myapp-dr \
  --query '{Profile:name, Status:profileStatus, Endpoints:endpoints[].{Name:name, Status:endpointStatus, Monitor:endpointMonitorStatus}}' \
  --output json

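# Estimate DR costs with resource tags
# (requires the costmanagement CLI extension; the query subcommand and its
# filter schema vary by extension version, so check `az costmanagement query --help`)
az costmanagement query \
  --type Usage \
  --scope "/subscriptions/<sub-id>" \
  --timeframe MonthToDate \
  --dataset-filter '{"tags": {"name": "purpose", "operator": "In", "values": ["disaster-recovery"]}}' \
  --output json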

Azure Chaos Studio

Azure Chaos Studio is a managed chaos engineering service that helps you test your application's resilience by deliberately injecting faults, such as VM shutdowns, network disconnections, DNS failures, and CPU/memory stress. Use Chaos Studio alongside your DR testing program to validate that your application degrades gracefully under partial failures and that your DR automation triggers correctly when real infrastructure problems occur.

Key Takeaways

  1. Azure Site Recovery (ASR) provides continuous replication of VMs to a secondary region.
  2. Paired regions stay within the same geography, which helps with data residency, and receive prioritized recovery during widespread outages.
  3. Recovery plans automate multi-tier application failover with custom scripts and sequencing.
  4. Azure Backup complements ASR with point-in-time recovery, with RPOs ranging from about 15 minutes (log backups for SQL Server in VMs) to a day (daily VM backups).
  5. Traffic Manager and Front Door provide DNS-based and Layer 7 failover respectively.
  6. Regular DR drills using test failover validate recovery procedures without affecting production.

Frequently Asked Questions

What is Azure Site Recovery?
Azure Site Recovery (ASR) is a disaster recovery service that replicates VMs, physical servers, and workloads to a secondary Azure region. It provides continuous replication with RPO of seconds, automated failover, and recovery plans for multi-tier applications.
What are Azure paired regions?
Paired regions are two Azure regions in the same geography (e.g., East US and West US) that are physically separated and that some services use as the default target for geo-redundant replication (for example, GRS storage). Azure prioritizes recovery of one region in each pair during widespread outages and rolls out platform updates to only one region in a pair at a time.
How does test failover work?
Test failover creates recovery VMs in an isolated network in the DR region without affecting production replication. You validate that applications work correctly, then clean up test resources. This allows regular DR testing without risk. Test failovers can be scripted with PowerShell and scheduled through Azure Automation or a pipeline.
What is the RPO for Azure Site Recovery?
ASR provides continuous replication with an RPO as low as 30 seconds for Azure VMs. Crash-consistent recovery points are created every 5 minutes. App-consistent recovery points (capturing in-memory state) are generated at the frequency set in the replication policy, which is every 4 hours in the example policy in this guide.
How much does disaster recovery cost on Azure?
ASR costs ~$25/month per protected VM plus storage costs for replicated data. Azure Backup ranges from $2.50-$10/instance/month depending on the service. Cross-region data transfer costs $0.02-$0.08/GB. Test failovers incur compute costs only for the duration of the test.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.