Disaster Recovery with Site Recovery

Implement disaster recovery on Azure with Site Recovery, geo-replication, paired regions, Traffic Manager failover, and DR testing.

CloudToolStack Editorial | 26 min read | Published Feb 22, 2026

Prerequisites

Understanding of Azure core services (VMs, Storage, VNet)
Familiarity with Azure regions and availability zones
Understanding of the Azure Well-Architected Framework
Experience with infrastructure as code

Disaster Recovery on Azure

Disaster recovery (DR) is the set of policies, tools, and procedures designed to enable the recovery of critical technology infrastructure and systems following a natural or human-induced disaster. In cloud computing, DR goes beyond simply backing up data. It encompasses the ability to restore entire application stacks, including compute resources, networking, data, and configurations, in a secondary location within acceptable time and data loss parameters.

Azure provides a comprehensive set of services for building disaster recovery solutions, ranging from simple cross-region data replication to fully automated failover of entire application environments. The right DR strategy depends on your application's criticality, your organization's tolerance for downtime and data loss, and the budget available for DR infrastructure.

This guide covers the fundamental concepts of disaster recovery, Azure's DR services (with a deep focus on Azure Site Recovery), and practical implementation patterns for building resilient architectures. We will cover RPO and RTO planning, replication strategies, geo-replication for databases, Traffic Manager integration, DR testing, and cost optimization for DR deployments.

Business Continuity vs Disaster Recovery

Business Continuity (BC) and Disaster Recovery (DR) are related but distinct concepts. Business continuity is the broader practice of ensuring critical business functions continue during and after a disruption. Disaster recovery is a subset of business continuity focused specifically on restoring IT systems and data. An effective DR plan is just one component of a comprehensive business continuity strategy that also includes people, processes, communication plans, and organizational governance.

RPO & RTO Fundamentals

Two metrics define the core requirements of any disaster recovery plan: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Understanding these metrics is essential because they drive every architectural decision in your DR strategy, from how frequently data is replicated to how quickly failover must complete.

Metric	Definition	Answers the Question	Drives Decisions About
RPO (Recovery Point Objective)	Maximum acceptable amount of data loss measured in time	"How much data can we afford to lose?"	Replication frequency, backup schedules, data consistency
RTO (Recovery Time Objective)	Maximum acceptable time to restore service after a disaster	"How long can we be down?"	Failover automation, warm vs cold standby, DNS TTL

DR Tiers

DR strategies can be categorized into tiers based on their RPO/RTO targets and associated costs. Lower RPO and RTO values require more infrastructure investment but provide better protection.

Tier	Strategy	RPO	RTO	Relative Cost
Tier 1	Backup & Restore	Hours to days	Hours to days	$ (lowest)
Tier 2	Pilot Light	Minutes to hours	Hours	$$
Tier 3	Warm Standby	Minutes	Minutes to hours	$$$
Tier 4	Hot Standby / Active-Active	Near zero	Seconds to minutes	$$$$ (highest)
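
The tier table can be read as a lookup from a target RTO to the cheapest tier that is likely to meet it. The sketch below is one way to express that lookup. The function name and the cutoffs (5, 60, and 480 minutes) are illustrative assumptions chosen to match the rough ranges in the table, not Azure guidance.

```shell
# Suggest the cheapest DR tier for a target RTO given in minutes.
# The cutoffs below are hypothetical; tune them to your own business impact analysis.
suggest_dr_tier() {
  if   [ "$1" -le 5 ];   then echo "Tier 4: Hot Standby / Active-Active"
  elif [ "$1" -le 60 ];  then echo "Tier 3: Warm Standby"
  elif [ "$1" -le 480 ]; then echo "Tier 2: Pilot Light"
  else                        echo "Tier 1: Backup & Restore"
  fi
}

suggest_dr_tier 30   # a 30-minute RTO target falls into Tier 3 under these cutoffs
```

In practice RPO should be checked the same way, and the stricter of the two results wins.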

Cost vs Risk Trade-off

DR is fundamentally a business decision, not just a technical one. The cost of DR infrastructure must be weighed against the cost of downtime. A business that loses $100,000 per hour of downtime can justify spending significantly more on DR than one that loses $1,000 per hour. Work with business stakeholders to establish RPO/RTO requirements based on actual business impact analysis before designing your DR architecture.
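
The trade-off can be made concrete with simple arithmetic: the downtime cost that DR avoids in a year, minus what the DR capability costs in a year. The sketch below uses the hourly loss figures from the paragraph above. The outage hours (4 per year without DR, 1 with DR) and the $120,000 annual DR spend are hypothetical inputs chosen only to illustrate the calculation.

```shell
# Net annual benefit of a DR investment, in whole dollars.
# Arguments: hourly downtime cost, expected outage hours per year without DR,
#            expected outage hours per year with DR, annual DR spend.
dr_net_benefit() {
  echo $(( $1 * ($2 - $3) - $4 ))
}

dr_net_benefit 100000 4 1 120000   # prints 180000: the DR spend pays for itself
dr_net_benefit 1000 4 1 120000     # prints -117000: the same spend is hard to justify
```

A negative result does not mean "no DR"; it suggests a cheaper tier is the better fit for that workload.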

Azure Site Recovery Overview

Azure Site Recovery (ASR) is Azure's primary disaster recovery service. It provides automated replication, failover, and recovery for virtual machines and physical servers. ASR continuously replicates VM disks from a primary region to a secondary region, so that recovery VMs can be brought up from the replicated data within minutes when a disaster occurs.

What ASR Replicates

Source	Target	Replication Method
Azure VMs	Another Azure region	Continuous replication via ASR agent
VMware VMs	Azure	Via ASR process server on-premises
Hyper-V VMs	Azure	Via Hyper-V Replica + ASR provider
Physical Servers	Azure	Via ASR mobility agent

Terminal: Set up Azure Site Recovery for Azure VMs

# Create a Recovery Services vault in the DR region
az backup vault create \
  --resource-group rg-dr-westus \
  --name rsv-dr-westus \
  --location westus2

# Note: ASR configuration is primarily done through the portal or PowerShell
# because the Azure CLI has limited ASR support.
# The following uses Azure PowerShell for full ASR configuration:

# PowerShell: Set up ASR for Azure-to-Azure VM replication
# Connect-AzAccount

# Set the vault context
$vault = Get-AzRecoveryServicesVault -Name "rsv-dr-westus" -ResourceGroupName "rg-dr-westus"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

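# The New-AzRecoveryServicesAsr* cmdlets below return an ASR job rather than
# the created object, so wait for each job and then fetch the object by name.
function Wait-AsrJob($job) {
  while ($job.State -in "InProgress", "NotStarted") {
    Start-Sleep -Seconds 10
    $job = Get-AzRecoveryServicesAsrJob -Job $job
  }
  $job
}

# Create ASR fabrics for the source and target regions
Wait-AsrJob (New-AzRecoveryServicesAsrFabric -Azure `
  -Name "asr-fabric-eastus" -Location "eastus") | Out-Null
Wait-AsrJob (New-AzRecoveryServicesAsrFabric -Azure `
  -Name "asr-fabric-westus2" -Location "westus2") | Out-Null
$sourceFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-eastus"
$targetFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-westus2"

# Create protection containers
Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainer `
  -Name "asr-container-eastus" -InputObject $sourceFabric) | Out-Null
Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainer `
  -Name "asr-container-westus2" -InputObject $targetFabric) | Out-Null
$sourceContainer = Get-AzRecoveryServicesAsrProtectionContainer `
  -Fabric $sourceFabric -Name "asr-container-eastus"
$targetContainer = Get-AzRecoveryServicesAsrProtectionContainer `
  -Fabric $targetFabric -Name "asr-container-westus2"

# Create replication policy (24-hour retention, app-consistent snapshot every 4 hours)
Wait-AsrJob (New-AzRecoveryServicesAsrPolicy -AzureToAzure `
  -Name "24-hour-retention" `
  -RecoveryPointRetentionInHours 24 `
  -ApplicationConsistentSnapshotFrequencyInHours 4 `
  -MultiVmSyncStatus Enable) | Out-Null
$policy = Get-AzRecoveryServicesAsrPolicy -Name "24-hour-retention"

# Create container mapping (source -> target)
Wait-AsrJob (New-AzRecoveryServicesAsrProtectionContainerMapping `
  -Name "eastus-to-westus2" `
  -Policy $policy `
  -PrimaryProtectionContainer $sourceContainer `
  -RecoveryProtectionContainer $targetContainer) | Out-Null
$containerMapping = Get-AzRecoveryServicesAsrProtectionContainerMapping `
  -ProtectionContainer $sourceContainer -Name "eastus-to-westus2"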

Enabling Replication for a VM

PowerShell: Enable VM replication

# Get the source VM
$vm = Get-AzVM -ResourceGroupName "rg-app-prod" -Name "vm-webserver-01"

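# Build the replication config for the OS disk
# (the cache storage account passed to -LogStorageAccountId must be in the source region)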
$osDisk = New-AzRecoveryServicesAsrAzureToAzureDiskReplicationConfig `
  -ManagedDisk `
  -LogStorageAccountId "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.Storage/storageAccounts/stasrcache" `
  -DiskId $vm.StorageProfile.OsDisk.ManagedDisk.Id `
  -RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
  -RecoveryReplicaDiskAccountType "Premium_LRS" `
  -RecoveryTargetDiskAccountType "Premium_LRS"

# Enable replication
New-AzRecoveryServicesAsrReplicationProtectedItem `
  -AzureToAzure `
  -AzureVmId $vm.Id `
  -Name "vm-webserver-01-asr" `
  -ProtectionContainerMapping $containerMapping `
  -AzureToAzureDiskReplicationConfiguration @($osDisk) `
  -RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
  -RecoveryAvailabilityZone "1" `
  -RecoveryAzureNetworkId "/subscriptions/<sub-id>/resourceGroups/rg-network-dr/providers/Microsoft.Network/virtualNetworks/vnet-dr" `
  -RecoveryAzureSubnetName "snet-app"

# Monitor replication health
Get-AzRecoveryServicesAsrReplicationProtectedItem `
  -ProtectionContainer $sourceContainer | `
  Select-Object FriendlyName, ProtectionState, ReplicationHealth, `
  TestFailoverState, ActiveLocation

Replication & Recovery Plans

Recovery Plans in Azure Site Recovery define the sequence of steps for failing over an entire application stack. Instead of failing over individual VMs one at a time, a recovery plan groups VMs into ordered groups with dependencies, scripts, and manual actions, ensuring your application recovers in the correct order.

Recovery Plan Design

A well-designed recovery plan reflects your application's startup dependencies. For example, a typical three-tier application should start databases first (Group 1), then application servers (Group 2), and finally web frontends (Group 3). Between groups, you can add automation scripts that perform tasks like updating connection strings, warming caches, or running database migrations.

PowerShell: Create a recovery plan

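# Create a recovery plan containing all protected VMs.
# $dbVm, $appVm1, etc. are replication protected items retrieved beforehand.
# The cmdlet returns an ASR job, not the plan itself.
$planJob = New-AzRecoveryServicesAsrRecoveryPlan `
  -Name "rp-myapp-full" `
  -PrimaryFabric $sourceFabric `
  -RecoveryFabric $targetFabric `
  -ReplicationProtectedItem @($dbVm, $appVm1, $appVm2, $webVm1, $webVm2)

# Fetch the plan object once the creation job has completed
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"

# Edit the plan to define startup order
# Group 1: Database servers (start first)
# Group 2: Application servers (start after DB)
# Group 3: Web servers (start last)
# A new plan places every VM in Group 1. Each -AppendGroup call adds one empty
# group; move VMs between groups with -Group plus -AddProtectedItems or
# -RemoveProtectedItems, then commit with Update-AzRecoveryServicesAsrRecoveryPlan.
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup

# Add a pre-action script to Group 2 (runs before app servers start)
# This script could update connection strings or validate database availability
# NOTE: confirm with Get-Command that the action cmdlet below is available in your
# Az.RecoveryServices version; runbook actions can also be attached in the portal.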
$scriptAction = New-AzRecoveryServicesAsrRecoveryPlanAction `
  -Name "Validate-Database" `
  -RunbookId "/subscriptions/<sub-id>/resourceGroups/rg-automation/providers/Microsoft.Automation/automationAccounts/aa-dr/runbooks/Validate-DatabaseConnection" `
  -FabricSide "Primary" `
  -ActionType "AutomationRunbook"

# Add a post-action to Group 3 (runs after web servers start)
$healthCheck = New-AzRecoveryServicesAsrRecoveryPlanAction `
  -Name "Health-Check" `
  -RunbookId "/subscriptions/<sub-id>/resourceGroups/rg-automation/providers/Microsoft.Automation/automationAccounts/aa-dr/runbooks/Run-HealthCheck" `
  -FabricSide "Primary" `
  -ActionType "AutomationRunbook"

# View recovery plan details
Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full" | `
  Select-Object Name, @{N="Groups"; E={$_.Groups.Count}}, `
  @{N="VMs"; E={($_.Groups | ForEach-Object { $_.ReplicationProtectedItems }).Count}}

Automation Runbooks in Recovery Plans

Azure Automation Runbooks integrated into recovery plans are the key to achieving fully automated failover. Common automation tasks include: updating DNS records, modifying connection strings in App Settings, disabling scheduled jobs in the primary region, notifying operations teams via webhook, running database failover commands, and executing health checks after each group completes. Invest time in building and testing these automations; they are the difference between a 5-minute automated failover and a multi-hour manual recovery process.

Availability Zones & Paired Regions

Azure's physical infrastructure is organized into regions, each containing multiple availability zones. Understanding this geography is fundamental to DR planning because it determines which failure scenarios your architecture can survive and what replication options are available.

Availability Zones

Each Azure region with availability zone support has at least three physically separated zones, each made up of one or more data centers with independent power, cooling, and networking. Deploying across zones protects against data center failures within a single region. Running two or more VM instances across zones qualifies for a 99.99% SLA and is the foundation of high availability within a region.

Paired Regions

Azure pairs most regions with another region in the same geography (typically 300+ miles apart). Paired regions receive prioritized recovery during widespread outages, sequential platform updates (to avoid simultaneous failures), and physical isolation to minimize the chance of a single event affecting both regions.

Primary Region	Paired Region	Geography
East US	West US	United States
East US 2	Central US	United States
North Europe	West Europe	Europe
UK South	UK West	United Kingdom
Southeast Asia	East Asia	Asia Pacific
Australia East	Australia Southeast	Australia
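
When scripting DR setup it is often handy to derive the target region from the source region. The sketch below encodes only the pairs listed in the table above, using the lowercase region names accepted by the Azure CLI. The function name `paired_region` is an assumption for illustration.

```shell
# Look up the paired region for a given Azure region name.
# Covers only the pairs from the table above; anything else is reported as unknown.
paired_region() {
  case "$1" in
    eastus)             echo westus ;;
    westus)             echo eastus ;;
    eastus2)            echo centralus ;;
    centralus)          echo eastus2 ;;
    northeurope)        echo westeurope ;;
    westeurope)         echo northeurope ;;
    uksouth)            echo ukwest ;;
    ukwest)             echo uksouth ;;
    southeastasia)      echo eastasia ;;
    eastasia)           echo southeastasia ;;
    australiaeast)      echo australiasoutheast ;;
    australiasoutheast) echo australiaeast ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

paired_region eastus2   # prints centralus
```

Treat this static table as a convenience only. Pairings can change and newer regions may have no pair, so check the output of `az account list-locations`, which is expected to carry pairing information in each location's metadata, before relying on it.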

Bicep: Zone-redundant deployment

// Deploy VMs across availability zones for intra-region HA
// Assumes `nics` (one network interface per VM) is declared elsewhere in the template
param location string = 'eastus'
param vmCount int = 3

resource availabilityZoneVMs 'Microsoft.Compute/virtualMachines@2023-09-01' = [for i in range(0, vmCount): {
  name: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
  location: location
  zones: [string((i % 3) + 1)] // Distribute across zones 1, 2, 3
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_D4s_v5'
    }
    storageProfile: {
      osDisk: {
        createOption: 'FromImage'
        managedDisk: {
          storageAccountType: 'Premium_ZRS' // Zone-redundant storage
        }
      }
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts-gen2'
        version: 'latest'
      }
    }
    osProfile: {
      computerName: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
      adminUsername: 'azureuser'
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: {
          publicKeys: [
            {
              path: '/home/azureuser/.ssh/authorized_keys'
              keyData: loadTextContent('id_rsa.pub')
            }
          ]
        }
      }
    }
    networkProfile: {
      networkInterfaces: [
        {
          id: nics[i].id
        }
      ]
    }
  }
}]

// Zone-redundant load balancer
// A public frontend inherits its zone configuration from the public IP, so
// `publicIP` (assumed to be declared elsewhere) should be a Standard SKU
// address created with zones: ['1', '2', '3'].
resource loadBalancer 'Microsoft.Network/loadBalancers@2023-06-01' = {
  name: 'lb-web-prod'
  location: location
  sku: {
    name: 'Standard'  // Standard SKU required for zone redundancy
    tier: 'Regional'
  }
  properties: {
    frontendIPConfigurations: [
      {
        name: 'frontend'
        properties: {
          publicIPAddress: {
            id: publicIP.id
          }
        }
      }
    ]
  }
}

Azure Backup Integration

Azure Backup provides simple, secure, and cost-effective solutions for backing up data and recovering it from Azure. While Azure Site Recovery handles live replication for rapid failover, Azure Backup handles point-in-time snapshots for data recovery from accidental deletion, corruption, or ransomware. A complete DR strategy typically uses both services: ASR for infrastructure failover and Azure Backup for data protection.

Backup Scope

Resource Type	Backup Method	Retention
Azure VMs	Snapshot-based (agentless)	Up to 9999 days
Azure SQL Database	Automated (built-in PITR)	7–35 days (PITR), long-term retention available
Azure Files	Share snapshots via Backup	Configurable up to 9999 days
Azure Blobs	Operational/vaulted backup	Configurable, cross-region with vault
SQL Server in VM	Workload-aware backup (agent)	Configurable, log backups every 15 min
SAP HANA in VM	Backint integration	Configurable
Azure Kubernetes Service	Extension-based backup	Configurable

Terminal: Configure Azure Backup for VMs

# Create a Recovery Services vault for backups
az backup vault create \
  --resource-group rg-backup \
  --name rsv-backup-prod \
  --location eastus

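# Create a backup policy
# If the CLI rejects this abbreviated JSON, start from the document printed by
# `az backup policy get-default-for-vm` and edit its schedule and retention fields.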
az backup policy create \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --name policy-vm-daily \
  --backup-management-type AzureIaasVM \
  --policy '{
    "schedulePolicy": {
      "schedulePolicyType": "SimpleSchedulePolicy",
      "scheduleRunFrequency": "Daily",
      "scheduleRunTimes": ["2024-01-01T02:00:00Z"]
    },
    "retentionPolicy": {
      "retentionPolicyType": "LongTermRetentionPolicy",
      "dailySchedule": {
        "retentionTimes": ["2024-01-01T02:00:00Z"],
        "retentionDuration": { "count": 30, "durationType": "Days" }
      },
      "weeklySchedule": {
        "daysOfTheWeek": ["Sunday"],
        "retentionTimes": ["2024-01-01T02:00:00Z"],
        "retentionDuration": { "count": 12, "durationType": "Weeks" }
      },
      "monthlySchedule": {
        "retentionScheduleFormatType": "Weekly",
        "retentionScheduleWeekly": {
          "daysOfTheWeek": ["Sunday"],
          "weeksOfTheMonth": ["First"]
        },
        "retentionTimes": ["2024-01-01T02:00:00Z"],
        "retentionDuration": { "count": 12, "durationType": "Months" }
      }
    }
  }'

# Enable backup for a VM
az backup protection enable-for-vm \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --vm /subscriptions/<sub-id>/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-webserver-01 \
  --policy-name policy-vm-daily

# Trigger an on-demand backup (--retain-until takes a UTC date in dd-mm-yyyy form)
az backup protection backup-now \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --container-name "IaasVMContainerV2;rg-app;vm-webserver-01" \
  --item-name "vm-webserver-01" \
  --retain-until 31-12-2026
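
Hard-coded retention dates go stale quickly. The CLI documents `--retain-until` as a UTC date in d-m-Y form, so a small helper can compute it relative to today. This is a minimal sketch: the function name `retain_until` is an assumption, and it relies on GNU `date` (the `-d` option), which differs on BSD and macOS.

```shell
# Print the UTC date N days from now as dd-mm-yyyy, the form --retain-until expects.
# Requires GNU date; BSD/macOS date uses different flags for relative dates.
retain_until() {
  date -u -d "+$1 days" +%d-%m-%Y
}

retain_until 30   # the date 30 days from today

# Usage with the on-demand backup command (not executed here):
#   az backup protection backup-now ... --retain-until "$(retain_until 30)"
```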

Traffic Manager for DR

Azure Traffic Manager is a DNS-based global traffic distribution service that plays a critical role in disaster recovery architectures. By routing users to the healthy regional deployment based on health probes, Traffic Manager enables automatic failover at the DNS level. When the primary region becomes unhealthy, Traffic Manager stops resolving to that region's endpoints and directs all traffic to the secondary region.

Traffic Manager Routing Methods for DR

Method	How It Works	DR Use Case
Priority	Routes to highest priority endpoint; failover to next on failure	Active/passive DR with clear primary
Performance	Routes to the endpoint with lowest network latency	Active/active with automatic region selection
Geographic	Routes based on the user's geographic location	Data residency compliance + DR
Weighted	Distributes traffic proportionally across endpoints	Gradual failover / canary DR testing
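
The Priority method is the one most DR setups rely on, and its selection rule is easy to model: answer with the healthy endpoint that has the lowest priority number. The sketch below simulates that rule locally over a list of endpoints. The input format and function name are assumptions for illustration, and it does not call Traffic Manager.

```shell
# Pick the endpoint a Priority-routed profile would answer with.
# Reads lines of "<name> <priority> <monitor-status>" on stdin and prints the
# name with the lowest priority number among endpoints whose status is Online.
# Exits non-zero when no endpoint is Online.
select_priority_endpoint() {
  awk '$3 == "Online" && (best == "" || $2 + 0 < min) { min = $2 + 0; best = $1 }
       END { if (best == "") exit 1; print best }'
}

# Primary is degraded, so the secondary is selected
printf '%s\n' \
  "ep-eastus-primary 1 Degraded" \
  "ep-westus-secondary 2 Online" | select_priority_endpoint
```

Once the primary reports Online again, the same rule sends traffic back to it, which is why failback needs as much planning as failover.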

Terminal: Configure Traffic Manager for DR failover

# Create a Traffic Manager profile with priority routing
az network traffic-manager profile create \
  --resource-group rg-dr \
  --name tm-myapp-dr \
  --routing-method Priority \
  --unique-dns-name myapp-global \
  --ttl 60 \
  --protocol HTTPS \
  --port 443 \
  --path "/health" \
  --interval 10 \
  --timeout 5 \
  --max-failures 3

# Add primary endpoint (East US)
az network traffic-manager endpoint create \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-eastus/providers/Microsoft.Web/sites/myapp-eastus \
  --priority 1 \
  --endpoint-status Enabled

# Add secondary endpoint (West US - DR site)
az network traffic-manager endpoint create \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-westus-secondary \
  --type azureEndpoints \
  --target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-westus/providers/Microsoft.Web/sites/myapp-westus \
  --priority 2 \
  --endpoint-status Enabled

# Check Traffic Manager endpoint health
az network traffic-manager endpoint show \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --query '{Name:name, Status:endpointStatus, MonitorStatus:endpointMonitorStatus, Priority:priority}' \
  --output table

# Simulate failover: disable the primary endpoint
az network traffic-manager endpoint update \
  --resource-group rg-dr \
  --profile-name tm-myapp-dr \
  --name ep-eastus-primary \
  --type azureEndpoints \
  --endpoint-status Disabled

DNS TTL and Failover Time

Traffic Manager failover time is affected by the DNS TTL and the health probe settings. With 3 tolerated failures, an endpoint is marked degraded on the fourth consecutive failed probe. With a 60-second TTL, 10-second probe interval, and 5-second probe timeout, the worst-case failover time is therefore roughly 100 seconds: about 40 seconds of failed probes, a few seconds for the last probe to time out, and up to 60 seconds for cached DNS answers to expire. For faster failover, reduce the TTL (minimum 0 seconds, though most resolvers enforce a floor) and the number of tolerated failures. Note that lower TTLs increase DNS query volume, which may slightly increase costs.
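
A rough upper bound on failover time can be computed from the profile settings. This sketch assumes the endpoint is marked degraded on the first failure after the tolerated count is exhausted, and that the final probe may take its full timeout to fail. Both are modelling assumptions, so treat the result as an estimate rather than a guarantee.

```shell
# Estimate worst-case Traffic Manager failover time in seconds.
# Arguments: DNS TTL, probe interval, tolerated failures, probe timeout.
worst_case_failover() {
  ttl=$1; interval=$2; tolerated=$3; timeout=$4
  detection=$(( (tolerated + 1) * interval + timeout ))
  echo $(( detection + ttl ))
}

worst_case_failover 60 10 3 5   # prints 105 for the profile created earlier
worst_case_failover 30 10 1 5   # prints 55 with a shorter TTL and fewer tolerated failures
```

Client-side DNS caching and resolver TTL floors can add to this figure, which is one reason to measure real failover time during drills.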

Database Geo-Replication

Database geo-replication is often the most complex component of a DR strategy because databases contain the application's state and must be consistent after failover. Azure provides built-in geo-replication for most managed database services, but each service has different replication characteristics and failover procedures.

Service	Geo-Replication Type	Failover	Data Loss Risk
Azure SQL Database	Active geo-replication or auto-failover groups	Automatic (failover groups) or manual	Minimal (async, typically < 5 seconds lag)
Azure Cosmos DB	Multi-region writes or single-write with read replicas	Automatic (service-managed) or manual	Configurable via consistency level
Azure Database for PostgreSQL	Read replicas across regions	Manual (promote replica)	Depends on replication lag
Azure Cache for Redis	Passive (Premium) or active (Enterprise)	Manual (Premium) or automatic (Enterprise)	Depends on replication lag
Azure Storage	GRS/GZRS (async cross-region replication)	Manual (initiate account failover)	Up to 15 minutes RPO

Terminal: Configure Azure SQL auto-failover group

# Create a primary SQL server (East US)
az sql server create \
  --resource-group rg-data-eastus \
  --name sql-myapp-eastus \
  --location eastus \
  --admin-user sqladmin \
  --admin-password "<strong-password>"

# Create a secondary SQL server (West US) for DR
az sql server create \
  --resource-group rg-data-westus \
  --name sql-myapp-westus \
  --location westus \
  --admin-user sqladmin \
  --admin-password "<strong-password>"

# Create a database on the primary server
az sql db create \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --name db-myapp \
  --service-objective S3 \
  --backup-storage-redundancy Geo

# Create an auto-failover group
az sql failover-group create \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --partner-server sql-myapp-westus \
  --partner-resource-group rg-data-westus \
  --name fog-myapp \
  --add-db db-myapp \
  --failover-policy Automatic \
  --grace-period 1

# The failover group creates two listener endpoints:
# fog-myapp.database.windows.net (read-write, always points to the primary)
# fog-myapp.secondary.database.windows.net (read-only, always points to the secondary)
# Your application should connect to a listener endpoint rather than a server name

# Test failover (manual trigger)
az sql failover-group set-primary \
  --resource-group rg-data-westus \
  --server sql-myapp-westus \
  --name fog-myapp

# Fail back to original primary
az sql failover-group set-primary \
  --resource-group rg-data-eastus \
  --server sql-myapp-eastus \
  --name fog-myapp
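
Because the listener name stays the same across failovers, the application's connection string can be built once from the failover group name instead of a server name. The sketch below assembles an ADO.NET-style string. The function name is an assumption, and credentials are deliberately left out, since they would normally come from a secret store or managed identity.

```shell
# Build a connection string that targets the failover group's read-write listener.
# Arguments: failover group name, database name. Authentication settings are omitted.
fog_connection_string() {
  printf 'Server=tcp:%s.database.windows.net,1433;Database=%s;Encrypt=True;\n' "$1" "$2"
}

fog_connection_string fog-myapp db-myapp
```

Pairing this with connection retry logic in the application helps it reconnect after the listener moves to the new primary.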

DR Testing & Drills

A disaster recovery plan that has never been tested is not a plan; it is a hope. Regular DR testing validates that your recovery procedures work, identifies gaps in automation, trains your team on the failover process, and provides evidence for compliance auditors. Azure Site Recovery supports test failovers that create isolated recovery VMs without impacting production or replication.

Types of DR Tests

Test Type	Impact	Frequency	Scope
Tabletop Exercise	None (discussion only)	Quarterly	Walk through scenarios, review runbooks
Test Failover (ASR)	None (isolated network)	Monthly	Validate replication, test recovery plans
Planned Failover	Controlled downtime	Twice a year	Full end-to-end failover with DNS switch
Unannounced Drill	Real failover	Annually	Chaos engineering; test team readiness

PowerShell: Execute a test failover

# Create an isolated virtual network for the test
$testVnet = New-AzVirtualNetwork `
  -Name "vnet-dr-test" `
  -ResourceGroupName "rg-dr-testing" `
  -Location "westus2" `
  -AddressPrefix "10.99.0.0/16"

Add-AzVirtualNetworkSubnetConfig `
  -Name "snet-test" `
  -VirtualNetwork $testVnet `
  -AddressPrefix "10.99.1.0/24"

$testVnet | Set-AzVirtualNetwork

# Execute test failover using the recovery plan
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"

Start-AzRecoveryServicesAsrTestFailoverJob `
  -RecoveryPlan $plan `
  -Direction PrimaryToRecovery `
  -AzureVMNetworkId $testVnet.Id

# Monitor test failover progress
Get-AzRecoveryServicesAsrJob | `
  Where-Object { $_.State -eq "InProgress" } | `
  Select-Object Name, State, StartTime, `
  @{N="Duration"; E={(Get-Date) - $_.StartTime}}

# After validation: Clean up test failover resources
Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
  -RecoveryPlan $plan `
  -Comment "DR test completed successfully. All services validated."

DR Test Automation

Automate your DR tests using Azure DevOps Pipelines or Azure Automation. Create a scheduled pipeline that performs a test failover monthly, runs automated validation checks (HTTP health endpoints, database connectivity, data integrity), captures screenshots of the running application, and sends a report to the DR team. This ensures DR is tested consistently without relying on manual scheduling and execution.
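
The validation step of such a pipeline can start as a simple script that probes each health endpoint and fails the run if any of them is unhealthy. This is a minimal sketch: the function names are assumptions, the URLs in the usage comment are hypothetical, and it checks only for HTTP 200 rather than database connectivity or data integrity.

```shell
# Probe a URL and print its HTTP status code (curl reports 000 if the request fails).
probe() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# Check every URL passed as an argument; return non-zero if any is not HTTP 200.
validate_endpoints() {
  failed=0
  for url in "$@"; do
    status=$(probe "$url")
    if [ "$status" = "200" ]; then
      echo "PASS $url"
    else
      echo "FAIL $url (HTTP $status)"
      failed=1
    fi
  done
  return "$failed"
}

# Usage after a test failover (hypothetical test-network URLs, not executed here):
#   validate_endpoints https://10.99.1.4/health https://10.99.1.5/health
```

Returning a non-zero exit code lets the pipeline mark the DR test as failed and skip straight to reporting, while cleanup of the test failover should run regardless of the result.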

Best Practices & Cost Optimization

Building and maintaining a disaster recovery capability requires ongoing investment in infrastructure, automation, and testing. The following best practices help you build effective DR while managing costs.

Architecture Best Practices

Use PaaS services where possible: Platform services like Azure SQL Database, Cosmos DB, and App Service have built-in geo-replication and failover capabilities that are simpler and more reliable than managing DR for IaaS VMs.
Automate everything: Manual DR procedures are slow, error-prone, and stressful during an actual disaster. Invest in automation through recovery plans, runbooks, and infrastructure-as-code so that failover is a button press, not a multi-page runbook.
Design for failback: Failing over to the DR region is only half the challenge. Plan and test the failback process (returning to the primary region) before you need it. Failback is often more complex because data has been modified in the DR region during the outage.
Document everything: Maintain up-to-date DR documentation including architecture diagrams, runbooks, contact lists, escalation procedures, and configuration details. Store documentation in a location accessible during an outage (not solely in the primary region).
Test regularly: Conduct DR tests at least quarterly using a mix of tabletop exercises, test failovers, and planned failovers. Each test should produce a report documenting what worked, what did not, and action items for improvement.

Cost Optimization

Use Azure Site Recovery for VM replication: ASR costs approximately $25/month per replicated VM, which is far cheaper than running idle VMs in a warm standby DR region.
Right-size DR resources: The DR region does not need to match the production region's capacity. Use smaller VM sizes in the DR region and scale up as part of the failover process (at the cost of a slightly longer RTO).
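Plan reserved instances per region: Reserved Instances are scoped to a region (instance size flexibility applies only within the same VM series and region), so a reservation bought for the primary region does not automatically cover VMs started in the DR region during failover. Budget for pay-as-you-go compute during a failover, and buy reservations in the DR region only for capacity that runs there continuously.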
Use Azure Hybrid Benefit: Apply existing Windows Server and SQL Server licenses to DR VMs to reduce compute costs.
Monitor DR costs separately: Tag all DR resources with a consistent tag (e.g., purpose: disaster-recovery) to track DR costs independently and ensure they remain proportional to the risk they mitigate.
Review DR scope regularly: As your application evolves, some components may become less critical while new critical services are added. Review your DR scope quarterly to ensure you are protecting the right things.
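
The first point can be checked with quick arithmetic. The sketch below compares the approximate ASR licence cost of $25 per VM per month with the cost of keeping the same number of VMs running as warm standby. The hourly VM price (20 cents) and the 730 hours per month are hypothetical inputs, and ASR storage and data transfer charges are not modelled.

```shell
# Monthly cost in whole dollars of protecting N VMs with ASR at about $25 per VM.
asr_monthly_cost() {
  echo $(( $1 * 25 ))
}

# Monthly cost in whole dollars of keeping N VMs running as warm standby.
# Arguments: VM count, hourly VM price in cents (hypothetical), hours per month.
standby_monthly_cost() {
  echo $(( $1 * $2 * $3 / 100 ))
}

asr_monthly_cost 10              # prints 250
standby_monthly_cost 10 20 730   # prints 1460
```

Under these assumptions ASR costs a fraction of a running standby, at the price of a longer RTO while recovery VMs start.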

Terminal: Monitor DR health and costs

# List Recovery Services vaults (per-VM replication health comes from
# Get-AzRecoveryServicesAsrReplicationProtectedItem, shown earlier)
az resource list \
  --resource-type "Microsoft.RecoveryServices/vaults" \
  --query "[].{Name:name, ResourceGroup:resourceGroup, Location:location}" \
  --output table

# View backup jobs status
az backup job list \
  --resource-group rg-backup \
  --vault-name rsv-backup-prod \
  --query "[?status=='InProgress' || status=='Failed'].{Job:name, Status:status, StartTime:startTime}" \
  --output table

# Check Traffic Manager endpoint health
az network traffic-manager profile show \
  --resource-group rg-dr \
  --name tm-myapp-dr \
  --query '{Profile:name, Status:profileStatus, Endpoints:endpoints[].{Name:name, Status:endpointStatus, Monitor:endpointMonitorStatus}}' \
  --output json

# Estimate DR costs with resource tags
# (uses the costmanagement CLI extension; confirm the filter syntax with --help)
az costmanagement query \
  --type Usage \
  --timeframe MonthToDate \
  --scope "subscriptions/<sub-id>" \
  --dataset-filter '{"Tags": {"Name": "purpose", "Values": ["disaster-recovery"]}}' \
  --query "properties.rows[].{Service:[0], Cost:[1]}" \
  --output table

Azure Chaos Studio

Azure Chaos Studio is a managed chaos engineering service that helps you test your application's resilience by deliberately injecting faults, such as VM shutdowns, network disconnections, DNS failures, and CPU/memory stress. Use Chaos Studio alongside your DR testing program to validate that your application degrades gracefully under partial failures and that your DR automation triggers correctly when real infrastructure problems occur.

Key Takeaways

1. Azure Site Recovery (ASR) provides continuous replication of VMs to a secondary region.
2. Paired regions ensure data residency compliance and prioritized recovery during outages.
3. Recovery plans automate multi-tier application failover with custom scripts and sequencing.
4. Azure Backup provides application-consistent snapshots for RPOs from hours to minutes.
5. Traffic Manager and Front Door provide DNS-based and Layer 7 failover respectively.
6. Regular DR drills using test failover validate recovery procedures without affecting production.

Frequently Asked Questions

What is Azure Site Recovery?

Azure Site Recovery (ASR) is a disaster recovery service that replicates VMs, physical servers, and workloads to a secondary Azure region. It provides continuous replication with RPO of seconds, automated failover, and recovery plans for multi-tier applications.

What are Azure paired regions?

Paired regions are two Azure regions in the same geography (e.g., East US and West US) that are physically separated and used as the default target for geo-redundant replication such as GRS storage. Azure prioritizes recovery of one region in each pair during widespread outages and rolls out planned platform updates to only one region in a pair at a time.

How does test failover work?

Test failover creates recovery VMs in an isolated network in the DR region without affecting production replication. You validate that applications work correctly, then clean up test resources. This allows regular DR testing without risk to production. Test failovers can be run on a schedule by driving them from Azure Automation or a pipeline.

What is the RPO for Azure Site Recovery?

ASR provides continuous replication with an RPO as low as 30 seconds for Azure VMs. Crash-consistent recovery points are created every 5 minutes. App-consistent recovery points, which also capture in-memory data and in-flight transactions, are generated at a configurable interval; the replication policy in this guide uses every 4 hours.

How much does disaster recovery cost on Azure?

ASR costs ~$25/month per protected VM plus storage costs for replicated data. Azure Backup ranges from $2.50-$10/instance/month depending on the service. Cross-region data transfer costs $0.02-$0.08/GB. Test failovers incur compute costs only for the duration of the test.
