Disaster Recovery with Site Recovery
Implement disaster recovery on Azure with Site Recovery, geo-replication, paired regions, Traffic Manager failover, and DR testing.
Prerequisites
- Understanding of Azure core services (VMs, Storage, VNet)
- Familiarity with Azure regions and availability zones
- Understanding of the Azure Well-Architected Framework
- Experience with infrastructure as code
Disaster Recovery on Azure
Disaster recovery (DR) is the set of policies, tools, and procedures designed to enable the recovery of critical technology infrastructure and systems following a natural or human-induced disaster. In cloud computing, DR goes beyond simply backing up data. It encompasses the ability to restore entire application stacks, including compute resources, networking, data, and configurations, in a secondary location within acceptable time and data loss parameters.
Azure provides a comprehensive set of services for building disaster recovery solutions, ranging from simple cross-region data replication to fully automated failover of entire application environments. The right DR strategy depends on your application's criticality, your organization's tolerance for downtime and data loss, and the budget available for DR infrastructure.
This guide covers the fundamental concepts of disaster recovery, Azure's DR services (with a deep focus on Azure Site Recovery), and practical implementation patterns for building resilient architectures. We will cover RPO and RTO planning, replication strategies, geo-replication for databases, Traffic Manager integration, DR testing, and cost optimization for DR deployments.
Business Continuity vs Disaster Recovery
Business Continuity (BC) and Disaster Recovery (DR) are related but distinct concepts. Business continuity is the broader practice of ensuring critical business functions continue during and after a disruption. Disaster recovery is a subset of business continuity focused specifically on restoring IT systems and data. An effective DR plan is just one component of a comprehensive business continuity strategy that also includes people, processes, communication plans, and organizational governance.
RPO & RTO Fundamentals
Two metrics define the core requirements of any disaster recovery plan: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Understanding these metrics is essential because they drive every architectural decision in your DR strategy, from how frequently data is replicated to how quickly failover must complete.
| Metric | Definition | Answers the Question | Drives Decisions About |
|---|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable amount of data loss measured in time | "How much data can we afford to lose?" | Replication frequency, backup schedules, data consistency |
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service after a disaster | "How long can we be down?" | Failover automation, warm vs cold standby, DNS TTL |
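To make the two metrics concrete, the following minimal sketch computes the RPO and RTO actually achieved in a single incident and compares them with the targets. The function names and the timeline values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo_rto(last_recovery_point, outage_start, service_restored):
    """Return (data-loss window, downtime) for one incident."""
    rpo = outage_start - last_recovery_point   # data written in this window is lost
    rto = service_restored - outage_start      # time the service was unavailable
    return rpo, rto


def meets_targets(achieved, target_rpo, target_rto):
    """True when both achieved values are within their objectives."""
    rpo, rto = achieved
    return rpo <= target_rpo and rto <= target_rto


# Hypothetical drill timeline (illustrative values only).
last_point = datetime(2024, 1, 1, 11, 55)
outage = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 40)

achieved = achieved_rpo_rto(last_point, outage, restored)
print(achieved)
print(meets_targets(achieved, timedelta(minutes=15), timedelta(hours=1)))
```

Recording these two numbers after every drill gives you a trend line to compare against the targets agreed with the business.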
DR Tiers
DR strategies can be categorized into tiers based on their RPO/RTO targets and associated costs. Lower RPO and RTO values require more infrastructure investment but provide better protection.
| Tier | Strategy | RPO | RTO | Relative Cost |
|---|---|---|---|---|
| Tier 1 | Backup & Restore | Hours to days | Hours to days | $ (lowest) |
| Tier 2 | Pilot Light | Minutes to hours | Hours | $$ |
| Tier 3 | Warm Standby | Minutes | Minutes to hours | $$$ |
| Tier 4 | Hot Standby / Active-Active | Near zero | Seconds to minutes | $$$$ (highest) |
Cost vs Risk Trade-off
DR is fundamentally a business decision, not just a technical one. The cost of DR infrastructure must be weighed against the cost of downtime. A business that loses $100,000 per hour of downtime can justify spending significantly more on DR than one that loses $1,000 per hour. Work with business stakeholders to establish RPO/RTO requirements based on actual business impact analysis before designing your DR architecture.
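The trade-off can be sketched as simple arithmetic. The example below reuses the $100,000 and $1,000 per-hour figures from the paragraph above; the outage hours and the annual DR spend are assumptions made up for illustration, not benchmarks.

```python
def expected_annual_downtime_cost(loss_per_hour, outage_hours_per_year):
    """Expected yearly loss from downtime."""
    return loss_per_hour * outage_hours_per_year


def dr_is_justified(loss_per_hour, hours_without_dr, hours_with_dr, annual_dr_cost):
    """True when the downtime cost avoided by DR exceeds what DR costs."""
    avoided = expected_annual_downtime_cost(
        loss_per_hour, hours_without_dr - hours_with_dr
    )
    return avoided > annual_dr_cost


# Assumed inputs: 8 outage hours per year without DR, 0.5 with DR,
# and a DR budget of $120,000 per year.
print(dr_is_justified(100_000, 8, 0.5, 120_000))  # high-impact business
print(dr_is_justified(1_000, 8, 0.5, 120_000))    # low-impact business
```

With these assumed inputs the same DR budget pays for itself in the first case and not in the second, which is why the business impact analysis has to come first.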
Azure Site Recovery Overview
Azure Site Recovery (ASR) is Azure's primary disaster recovery service. It provides automated replication, failover, and recovery for Azure virtual machines, on-premises VMware and Hyper-V virtual machines, and physical servers. For Azure VMs, ASR continuously replicates disk writes from a primary region to a secondary region. The recovery VMs are created from the replicated disks only at failover, so they can be brought online within minutes when a disaster occurs without paying for idle compute in the meantime.
What ASR Replicates
| Source | Target | Replication Method |
|---|---|---|
| Azure VMs | Another Azure region | Continuous replication via ASR agent |
| VMware VMs | Azure | Via ASR process server on-premises |
| Hyper-V VMs | Azure | Via Hyper-V Replica + ASR provider |
| Physical Servers | Azure | Via ASR mobility agent |
# Create a Recovery Services vault in the DR region
az backup vault create \
--resource-group rg-dr-westus \
--name rsv-dr-westus \
--location westus2
# Note: ASR configuration is primarily done through the portal or PowerShell
# because the Azure CLI has limited ASR support.
# The following uses Azure PowerShell for full ASR configuration:
# PowerShell: Set up ASR for Azure-to-Azure VM replication
# Connect-AzAccount
# Set the vault context
$vault = Get-AzRecoveryServicesVault -Name "rsv-dr-westus" -ResourceGroupName "rg-dr-westus"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault
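# Most New-AzRecoveryServicesAsr* cmdlets return a job rather than the created object,
# so wait for each job to finish and then fetch the object with the matching Get- cmdlet
function Wait-AsrJob($job) {
    while ($job.State -in "InProgress", "NotStarted") {
        Start-Sleep -Seconds 10
        $job = Get-AzRecoveryServicesAsrJob -Job $job
    }
}
# Create ASR fabric for source and target regions
$job = New-AzRecoveryServicesAsrFabric `
-Name "asr-fabric-eastus" `
-Azure `
-Location "eastus"
Wait-AsrJob $job
$sourceFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-eastus"
$job = New-AzRecoveryServicesAsrFabric `
-Name "asr-fabric-westus2" `
-Azure `
-Location "westus2"
Wait-AsrJob $job
$targetFabric = Get-AzRecoveryServicesAsrFabric -Name "asr-fabric-westus2"
# Create protection containers
$job = New-AzRecoveryServicesAsrProtectionContainer `
-Name "asr-container-eastus" `
-InputObject $sourceFabric
Wait-AsrJob $job
$sourceContainer = Get-AzRecoveryServicesAsrProtectionContainer -Fabric $sourceFabric -Name "asr-container-eastus"
$job = New-AzRecoveryServicesAsrProtectionContainer `
-Name "asr-container-westus2" `
-InputObject $targetFabric
Wait-AsrJob $job
$targetContainer = Get-AzRecoveryServicesAsrProtectionContainer -Fabric $targetFabric -Name "asr-container-westus2"
# Create replication policy (Azure-to-Azure uses the -AzureToAzure switch)
$job = New-AzRecoveryServicesAsrPolicy `
-AzureToAzure `
-Name "24-hour-retention" `
-RecoveryPointRetentionInHours 24 `
-ApplicationConsistentSnapshotFrequencyInHours 4 `
-MultiVmSyncStatus Enable
Wait-AsrJob $job
$policy = Get-AzRecoveryServicesAsrPolicy -Name "24-hour-retention"
# Create container mapping (source -> target)
$job = New-AzRecoveryServicesAsrProtectionContainerMapping `
-Name "eastus-to-westus2" `
-Policy $policy `
-PrimaryProtectionContainer $sourceContainer `
-RecoveryProtectionContainer $targetContainer
Wait-AsrJob $job
$containerMapping = Get-AzRecoveryServicesAsrProtectionContainerMapping `
-ProtectionContainer $sourceContainer -Name "eastus-to-westus2"
Enabling Replication for a VM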
# Get the source VM
$vm = Get-AzVM -ResourceGroupName "rg-app-prod" -Name "vm-webserver-01"
# Get the disk details
# Note: the cache (log) storage account must be in the same region as the source VM (eastus here)
$osDisk = New-AzRecoveryServicesAsrAzureToAzureDiskReplicationConfig `
-ManagedDisk `
-LogStorageAccountId "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.Storage/storageAccounts/stasrcache" `
-DiskId $vm.StorageProfile.OsDisk.ManagedDisk.Id `
-RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
-RecoveryReplicaDiskAccountType "Premium_LRS" `
-RecoveryTargetDiskAccountType "Premium_LRS"
# Enable replication
New-AzRecoveryServicesAsrReplicationProtectedItem `
-AzureToAzure `
-AzureVmId $vm.Id `
-Name "vm-webserver-01-asr" `
-ProtectionContainerMapping $containerMapping `
-AzureToAzureDiskReplicationConfiguration @($osDisk) `
-RecoveryResourceGroupId "/subscriptions/<sub-id>/resourceGroups/rg-app-dr" `
-RecoveryAvailabilityZone "1" `
-RecoveryAzureNetworkId "/subscriptions/<sub-id>/resourceGroups/rg-network-dr/providers/Microsoft.Network/virtualNetworks/vnet-dr" `
-RecoveryAzureSubnetName "snet-app"
# Monitor replication health
Get-AzRecoveryServicesAsrReplicationProtectedItem `
-ProtectionContainer $sourceContainer | `
Select-Object FriendlyName, ProtectionState, ReplicationHealth, `
TestFailoverState, ActiveLocation
Replication & Recovery Plans
Recovery Plans in Azure Site Recovery define the sequence of steps for failing over an entire application stack. Instead of failing over individual VMs one at a time, a recovery plan groups VMs into ordered groups with dependencies, scripts, and manual actions, ensuring your application recovers in the correct order.
Recovery Plan Design
A well-designed recovery plan reflects your application's startup dependencies. For example, a typical three-tier application should start databases first (Group 1), then application servers (Group 2), and finally web frontends (Group 3). Between groups, you can add automation scripts that perform tasks like updating connection strings, warming caches, or running database migrations.
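The ordering logic can be sketched independently of ASR. The example below derives startup groups from tier dependencies with Python's standard `graphlib` module (Python 3.9+). The tier names and the `startup_groups` helper are illustrative assumptions, and this is not how Site Recovery computes groups internally; in ASR you assign VMs to groups yourself.

```python
from graphlib import TopologicalSorter

# Each tier maps to the set of tiers that must be running before it starts.
dependencies = {
    "database": set(),
    "application": {"database"},
    "web": {"application"},
}


def startup_groups(deps):
    """Return a list of groups; every tier in a group can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tiers whose dependencies are done
        groups.append(ready)
        sorter.done(*ready)
    return groups


groups = startup_groups(dependencies)
for number, tiers in enumerate(groups, start=1):
    print(f"Group {number}: {', '.join(tiers)}")
```

Tiers that do not depend on each other end up in the same group, which mirrors how VMs inside one recovery plan group are started in parallel.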
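# Create a recovery plan (every protected item starts in the first boot group).
# $dbVm, $appVm1, etc. are assumed to be ASR replication protected items returned by
# Get-AzRecoveryServicesAsrReplicationProtectedItem, not Azure VM objects.
New-AzRecoveryServicesAsrRecoveryPlan `
-Name "rp-myapp-full" `
-PrimaryFabric $sourceFabric `
-RecoveryFabric $targetFabric `
-ReplicationProtectedItem @($dbVm, $appVm1, $appVm2, $webVm1, $webVm2)
# The cmdlet returns a job; once it completes, fetch the plan object
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"
# Edit the plan to define startup order
# Group 1: Database servers (start first)
# Group 2: Application servers (start after DB)
# Group 3: Web servers (start last)
# Append two boot groups, then move the application and web items into them
# with -Group plus -RemoveProtectedItems / -AddProtectedItems
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup
$plan = Edit-AzRecoveryServicesAsrRecoveryPlan -RecoveryPlan $plan -AppendGroup
# Edits are local until the plan is saved
Update-AzRecoveryServicesAsrRecoveryPlan -InputObject $plan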
# Pre-action for Group 2 (runs before app servers start): validate database availability
# or update connection strings. Runbook:
#   /subscriptions/<sub-id>/resourceGroups/rg-automation/providers/Microsoft.Automation/automationAccounts/aa-dr/runbooks/Validate-DatabaseConnection
# Post-action for Group 3 (runs after web servers start): health check. Runbook:
#   /subscriptions/<sub-id>/resourceGroups/rg-automation/providers/Microsoft.Automation/automationAccounts/aa-dr/runbooks/Run-HealthCheck
# To our knowledge the Az.RecoveryServices module has no dedicated cmdlet for adding
# actions to a group, so add them in the portal (Recovery plan > Customize) or export
# the plan definition, add the actions to the JSON, and import it again.
# The file path below is only an example.
Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full" -Path "C:\dr\rp-myapp-full.json"
# ... edit the exported JSON to add the two runbook actions ...
Update-AzRecoveryServicesAsrRecoveryPlan -Path "C:\dr\rp-myapp-full.json"
# View recovery plan details
Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full" | `
Select-Object Name, @{N="Groups"; E={$_.Groups.Count}}, `
@{N="VMs"; E={($_.Groups | ForEach-Object { $_.ReplicationProtectedItems }).Count}}
Automation Runbooks in Recovery Plans
Azure Automation Runbooks integrated into recovery plans are the key to achieving fully automated failover. Common automation tasks include: updating DNS records, modifying connection strings in App Settings, disabling scheduled jobs in the primary region, notifying operations teams via webhook, running database failover commands, and executing health checks after each group completes. Invest time in building and testing these automations; they are the difference between a 5-minute automated failover and a multi-hour manual recovery process.
Availability Zones & Paired Regions
Azure's physical infrastructure is organized into regions, each containing multiple availability zones. Understanding this geography is fundamental to DR planning because it determines which failure scenarios your architecture can survive and what replication options are available.
Availability Zones
Each Azure region with availability zone support has at least three physically separate zones, each made up of one or more data centers with independent power, cooling, and networking. Deploying across zones protects against data center failures within a single region. Running two or more VM instances across two or more zones qualifies for a 99.99% connectivity SLA, which makes zonal deployment the foundation of high availability within a region.
Paired Regions
Azure pairs most regions with another region in the same geography (typically 300+ miles apart). Paired regions receive prioritized recovery during widespread outages, sequential platform updates (to avoid simultaneous failures), and physical isolation to minimize the chance of a single event affecting both regions.
| Primary Region | Paired Region | Geography |
|---|---|---|
| East US | West US | United States |
| East US 2 | Central US | United States |
| North Europe | West Europe | Europe |
| UK South | UK West | United Kingdom |
| Southeast Asia | East Asia | Asia Pacific |
| Australia East | Australia Southeast | Australia |
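When DR tooling needs to pick a target region automatically, the pairs in the table can be held in a small lookup. The sketch below covers only the six pairs listed above; the `paired_region` helper is a hypothetical name, and a real script should consult Azure's current region list rather than a hard-coded table.

```python
# Region pairs taken from the table above (not an exhaustive list).
PAIRS = [
    ("East US", "West US"),
    ("East US 2", "Central US"),
    ("North Europe", "West Europe"),
    ("UK South", "UK West"),
    ("Southeast Asia", "East Asia"),
    ("Australia East", "Australia Southeast"),
]

# Build a lookup that works in both directions.
PAIRED_REGION = {}
for first, second in PAIRS:
    PAIRED_REGION[first] = second
    PAIRED_REGION[second] = first


def paired_region(region):
    """Return the paired region, or None if the region is not in the table."""
    return PAIRED_REGION.get(region)


print(paired_region("East US 2"))
print(paired_region("UK West"))
```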
// Deploy VMs across availability zones for intra-region HA
// (the nics array and the publicIP resource are assumed to be declared elsewhere in the template)
param location string = 'eastus'
param vmCount int = 3
resource availabilityZoneVMs 'Microsoft.Compute/virtualMachines@2023-09-01' = [for i in range(0, vmCount): {
name: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
location: location
zones: [string((i % 3) + 1)] // Distribute across zones 1, 2, 3
properties: {
hardwareProfile: {
vmSize: 'Standard_D4s_v5'
}
storageProfile: {
osDisk: {
createOption: 'FromImage'
managedDisk: {
storageAccountType: 'Premium_ZRS' // Zone-redundant storage
}
}
imageReference: {
publisher: 'Canonical'
offer: '0001-com-ubuntu-server-jammy'
sku: '22_04-lts-gen2'
version: 'latest'
}
}
osProfile: {
computerName: 'vm-web-${padLeft(string(i + 1), 2, '0')}'
adminUsername: 'azureuser'
linuxConfiguration: {
disablePasswordAuthentication: true
ssh: {
publicKeys: [
{
path: '/home/azureuser/.ssh/authorized_keys'
keyData: loadTextContent('id_rsa.pub')
}
]
}
}
}
networkProfile: {
networkInterfaces: [
{
id: nics[i].id
}
]
}
}
}]
// Zone-redundant load balancer
resource loadBalancer 'Microsoft.Network/loadBalancers@2023-06-01' = {
name: 'lb-web-prod'
location: location
sku: {
name: 'Standard' // Standard SKU required for zone redundancy
tier: 'Regional'
}
properties: {
frontendIPConfigurations: [
{
name: 'frontend'
// A public frontend inherits its zones from the public IP, so publicIP should be a Standard SKU address created with zones 1, 2, and 3
properties: {
publicIPAddress: {
id: publicIP.id
}
}
}
]
}
}
Azure Backup Integration
Azure Backup provides simple, secure, and cost-effective solutions for backing up data and recovering it from Azure. While Azure Site Recovery handles live replication for rapid failover, Azure Backup handles point-in-time snapshots for data recovery from accidental deletion, corruption, or ransomware. A complete DR strategy typically uses both services: ASR for infrastructure failover and Azure Backup for data protection.
Backup Scope
| Resource Type | Backup Method | Retention |
|---|---|---|
| Azure VMs | Snapshot-based (agentless) | Up to 9999 days |
| Azure SQL Database | Automated (built-in PITR) | 1–35 days (PITR, default 7), long-term retention available |
| Azure Files | Share snapshots via Backup | Configurable up to 9999 days |
| Azure Blobs | Operational/vaulted backup | Configurable, cross-region with vault |
| SQL Server in VM | Workload-aware backup (agent) | Configurable, log backups every 15 min |
| SAP HANA in VM | Backint integration | Configurable |
| Azure Kubernetes Service | Extension-based backup | Configurable |
# Create a Recovery Services vault for backups
az backup vault create \
--resource-group rg-backup \
--name rsv-backup-prod \
--location eastus
# Create a backup policy
az backup policy create \
--resource-group rg-backup \
--vault-name rsv-backup-prod \
--name policy-vm-daily \
--backup-management-type AzureIaasVM \
--policy '{
"schedulePolicy": {
"schedulePolicyType": "SimpleSchedulePolicy",
"scheduleRunFrequency": "Daily",
"scheduleRunTimes": ["2024-01-01T02:00:00Z"]
},
"retentionPolicy": {
"retentionPolicyType": "LongTermRetentionPolicy",
"dailySchedule": {
"retentionTimes": ["2024-01-01T02:00:00Z"],
"retentionDuration": { "count": 30, "durationType": "Days" }
},
"weeklySchedule": {
"daysOfTheWeek": ["Sunday"],
"retentionTimes": ["2024-01-01T02:00:00Z"],
"retentionDuration": { "count": 12, "durationType": "Weeks" }
},
"monthlySchedule": {
"retentionScheduleFormatType": "Weekly",
"retentionScheduleWeekly": {
"daysOfTheWeek": ["Sunday"],
"weeksOfTheMonth": ["First"]
},
"retentionTimes": ["2024-01-01T02:00:00Z"],
"retentionDuration": { "count": 12, "durationType": "Months" }
}
}
}'
# Enable backup for a VM
az backup protection enable-for-vm \
--resource-group rg-backup \
--vault-name rsv-backup-prod \
--vm /subscriptions/<sub-id>/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-webserver-01 \
--policy-name policy-vm-daily
# Trigger an on-demand backup
az backup protection backup-now \
--resource-group rg-backup \
--vault-name rsv-backup-prod \
--container-name "IaasVMContainerV2;rg-app;vm-webserver-01" \
--item-name "vm-webserver-01" \
--retain-until 15-02-2024
Traffic Manager for DR
Azure Traffic Manager is a DNS-based global traffic distribution service that plays a critical role in disaster recovery architectures. By routing users to the healthy regional deployment based on health probes, Traffic Manager enables automatic failover at the DNS level. When the primary region becomes unhealthy, Traffic Manager stops resolving to that region's endpoints and directs all traffic to the secondary region.
Traffic Manager Routing Methods for DR
| Method | How It Works | DR Use Case |
|---|---|---|
| Priority | Routes to highest priority endpoint; failover to next on failure | Active/passive DR with clear primary |
| Performance | Routes to the endpoint with lowest network latency | Active/active with automatic region selection |
| Geographic | Routes based on the user's geographic location | Data residency compliance + DR |
| Weighted | Distributes traffic proportionally across endpoints | Gradual failover / canary DR testing |
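As a rough mental model of the Priority method, the sketch below picks the enabled, healthy endpoint with the lowest priority number. It is a simplification under stated assumptions: the `resolve_priority` function and the dictionary fields are hypothetical, and the real service behaves differently in edge cases (for example, when every endpoint is degraded it still returns endpoints rather than nothing).

```python
def resolve_priority(endpoints):
    """Return the name of the endpoint that priority routing would answer with.

    endpoints: list of dicts with 'name', 'priority', 'enabled', 'healthy'.
    Returns None when no endpoint is both enabled and healthy (simplification).
    """
    candidates = [e for e in endpoints if e["enabled"] and e["healthy"]]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e["priority"])["name"]


endpoints = [
    {"name": "ep-eastus-primary", "priority": 1, "enabled": True, "healthy": True},
    {"name": "ep-westus-secondary", "priority": 2, "enabled": True, "healthy": True},
]
print(resolve_priority(endpoints))   # primary answers while it is healthy

endpoints[0]["healthy"] = False      # health probes mark the primary as degraded
print(resolve_priority(endpoints))   # traffic shifts to the secondary
```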
# Create a Traffic Manager profile with priority routing
az network traffic-manager profile create \
--resource-group rg-dr \
--name tm-myapp-dr \
--routing-method Priority \
--unique-dns-name myapp-global \
--ttl 60 \
--protocol HTTPS \
--port 443 \
--path "/health" \
--interval 10 \
--timeout 5 \
--max-failures 3
# Add primary endpoint (East US)
az network traffic-manager endpoint create \
--resource-group rg-dr \
--profile-name tm-myapp-dr \
--name ep-eastus-primary \
--type azureEndpoints \
--target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-eastus/providers/Microsoft.Web/sites/myapp-eastus \
--priority 1 \
--endpoint-status Enabled
# Add secondary endpoint (West US - DR site)
az network traffic-manager endpoint create \
--resource-group rg-dr \
--profile-name tm-myapp-dr \
--name ep-westus-secondary \
--type azureEndpoints \
--target-resource-id /subscriptions/<sub-id>/resourceGroups/rg-app-westus/providers/Microsoft.Web/sites/myapp-westus \
--priority 2 \
--endpoint-status Enabled
# Check Traffic Manager endpoint health
az network traffic-manager endpoint show \
--resource-group rg-dr \
--profile-name tm-myapp-dr \
--name ep-eastus-primary \
--type azureEndpoints \
--query '{Name:name, Status:endpointStatus, MonitorStatus:endpointMonitorStatus, Priority:priority}' \
--output table
# Simulate failover: disable the primary endpoint
az network traffic-manager endpoint update \
--resource-group rg-dr \
--profile-name tm-myapp-dr \
--name ep-eastus-primary \
--type azureEndpoints \
--endpoint-status Disabled
DNS TTL and Failover Time
Traffic Manager failover time depends on the DNS TTL and on how quickly health probes detect the failure. With a 60-second TTL, a 10-second probe interval, and 3 tolerated failures, the worst-case failover time is approximately 90 seconds: about 30 seconds for the probes to mark the endpoint unhealthy plus up to 60 seconds for cached DNS answers to expire. The probe timeout can add a few seconds on top of this estimate. For faster failover, reduce the TTL (the minimum is 0 seconds, though most resolvers enforce a floor) and the probe interval. Lower TTLs increase DNS query volume, and because Traffic Manager bills per DNS query, this may slightly increase costs.
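The estimate above can be written as a one-line formula. This is a minimal sketch of the same simplified model (detection time plus DNS TTL); the function name is hypothetical, and the model deliberately ignores the probe timeout and client-side caching beyond the TTL.

```python
def worst_case_failover_seconds(dns_ttl, probe_interval, tolerated_failures):
    """Simplified estimate: probe detection time plus DNS cache expiry."""
    detection = probe_interval * tolerated_failures
    return detection + dns_ttl


# Settings used in the profile created earlier: TTL 60, interval 10, 3 failures.
print(worst_case_failover_seconds(60, 10, 3))

# Halving the TTL shortens the estimate by the same 30 seconds.
print(worst_case_failover_seconds(30, 10, 3))
```

Plugging your own profile settings into this formula is a quick way to check whether the DNS layer alone already exceeds your RTO.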
Database Geo-Replication
Database geo-replication is often the most complex component of a DR strategy because databases contain the application's state and must be consistent after failover. Azure provides built-in geo-replication for most managed database services, but each service has different replication characteristics and failover procedures.
| Service | Geo-Replication Type | Failover | Data Loss Risk |
|---|---|---|---|
| Azure SQL Database | Active geo-replication or auto-failover groups | Automatic (failover groups) or manual | Minimal (async, typically < 5 seconds lag) |
| Azure Cosmos DB | Multi-region writes or single-write with read replicas | Automatic (service-managed) or manual | Configurable via consistency level |
| Azure Database for PostgreSQL | Read replicas across regions | Manual (promote replica) | Depends on replication lag |
| Azure Cache for Redis | Passive (Premium) or active (Enterprise) | Manual (Premium) or automatic (Enterprise) | Depends on replication lag |
| Azure Storage | GRS/GZRS (async cross-region replication) | Manual (initiate account failover) | Replication lag is typically under 15 minutes, with no RPO SLA |
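Because most of these services replicate asynchronously, the data at risk in an unplanned failover is roughly the write rate multiplied by the replication lag. The sketch below shows that arithmetic; the function name and the write rates are assumptions for illustration, and real lag varies with load.

```python
def estimated_lost_writes(writes_per_second, replication_lag_seconds):
    """Rough upper bound on writes not yet replicated when the primary fails."""
    return writes_per_second * replication_lag_seconds


# Assumed workload of 200 writes/second with 5 seconds of replication lag.
print(estimated_lost_writes(200, 5))

# The same workload with 15 minutes of lag.
print(estimated_lost_writes(200, 15 * 60))
```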
# Create a primary SQL server (East US)
az sql server create \
--resource-group rg-data-eastus \
--name sql-myapp-eastus \
--location eastus \
--admin-user sqladmin \
--admin-password "<strong-password>"
# Create a secondary SQL server (West US) for DR
az sql server create \
--resource-group rg-data-westus \
--name sql-myapp-westus \
--location westus \
--admin-user sqladmin \
--admin-password "<strong-password>"
# Create a database on the primary server
az sql db create \
--resource-group rg-data-eastus \
--server sql-myapp-eastus \
--name db-myapp \
--service-objective S3 \
--backup-storage-redundancy Geo
# Create an auto-failover group
az sql failover-group create \
--resource-group rg-data-eastus \
--server sql-myapp-eastus \
--partner-server sql-myapp-westus \
--partner-resource-group rg-data-westus \
--name fog-myapp \
--add-db db-myapp \
--failover-policy Automatic \
--grace-period 1
# The failover group creates a listener endpoint:
# fog-myapp.database.windows.net (always points to primary)
# fog-myapp.secondary.database.windows.net (always points to secondary)
# Your application should connect to the listener endpoint
# Test failover (manual trigger)
az sql failover-group set-primary \
--resource-group rg-data-westus \
--server sql-myapp-westus \
--name fog-myapp
# Fail back to original primary
az sql failover-group set-primary \
--resource-group rg-data-eastus \
--server sql-myapp-eastus \
--name fog-myapp
DR Testing & Drills
A disaster recovery plan that has never been tested is not a plan; it is a hope. Regular DR testing validates that your recovery procedures work, identifies gaps in automation, trains your team on the failover process, and provides evidence for compliance auditors. Azure Site Recovery supports test failovers that create isolated recovery VMs without impacting production or replication.
Types of DR Tests
| Test Type | Impact | Frequency | Scope |
|---|---|---|---|
| Tabletop Exercise | None (discussion only) | Quarterly | Walk through scenarios, review runbooks |
| Test Failover (ASR) | None (isolated network) | Monthly | Validate replication, test recovery plans |
| Planned Failover | Controlled downtime | Twice a year | Full end-to-end failover with DNS switch |
| Unannounced Drill | Real failover | Annually | Chaos engineering; test team readiness |
# Create an isolated virtual network for the test
$testVnet = New-AzVirtualNetwork `
-Name "vnet-dr-test" `
-ResourceGroupName "rg-dr-testing" `
-Location "westus2" `
-AddressPrefix "10.99.0.0/16"
Add-AzVirtualNetworkSubnetConfig `
-Name "snet-test" `
-VirtualNetwork $testVnet `
-AddressPrefix "10.99.1.0/24"
$testVnet | Set-AzVirtualNetwork
# Execute test failover using the recovery plan
$plan = Get-AzRecoveryServicesAsrRecoveryPlan -Name "rp-myapp-full"
Start-AzRecoveryServicesAsrTestFailoverJob `
-RecoveryPlan $plan `
-Direction PrimaryToRecovery `
-AzureVMNetworkId $testVnet.Id
# Monitor test failover progress
Get-AzRecoveryServicesAsrJob | `
Where-Object { $_.State -eq "InProgress" } | `
Select-Object Name, State, StartTime, `
@{N="Duration"; E={(Get-Date) - $_.StartTime}}
# After validation: Clean up test failover resources
Start-AzRecoveryServicesAsrTestFailoverCleanupJob `
-RecoveryPlan $plan `
-Comment "DR test completed successfully. All services validated."
DR Test Automation
Automate your DR tests using Azure DevOps Pipelines or Azure Automation. Create a scheduled pipeline that performs a test failover monthly, runs automated validation checks (HTTP health endpoints, database connectivity, data integrity), captures screenshots of the running application, and sends a report to the DR team. This ensures DR is tested consistently without relying on manual scheduling and execution.
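The validation step of such a pipeline can be sketched as a small check runner. Everything here is an illustrative assumption: `run_validation` and `summarize` are hypothetical helpers, and the lambda checks stand in for real probes (an HTTP request to the health endpoint, a database query, a row-count comparison) that you would supply.

```python
def run_validation(checks):
    """Run named checks; each check is a zero-argument callable returning a bool."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failure instead of aborting the drill.
            results[name] = False
    return results


def summarize(results):
    """Condense check results into a report for the DR team."""
    failed = sorted(name for name, ok in results.items() if not ok)
    return {
        "status": "PASS" if not failed else "FAIL",
        "passed": len(results) - len(failed),
        "failed": failed,
    }


# Placeholder checks; replace the lambdas with real probes against the test failover.
checks = {
    "http-health": lambda: True,
    "database-connectivity": lambda: True,
    "data-integrity": lambda: 1 / 0,  # simulates a check that errors out
}
report = summarize(run_validation(checks))
print(report)
```

Treating an exception as a failed check keeps one broken probe from hiding the results of the others, which matters when the report is the only record of the drill.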
Best Practices & Cost Optimization
Building and maintaining a disaster recovery capability requires ongoing investment in infrastructure, automation, and testing. The following best practices help you build effective DR while managing costs.
Architecture Best Practices
- Use PaaS services where possible: Platform services like Azure SQL Database, Cosmos DB, and App Service have built-in geo-replication and failover capabilities that are simpler and more reliable than managing DR for IaaS VMs.
- Automate everything: Manual DR procedures are slow, error-prone, and stressful during an actual disaster. Invest in automation through recovery plans, runbooks, and infrastructure-as-code so that failover is a button press, not a multi-page runbook.
- Design for failback: Failing over to the DR region is only half the challenge. Plan and test the failback process (returning to the primary region) before you need it. Failback is often more complex because data has been modified in the DR region during the outage.
- Document everything: Maintain up-to-date DR documentation including architecture diagrams, runbooks, contact lists, escalation procedures, and configuration details. Store documentation in a location accessible during an outage (not solely in the primary region).
- Test regularly: Conduct DR tests at least quarterly using a mix of tabletop exercises, test failovers, and planned failovers. Each test should produce a report documenting what worked, what did not, and action items for improvement.
Cost Optimization
- Use Azure Site Recovery for VM replication: ASR costs approximately $25/month per replicated VM, plus replica storage and data transfer, which is far cheaper than running idle VMs in a warm standby DR region.
- Right-size DR resources: The DR region does not need to match the production region's capacity. Use smaller VM sizes in the DR region and scale up as part of the failover process (at the cost of a slightly longer RTO).
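- Leverage reserved instances strategically: Reserved Instances are scoped to a region, and instance size flexibility applies only within the same VM series in that region, so reservations bought for the primary region do not cover VMs started in the DR region. Budget for pay-as-you-go rates during a failover, and reserve capacity in the DR region only for components that run there continuously.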
- Use Azure Hybrid Benefit: Apply existing Windows Server and SQL Server licenses to DR VMs to reduce compute costs.
- Monitor DR costs separately: Tag all DR resources with a consistent tag (e.g., purpose: disaster-recovery) to track DR costs independently and ensure they remain proportional to the risk they mitigate.
- Review DR scope regularly: As your application evolves, some components may become less critical while new critical services are added. Review your DR scope quarterly to ensure you are protecting the right things.
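To compare the first two options in the list above, the sketch below contrasts ASR replication with an always-on warm standby. Only the $25 per-VM ASR fee comes from this guide; the storage cost and the VM hourly rate are made-up assumptions, so substitute figures from the Azure pricing calculator.

```python
ASR_FEE_PER_VM = 25.0  # approximate monthly ASR fee per replicated VM (from this guide)


def asr_monthly_cost(vm_count, storage_per_vm=0.0):
    """ASR licence fee plus an assumed replica storage cost per VM."""
    return vm_count * (ASR_FEE_PER_VM + storage_per_vm)


def warm_standby_monthly_cost(vm_count, vm_hourly_rate, hours_per_month=730):
    """Cost of keeping the same number of VMs running all month in the DR region."""
    return vm_count * vm_hourly_rate * hours_per_month


# Assumed inputs: 10 VMs, $20/month replica storage each, $0.20/hour per standby VM.
asr = asr_monthly_cost(10, storage_per_vm=20.0)
standby = warm_standby_monthly_cost(10, vm_hourly_rate=0.20)
print(f"ASR: ${asr:,.2f}  Warm standby: ${standby:,.2f}")
```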
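# List Recovery Services vaults (per-VM replication health comes from
# Get-AzRecoveryServicesAsrReplicationProtectedItem or the vault's portal view)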
az resource list \
--resource-type "Microsoft.RecoveryServices/vaults" \
--query "[].{Name:name, ResourceGroup:resourceGroup, Location:location}" \
--output table
# View backup jobs status
az backup job list \
--resource-group rg-backup \
--vault-name rsv-backup-prod \
--query "[?status=='InProgress' || status=='Failed'].{Job:name, Status:status, StartTime:startTime}" \
--output table
# Check Traffic Manager endpoint health
az network traffic-manager profile show \
--resource-group rg-dr \
--name tm-myapp-dr \
--query '{Profile:name, Status:profileStatus, Endpoints:endpoints[].{Name:name, Status:endpointStatus, Monitor:endpointMonitorStatus}}' \
--output json
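# Estimate DR costs with resource tags
# (requires the costmanagement CLI extension: az extension add --name costmanagement)
az costmanagement query \
--type Usage \
--timeframe MonthToDate \
--scope "subscriptions/<sub-id>" \
--dataset-filter '{"tags": {"name": "purpose", "operator": "In", "values": ["disaster-recovery"]}}' \
--query "properties.rows[].{Service:[0], Cost:[1]}" \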
--output table
Azure Chaos Studio
Azure Chaos Studio is a managed chaos engineering service that helps you test your application's resilience by deliberately injecting faults, such as VM shutdowns, network disconnections, DNS failures, and CPU/memory stress. Use Chaos Studio alongside your DR testing program to validate that your application degrades gracefully under partial failures and that your DR automation triggers correctly when real infrastructure problems occur.
Key Takeaways
- Azure Site Recovery (ASR) provides continuous replication of VMs to a secondary region.
- Paired regions share a geography, which supports data residency requirements, and receive prioritized recovery during widespread outages.
- Recovery plans automate multi-tier application failover with custom scripts and sequencing.
- Azure Backup provides point-in-time recovery, with RPOs ranging from a day for scheduled VM snapshots down to 15 minutes for SQL Server log backups.
- Traffic Manager and Front Door provide DNS-based and Layer 7 failover respectively.
- Regular DR drills using test failover validate recovery procedures without affecting production.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.