
Disaster Recovery & Backup Guide

Implement disaster recovery on GCP with the Backup and DR service, persistent disk snapshots, cross-region replication, GKE backup, and database DR.

CloudToolStack Team · 26 min read · Published Feb 22, 2026


Disaster Recovery on GCP

Disaster recovery (DR) is the process of restoring your applications and data to a functional state after a disruptive event, whether a regional cloud outage, a catastrophic software failure, a ransomware attack, or an accidental data deletion. On Google Cloud, DR planning leverages GCP's multi-region infrastructure, managed backup services, and cross-region replication capabilities to build resilient architectures that can recover from failures with minimal data loss and downtime.

Effective disaster recovery is not just about technology. It requires clear business requirements, documented procedures, regular testing, and organizational commitment. The cost of DR infrastructure must be balanced against the business impact of downtime. A financial trading platform requires near-zero downtime (costing millions per hour), while an internal reporting tool may tolerate hours of downtime (costing far less). Your DR strategy should reflect these realities.

GCP provides DR capabilities at multiple levels:

  • Infrastructure level: Multi-zone and multi-region deployments with automatic failover (managed instance groups, GKE multi-cluster, Cloud Spanner).
  • Data level: Cross-region replication (Cloud Storage, Cloud SQL, Bigtable), persistent disk snapshots, and managed backups.
  • Application level: Cloud Deploy for multi-region deployments, Cloud DNS for traffic failover, and Global Load Balancer for automatic backend health-based routing.
  • Managed service level: Backup and DR Service for centralized backup management across VMs, databases, and file systems.

| DR Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours to days | Hours to days | Low ($) | Low |
| Cold Standby (Pilot Light) | Minutes to hours | Hours | Low-Medium ($$) | Medium |
| Warm Standby | Minutes | Minutes to hours | Medium ($$$) | Medium-High |
| Hot Standby (Active-Active) | Near-zero | Seconds to minutes | High ($$$$) | High |

DR Is Not High Availability

High availability (HA) and disaster recovery are complementary but distinct concepts. HA is about surviving routine failures (a single server crash, a disk failure, a zone outage) with automatic failover and minimal human intervention. DR is about recovering from large-scale, low-probability events (an entire region going offline, a catastrophic data corruption, a ransomware attack) that overwhelm your HA mechanisms. A well-architected system needs both: HA for day-to-day resilience and DR for catastrophic scenarios.

RPO & RTO Fundamentals

Every DR strategy is defined by two key metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These metrics drive every technical and financial decision in your DR architecture.

  • RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. RPO determines how frequently you need to replicate or back up data.
  • RTO (Recovery Time Objective): The maximum acceptable downtime after a disaster. An RTO of 4 hours means your application must be restored and operational within 4 hours of the disaster event. RTO determines how much standby infrastructure you need.

The relationship between RPO/RTO and cost is inversely proportional: tighter RPO and RTO targets require more expensive infrastructure (continuous replication, hot standby capacity, automated failover). The key is matching your DR targets to the business value of the workload.

| Workload Type | Typical RPO | Typical RTO | Recommended Strategy |
|---|---|---|---|
| Financial transactions | Near-zero | < 1 minute | Hot standby with synchronous replication |
| E-commerce platform | < 5 minutes | < 30 minutes | Warm standby with async replication |
| SaaS application | < 1 hour | < 4 hours | Warm or cold standby |
| Internal tools | < 24 hours | < 24 hours | Backup and restore |
| Development environments | Days (or no DR) | Days | Rebuild from IaC |
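RPO targets are only met if backups actually keep pace with them: a backup taken every N hours can never satisfy an RPO tighter than N hours. A minimal sketch of a monitoring check that compares the age of the most recent backup against an RPO target (all names and values are illustrative):

```shell
#!/bin/bash
# Illustrative check: does the latest backup's age satisfy the RPO target?
# backup_age_ok BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
backup_age_ok() {
  local backup_epoch=$1 now_epoch=$2 rpo_seconds=$3
  local age=$(( now_epoch - backup_epoch ))
  if (( age <= rpo_seconds )); then
    echo "OK: backup age ${age}s is within RPO ${rpo_seconds}s"
  else
    echo "VIOLATION: backup age ${age}s exceeds RPO ${rpo_seconds}s"
  fi
}

# Example: RPO of 1 hour (3600s), latest backup taken 45 minutes ago
now=$(date +%s)
backup_age_ok "$(( now - 2700 ))" "$now" 3600
```

In practice the backup timestamp would come from `gcloud sql backups list` or `gcloud compute snapshots list`, and a VIOLATION result should page the on-call engineer before a disaster makes the gap visible.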

Backup & DR Service

Google Cloud's Backup and DR Service (formerly Actifio GO) provides a centralized, policy-driven backup management platform for Compute Engine VMs, Cloud SQL databases, GKE workloads, and file systems. It replaces the need to manage individual snapshot schedules, export scripts, and backup rotation policies across disparate services.

Key features of the Backup and DR Service:

  • Centralized management: A single console for managing backups across all supported GCP services, with unified policies, schedules, and retention rules.
  • Application-consistent backups: Uses VSS (Windows) and file system freezing (Linux) to ensure backups capture a consistent application state.
  • Incremental backups: After the initial full backup, only changed blocks are captured, reducing backup time and storage costs.
  • Cross-region backup: Store backup copies in a different region for geographic redundancy.
  • Instant recovery: Mount backups directly as Compute Engine instances for rapid recovery without waiting for a full restore.
Persistent disk snapshot management
# Create a snapshot of a persistent disk
gcloud compute disks snapshot my-data-disk \
  --zone=us-central1-a \
  --snapshot-names=my-data-disk-$(date +%Y%m%d-%H%M%S) \
  --storage-location=us

# Create a snapshot schedule (automated backups)
gcloud compute resource-policies create snapshot-schedule daily-backup \
  --region=us-central1 \
  --max-retention-days=30 \
  --on-source-disk-delete=keep-auto-snapshots \
  --daily-schedule \
  --start-time=02:00 \
  --storage-location=us \
  --labels=backup-policy=daily

# Attach the snapshot schedule to a disk
gcloud compute disks add-resource-policies my-data-disk \
  --zone=us-central1-a \
  --resource-policies=daily-backup

# Create a cross-region snapshot (for DR)
gcloud compute snapshots create dr-snapshot-$(date +%Y%m%d) \
  --source-disk=my-data-disk \
  --source-disk-zone=us-central1-a \
  --storage-location=us-east1 \
  --labels=purpose=disaster-recovery

# Restore a disk from a snapshot
gcloud compute disks create restored-disk \
  --source-snapshot=my-data-disk-20240115-020000 \
  --zone=us-east1-b \
  --type=pd-ssd

# Create a VM from a snapshot (DR recovery)
gcloud compute instances create recovered-vm \
  --zone=us-east1-b \
  --machine-type=e2-standard-4 \
  --create-disk="boot=yes,auto-delete=yes,source-snapshot=my-boot-disk-snapshot"

# List all snapshots
gcloud compute snapshots list \
  --filter="labels.purpose=disaster-recovery" \
  --sort-by=~creationTimestamp

# Delete old snapshots
gcloud compute snapshots delete old-snapshot-name --quiet

Snapshot Consistency for Databases

Persistent disk snapshots are crash-consistent, not application-consistent, by default. This means a snapshot taken while a database is writing data may capture an inconsistent state. For databases running on Compute Engine (self-managed MySQL, PostgreSQL, MongoDB), always flush writes and freeze the file system before taking a snapshot, or use the database's native backup tool. For managed databases (Cloud SQL, Spanner), use the service's built-in backup mechanism instead of disk snapshots.
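The quiesce-snapshot-thaw sequence described above can be sketched as follows. This is a hedged illustration, not a production script: the disk name, zone, and mount point are assumptions, and it defaults to dry-run mode (printing commands rather than executing them) so the sequence can be reviewed safely.

```shell
#!/bin/bash
# Sketch: application-consistent snapshot of a self-managed database disk.
# Disk name, zone, and mount point are illustrative.
# Defaults to dry-run (prints commands); set DRY_RUN=0 to actually execute.
set -u
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Quiesce the database first (e.g. FLUSH TABLES WITH READ LOCK for MySQL,
#    pg_backup_start() for PostgreSQL). Note: the lock must be held in an
#    open session for the duration of the freeze, which a real script must
#    manage explicitly.
# 2. Freeze the filesystem so no writes land mid-snapshot
run sudo fsfreeze --freeze /mnt/data
# 3. Take the snapshot while the filesystem is quiesced
run gcloud compute disks snapshot my-data-disk \
  --zone=us-central1-a \
  --snapshot-names="my-data-disk-consistent-$(date +%Y%m%d-%H%M%S)"
# 4. Thaw immediately; a production script should do this in a trap so the
#    filesystem is never left frozen on error
run sudo fsfreeze --unfreeze /mnt/data
```

Keeping the freeze window to just the snapshot-creation call matters: snapshots are created from a point-in-time view, so the filesystem can be thawed as soon as the snapshot operation begins.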

Cold Standby Strategy

A cold standby (also called "pilot light") strategy maintains the minimum infrastructure needed to restore your application in a DR region. The core data (databases, storage) is replicated to the DR region, but compute resources are not running. When a disaster occurs, you spin up compute resources, point them at the replicated data, and redirect traffic. This approach balances cost with recovery time: you pay only for storage replication during normal operations, but recovery takes time because infrastructure must be provisioned.

Cold standby components on GCP:

  • Cloud Storage: Use dual-region or multi-region buckets for automatic geographic redundancy, or use cross-region replication via Transfer Service for single-region buckets.
  • Cloud SQL: Configure cross-region read replicas that can be promoted to primary during failover.
  • Disk snapshots: Schedule cross-region snapshots that store copies in the DR region.
  • Infrastructure as Code: Maintain Terraform configurations that can deploy the full application stack in the DR region.
Cold standby architecture with Terraform
# Cold standby infrastructure - only data replication runs normally
# Compute resources are created only during DR activation

# Cross-region Cloud SQL read replica (always running, low cost)
resource "google_sql_database_instance" "dr_replica" {
  name                 = "my-app-db-dr-replica"
  master_instance_name = google_sql_database_instance.primary.name
  region               = "us-east1" # DR region
  database_version     = "POSTGRES_15"

  # Cross-region read replica; promote manually during DR activation
  # with `gcloud sql instances promote-replica`

  settings {
    tier              = "db-custom-2-7680" # Smaller than primary
    availability_type = "ZONAL"            # No HA needed for replica
    disk_size         = 100
    disk_type         = "PD_SSD"

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.dr_vpc.id
    }

    backup_configuration {
      enabled = false # Primary handles backups
    }
  }
}

# Multi-region Cloud Storage for automatic replication
resource "google_storage_bucket" "app_data" {
  name     = "my-app-data-multi-region"
  location = "US" # Multi-region: auto-replicates across US regions

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      num_newer_versions = 5
    }
    action {
      type = "Delete"
    }
  }
}

# Cross-region snapshot schedule for Compute Engine disks
resource "google_compute_resource_policy" "dr_snapshots" {
  name   = "dr-snapshot-policy"
  region = "us-central1"

  snapshot_schedule_policy {
    schedule {
      hourly_schedule {
        hours_in_cycle = 4
        start_time     = "00:00"
      }
    }

    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }

    snapshot_properties {
      storage_locations = ["us-east1"] # Store snapshots in DR region
      labels = {
        purpose = "disaster-recovery"
      }
    }
  }
}

# DR activation script (run manually or via automation)
# This creates compute resources in the DR region
# resource "google_compute_instance" "dr_app_server" {
#   count = var.dr_active ? var.instance_count : 0
#   ...
# }

Warm Standby Strategy

A warm standby strategy runs a scaled-down version of your production infrastructure in the DR region at all times. Unlike cold standby, the compute resources are already running (at reduced capacity), and data is continuously replicated. When a disaster occurs, you scale up the DR infrastructure to full capacity and redirect traffic. This significantly reduces RTO compared to cold standby because the infrastructure is already provisioned and healthy.

Warm standby architecture with Cloud Run and Cloud SQL
# Warm standby uses Cloud Run (scales to near-zero when idle)
# and Cloud SQL cross-region replicas

# Deploy the application to both regions
# Primary region
gcloud run deploy my-api \
  --image=us-central1-docker.pkg.dev/my-project/images/my-api:latest \
  --region=us-central1 \
  --min-instances=2 \
  --max-instances=100 \
  --set-env-vars=DB_HOST=/cloudsql/my-project:us-central1:primary-db

# DR region (minimal capacity, ready to scale)
gcloud run deploy my-api \
  --image=us-east1-docker.pkg.dev/my-project/images/my-api:latest \
  --region=us-east1 \
  --min-instances=1 \
  --max-instances=100 \
  --set-env-vars=DB_HOST=/cloudsql/my-project:us-east1:dr-replica-db

# Set up Global Load Balancer with health-based failover
# The GLB automatically routes traffic to the healthy region

# Create serverless NEGs for both regions
gcloud compute network-endpoint-groups create api-neg-primary \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=my-api

gcloud compute network-endpoint-groups create api-neg-dr \
  --region=us-east1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=my-api

# Create backend service with both NEGs
gcloud compute backend-services create api-backend \
  --global \
  --load-balancing-scheme=EXTERNAL_MANAGED

# Primary backend (higher capacity)
gcloud compute backend-services add-backend api-backend \
  --global \
  --network-endpoint-group=api-neg-primary \
  --network-endpoint-group-region=us-central1 \
  --capacity-scaler=1.0

# DR backend (lower capacity, scales up on failover)
gcloud compute backend-services add-backend api-backend \
  --global \
  --network-endpoint-group=api-neg-dr \
  --network-endpoint-group-region=us-east1 \
  --capacity-scaler=0.5

# DR FAILOVER PROCEDURE:
# 1. Promote Cloud SQL replica to primary
gcloud sql instances promote-replica dr-replica-db

# 2. Update Cloud Run DR service to point to promoted database
gcloud run services update my-api \
  --region=us-east1 \
  --set-env-vars=DB_HOST=/cloudsql/my-project:us-east1:dr-replica-db

# 3. Scale up DR region capacity
gcloud run services update my-api \
  --region=us-east1 \
  --min-instances=5 \
  --max-instances=200

# 4. Update backend service to route all traffic to DR
gcloud compute backend-services update-backend api-backend \
  --global \
  --network-endpoint-group=api-neg-dr \
  --network-endpoint-group-region=us-east1 \
  --capacity-scaler=1.0

gcloud compute backend-services update-backend api-backend \
  --global \
  --network-endpoint-group=api-neg-primary \
  --network-endpoint-group-region=us-central1 \
  --capacity-scaler=0.0

Hot Standby & Multi-Region

A hot standby (active-active) strategy runs your application at full capacity in multiple regions simultaneously, with traffic distributed across all regions by the Global Load Balancer. Data is synchronously or near-synchronously replicated between regions. If one region fails, the remaining regions absorb the traffic automatically with no manual intervention. This provides the lowest possible RTO (seconds) and RPO (near-zero) but at the highest cost.

GCP services that natively support multi-region active-active:

| Service | Multi-Region Capability | Consistency Model |
|---|---|---|
| Cloud Spanner | Native multi-region with synchronous replication | Strong global consistency (external consistency) |
| Firestore | Multi-region deployment option | Strong consistency |
| Cloud Storage | Multi-region and dual-region buckets | Strong read-after-write consistency |
| Cloud Run | Deploy to multiple regions behind GLB | Stateless (depends on data layer) |
| GKE | Multi-cluster with Multi-cluster Ingress | Stateless (depends on data layer) |
| Bigtable | Multi-cluster replication | Eventual consistency across clusters |
| Pub/Sub | Global by default | At-least-once delivery |
Multi-region active-active with Cloud Spanner
# Cloud Spanner multi-region instance provides the strongest
# multi-region consistency guarantees of any cloud database

resource "google_spanner_instance" "multi_region" {
  name         = "global-transactions"
  config       = "nam-eur-asia1" # Multi-continent configuration
  display_name = "Global Transaction Database"
  # Processing units: 1000 = 1 node
  processing_units = 3000 # 3 nodes

  labels = {
    environment = "production"
    dr_tier     = "tier-1"
  }
}

resource "google_spanner_database" "orders" {
  instance = google_spanner_instance.multi_region.name
  name     = "orders"

  database_dialect = "GOOGLE_STANDARD_SQL"

  ddl = [
    <<-SQL
      CREATE TABLE Orders (
        OrderId STRING(36) NOT NULL,
        CustomerId STRING(36) NOT NULL,
        Status STRING(20) NOT NULL,
        TotalAmount NUMERIC NOT NULL,
        Region STRING(20) NOT NULL,
        CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
        UpdatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
      ) PRIMARY KEY (OrderId)
    SQL
    ,
    <<-SQL
      CREATE INDEX OrdersByCustomer ON Orders(CustomerId)
    SQL
  ]

  deletion_protection = true

  # Protect against accidental DROP DATABASE
  enable_drop_protection = true
}

# Spanner backup schedule
resource "google_spanner_backup_schedule" "daily" {
  instance = google_spanner_instance.multi_region.name
  database = google_spanner_database.orders.name

  name = "daily-backup"

  retention_duration = "2592000s" # 30 days

  spec {
    cron_spec {
      text = "0 2 * * *" # Daily at 2 AM
    }
  }

  full_backup_spec {}
}

# Multi-region Cloud Run deployment
resource "google_cloud_run_v2_service" "api_us" {
  name     = "orders-api"
  location = "us-central1"

  template {
    scaling {
      min_instance_count = 2
      max_instance_count = 100
    }

    containers {
      image = "us-central1-docker.pkg.dev/my-project/images/orders-api:latest"
      env {
        name  = "SPANNER_INSTANCE"
        value = google_spanner_instance.multi_region.name
      }
      env {
        name  = "SPANNER_DATABASE"
        value = "orders"
      }
    }
  }
}

resource "google_cloud_run_v2_service" "api_eu" {
  name     = "orders-api"
  location = "europe-west1"

  template {
    scaling {
      min_instance_count = 2
      max_instance_count = 100
    }

    containers {
      image = "europe-west1-docker.pkg.dev/my-project/images/orders-api:latest"
      env {
        name  = "SPANNER_INSTANCE"
        value = google_spanner_instance.multi_region.name
      }
      env {
        name  = "SPANNER_DATABASE"
        value = "orders"
      }
    }
  }
}

Cross-Region Replication Patterns

Different GCP services offer different replication mechanisms, each with trade-offs between consistency, latency, and cost. Understanding these replication patterns is essential for designing a DR architecture that meets your RPO requirements.

| Service | Replication Method | RPO | Configuration |
|---|---|---|---|
| Cloud SQL (PostgreSQL/MySQL) | Cross-region read replica (async) | Seconds to minutes | Create replica in DR region; promote on failover |
| Cloud Spanner | Synchronous multi-region replication | Zero | Choose multi-region instance config |
| Cloud Storage | Multi-region/dual-region auto-replication | Minutes (15 min max with turbo replication) | Select multi-region or dual-region location |
| Bigtable | Cluster replication (eventually consistent) | Seconds | Add cluster in DR region to instance |
| Firestore | Multi-region or regional deployment | Zero (multi-region) / N/A (regional) | Choose multi-region location at creation |
| Memorystore Redis | No native cross-region replication | Depends on export frequency | Scheduled RDB exports to cross-region GCS |
| Persistent Disk | Cross-region snapshots or async PD replication | Minutes to hours | Snapshot schedules with cross-region storage |
Cross-region replication configuration
# Cloud SQL: Create a cross-region read replica
gcloud sql instances create primary-db-dr-replica \
  --master-instance-name=primary-db \
  --region=us-east1 \
  --tier=db-custom-4-15360 \
  --availability-type=ZONAL \
  --no-assign-ip \
  --network=projects/my-project/global/networks/my-vpc

# Bigtable: Add a cluster in the DR region
gcloud bigtable clusters create dr-cluster \
  --instance=my-bigtable \
  --zone=us-east1-b \
  --num-nodes=3 \
  --storage-type=SSD

# Cloud Storage: Create a dual-region bucket (configurable dual-region)
gcloud storage buckets create gs://my-critical-data-dual \
  --location=US \
  --placement=us-central1,us-east1 \
  --default-storage-class=STANDARD \
  --enable-turbo-replication

# Cloud Storage: Replicate between single-region buckets
gcloud transfer jobs create \
  gs://source-bucket-central \
  gs://dr-bucket-east \
  --name=dr-replication \
  --schedule-starts=2024-01-01 \
  --schedule-repeats-every=1h \
  --overwrite-when=different

# Persistent Disk: Async replication (for critical disks)
gcloud compute disks start-async-replication my-data-disk \
  --zone=us-central1-a \
  --secondary-disk=dr-data-disk \
  --secondary-disk-zone=us-east1-b

# Check replication status
gcloud compute disks describe my-data-disk \
  --zone=us-central1-a \
  --format='yaml(asyncPrimaryDisk,asyncSecondaryDisks)'
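For the Memorystore Redis row in the table above, the fallback is scheduled RDB exports: the RPO equals the export interval. A hedged sketch with illustrative bucket and instance names; the actual export call is gated behind RUN_EXPORT=1 so the path-building logic can be reviewed on its own:

```shell
#!/bin/bash
# Sketch: periodic Memorystore Redis export to a dual-region bucket.
# Bucket and instance names are illustrative.
redis_export_uri() {
  # Build a timestamped object path in the DR bucket
  local bucket=$1 stamp=$2
  echo "gs://${bucket}/redis/cache-${stamp}.rdb"
}

URI=$(redis_export_uri "my-critical-data-dual" "$(date +%Y%m%d-%H%M)")
echo "Export target: $URI"

if [ "${RUN_EXPORT:-0}" = "1" ]; then
  # Invoked periodically by cron or Cloud Scheduler; the export interval
  # is effectively the RPO for the cached data
  gcloud redis instances export "$URI" my-redis-cache --region=us-central1
fi
```

Because Redis usually holds cache data, many teams accept a loose RPO here; export only if the instance holds state that cannot be rebuilt from the primary database.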

Dual-Region with Turbo Replication

For Cloud Storage, dual-region buckets with turbo replication guarantee that 100% of newly written objects are replicated to both regions within 15 minutes (compared to the default target of "most objects within minutes"). This is critical for workloads where the RPO for object storage data must be measured in minutes rather than hours. Turbo replication adds a small premium to storage costs but provides a strong SLA-backed RPO guarantee.

GKE Backup & Restore

Backup for GKE is a managed service that backs up and restores entire GKE workloads, including Kubernetes resource configurations, persistent volume data, and custom resource definitions. It goes beyond simple etcd snapshots by providing application-aware backups that can selectively back up specific namespaces, protect persistent volumes, and restore to different clusters (including clusters in different regions for DR).

GKE Backup and Restore
# Enable the Backup for GKE API
gcloud services enable gkebackup.googleapis.com

# Create a backup plan for the production cluster
gcloud beta container backup-restore backup-plans create prod-daily-backup \
  --project=my-project \
  --location=us-central1 \
  --cluster=projects/my-project/locations/us-central1/clusters/prod-cluster \
  --all-namespaces \
  --include-volume-data \
  --include-secrets \
  --cron-schedule="0 2 * * *" \
  --backup-retain-days=30 \
  --backup-delete-lock-days=7 \
  --labels=team=platform

# Create a backup plan for specific namespaces only
gcloud beta container backup-restore backup-plans create app-backup \
  --project=my-project \
  --location=us-central1 \
  --cluster=projects/my-project/locations/us-central1/clusters/prod-cluster \
  --selected-namespaces=production,staging \
  --include-volume-data \
  --cron-schedule="0 */6 * * *" \
  --backup-retain-days=14

# Manually trigger an on-demand backup
gcloud beta container backup-restore backups create manual-backup-$(date +%Y%m%d) \
  --project=my-project \
  --location=us-central1 \
  --backup-plan=prod-daily-backup \
  --wait-for-completion

# List available backups
gcloud beta container backup-restore backups list \
  --project=my-project \
  --location=us-central1 \
  --backup-plan=prod-daily-backup

# Create a restore plan (can target a different cluster/region for DR)
gcloud beta container backup-restore restore-plans create dr-restore-plan \
  --project=my-project \
  --location=us-east1 \
  --backup-plan=projects/my-project/locations/us-central1/backupPlans/prod-daily-backup \
  --cluster=projects/my-project/locations/us-east1/clusters/dr-cluster \
  --all-namespaces \
  --volume-data-restore-policy=RESTORE_VOLUME_DATA_FROM_BACKUP \
  --cluster-resource-conflict-policy=USE_BACKUP_VERSION

# Execute a restore
gcloud beta container backup-restore restores create dr-restore-$(date +%Y%m%d) \
  --project=my-project \
  --location=us-east1 \
  --restore-plan=dr-restore-plan \
  --backup=projects/my-project/locations/us-central1/backupPlans/prod-daily-backup/backups/manual-backup-20240115 \
  --wait-for-completion

Database DR Strategies

Databases are typically the most critical component in a DR strategy because they hold the application state that cannot be recreated. Each GCP database service offers different DR capabilities, and the right approach depends on your consistency requirements, RPO targets, and budget.

Cloud SQL DR

Cloud SQL supports three levels of DR protection: automated backups (RPO: up to 24 hours), point-in-time recovery using write-ahead logs (RPO: seconds), and cross-region read replicas (RPO: seconds, RTO: minutes). For production workloads, combine all three for defense in depth.

Cloud SQL DR configuration
# Create a Cloud SQL instance with HA and automated backups
gcloud sql instances create primary-db \
  --database-version=POSTGRES_15 \
  --tier=db-custom-4-15360 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup-start-time=02:00 \
  --enable-point-in-time-recovery \
  --retained-backups-count=30 \
  --retained-transaction-log-days=7 \
  --network=projects/my-project/global/networks/my-vpc \
  --no-assign-ip

# Create a cross-region replica for DR
gcloud sql instances create primary-db-dr \
  --master-instance-name=primary-db \
  --region=us-east1 \
  --tier=db-custom-4-15360 \
  --availability-type=ZONAL \
  --network=projects/my-project/global/networks/my-vpc \
  --no-assign-ip

# Check replication status
gcloud sql instances describe primary-db-dr \
  --format='yaml(replicaConfiguration,replicaNames)'

# DR FAILOVER: Promote the replica to a standalone instance
# WARNING: This is irreversible - the replica becomes independent
gcloud sql instances promote-replica primary-db-dr

# After promotion, create a new replica for continued DR protection
gcloud sql instances create primary-db-dr-new \
  --master-instance-name=primary-db-dr \
  --region=us-central1 \
  --tier=db-custom-4-15360

# Point-in-time recovery (to recover from data corruption)
gcloud sql instances clone primary-db recovered-db \
  --point-in-time='2024-01-15T14:30:00Z'

# Export database for additional backup (to Cloud Storage)
gcloud sql export sql primary-db \
  gs://my-backups/sql/primary-db-$(date +%Y%m%d).sql.gz \
  --database=myapp \
  --offload

Cloud SQL Replica Promotion Is Irreversible

When you promote a Cloud SQL cross-region read replica, it becomes a standalone primary instance. The replication link to the original primary is permanently severed and cannot be re-established. This means you must create a new replica after promotion to restore DR protection. Plan for this in your DR runbook and automate the post-failover replica creation. Also note that promotion can take several minutes depending on the replication lag at the time of failover.
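The post-failover step called out above is worth automating end to end: promote, wait for the promoted instance to report RUNNABLE, then immediately recreate a replica. A hedged sketch reusing the instance names from the examples above; the gcloud calls are gated behind RUN_FAILOVER=1 so the readiness check can be reviewed in isolation:

```shell
#!/bin/bash
# Sketch: runbook step that restores DR protection right after promotion.
# Instance names are illustrative, matching the examples above.
instance_ready() {
  # Cloud SQL reports state=RUNNABLE once an instance is serving
  [ "$1" = "RUNNABLE" ]
}

if [ "${RUN_FAILOVER:-0}" = "1" ]; then
  gcloud sql instances promote-replica primary-db-dr --quiet
  until instance_ready "$(gcloud sql instances describe primary-db-dr \
      --format='value(state)')"; do
    echo "Waiting for promotion to complete..."
    sleep 15
  done
  # Re-establish DR: new replica back in the original region
  gcloud sql instances create primary-db-dr-new \
    --master-instance-name=primary-db-dr \
    --region=us-central1 \
    --tier=db-custom-4-15360
fi
```

The window between promotion and the new replica becoming healthy is a period with no cross-region protection, so the runbook should track it explicitly.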

Testing & Best Practices

A disaster recovery plan that has not been tested is not a plan; it is a hope. Regular DR testing is essential to validate that your recovery procedures work, your team knows the runbook, and your RPO/RTO targets are achievable. Without testing, you will discover problems during an actual disaster, when the stakes are highest and the pressure is greatest.

DR Testing Framework

| Test Type | Frequency | Scope | Impact |
|---|---|---|---|
| Tabletop exercise | Quarterly | Walk through scenarios verbally | None (discussion only) |
| Component test | Monthly | Test individual components (backup restore, replica promotion) | Minimal (non-production) |
| Full failover test | Semi-annually | Complete failover to DR region | Planned downtime or traffic split |
| Chaos engineering | Ongoing | Inject failures to test resilience | Controlled production impact |

DR Runbook Checklist

Every DR plan should include a detailed runbook with the following elements:

  • Decision criteria: Clear triggers that initiate DR activation (e.g., "region-level outage lasting more than 30 minutes" or "Google Cloud Status Dashboard confirms regional impact").
  • Roles and contacts: Who authorizes the failover, who executes each step, and who communicates to stakeholders.
  • Step-by-step procedures: Detailed commands for each failover action, including database promotion, DNS updates, and traffic shifting.
  • Validation checks: How to verify that the DR environment is functioning correctly after failover.
  • Failback procedures: How to return to the primary region after the disaster is resolved, including re-establishing replication.
  • Communication plan: Templates for notifying customers, stakeholders, and support teams about the failover status.
DR test automation script
#!/bin/bash
# dr-test.sh - Automated DR testing script
# Run this monthly against a test environment

set -euo pipefail

PROJECT_ID="my-project"
PRIMARY_REGION="us-central1"
DR_REGION="us-east1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

echo "=== DR Test Started: $TIMESTAMP ==="

# Test 1: Verify backup freshness
echo "--- Test 1: Checking backup freshness ---"
LATEST_BACKUP=$(gcloud sql backups list --instance=primary-db \
  --filter="status=SUCCESSFUL" \
  --sort-by=~startTime \
  --limit=1 \
  --format='value(startTime)')
echo "Latest Cloud SQL backup: $LATEST_BACKUP"

LATEST_SNAPSHOT=$(gcloud compute snapshots list \
  --filter="labels.purpose=disaster-recovery" \
  --sort-by=~creationTimestamp \
  --limit=1 \
  --format='value(name)')
echo "Latest disk snapshot: $LATEST_SNAPSHOT"

# Test 2: Verify cross-region replication status
echo "--- Test 2: Checking replication status ---"
REPLICA_STATUS=$(gcloud sql instances describe primary-db-dr \
  --format='value(state)')
echo "Cloud SQL replica status: $REPLICA_STATUS"

if [ "$REPLICA_STATUS" != "RUNNABLE" ]; then
  echo "CRITICAL: DR replica is not in RUNNABLE state!"
  exit 1
fi

# Test 3: Test snapshot restore (to a temporary disk)
echo "--- Test 3: Testing snapshot restore ---"
gcloud compute disks create dr-test-restore-$TIMESTAMP \
  --source-snapshot=$LATEST_SNAPSHOT \
  --zone=$DR_REGION-b \
  --type=pd-ssd --quiet

echo "Snapshot restore successful"
gcloud compute disks delete dr-test-restore-$TIMESTAMP \
  --zone=$DR_REGION-b --quiet

# Test 4: Verify DR region Cloud Run service health
echo "--- Test 4: Checking DR region service health ---"
DR_URL=$(gcloud run services describe my-api \
  --region=$DR_REGION \
  --format='value(status.url)')

HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$DR_URL/health")
echo "DR service health check: HTTP $HTTP_STATUS"

if [ "$HTTP_STATUS" != "200" ]; then
  echo "WARNING: DR region service returned non-200 status"
fi

# Test 5: Measure actual RTO (time to promote replica)
echo "--- Test 5: Measuring simulated RTO ---"
echo "Simulated RTO measurement would require a full failover test"
echo "Schedule a full failover test for the next maintenance window"

echo "=== DR Test Completed: $(date +%Y%m%d-%H%M%S) ==="
echo "Archive the output to gs://$PROJECT_ID-dr-reports/dr-test-$TIMESTAMP.log"

Automate Everything in Your DR Plan

The worst time to be running manual commands is during a disaster. Automate your DR procedures using scripts, Terraform, and Cloud Workflows. Store your DR runbooks in version control alongside your infrastructure code. Use Cloud Build triggers to execute DR scripts with a single manual approval. The goal is to reduce human decision-making during a crisis to a single question: "Should we activate DR?" Everything after that should be automated, tested, and repeatable.
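A minimal sketch of that single-question gate: every automated step runs only after one explicit, exact-phrase approval (the failover.sh script referenced in the comment is hypothetical):

```shell
#!/bin/bash
# Sketch: gate the automated failover behind one explicit approval.
confirm_activation() {
  # Require the exact phrase so a stressed operator cannot approve by accident
  [ "$1" = "ACTIVATE-DR" ]
}

read -r -p "Type ACTIVATE-DR to begin automated failover: " reply || reply=""
if confirm_activation "$reply"; then
  echo "Approval received - executing failover"
  # ./failover.sh   # hypothetical: promote DB, scale DR region, shift traffic
else
  echo "Aborted - no changes made"
fi
```

The same pattern maps to a Cloud Build trigger with a manual approval step: the human supplies the decision, the pipeline supplies the tested, repeatable execution.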

Key Takeaways

  1. GCP's Backup and DR Service provides centralized, policy-driven backup management across VMs, databases, and GKE workloads.
  2. Persistent disk snapshots provide incremental, cross-region backups for Compute Engine VMs.
  3. Cloud SQL HA with automatic failover provides near-zero RPO; cross-region read replicas add regional DR.
  4. Cloud Spanner provides zero-RPO multi-region deployments with strong consistency and near-zero RTO.
  5. Backup for GKE protects Kubernetes workloads, persistent volumes, and cluster configuration.
  6. Regular DR testing with non-disruptive drills validates recovery procedures and RPO/RTO targets.

Frequently Asked Questions

What is GCP Backup and DR?
Backup and DR (formerly Actifio GO) is a centralized backup management service. It provides policy-driven backup scheduling, retention management, and recovery for Compute Engine VMs, Cloud SQL databases, GKE workloads, and VMware VMs. It supports application-consistent backups and cross-region storage.
How do persistent disk snapshots work?
Snapshots capture the state of a persistent disk at a point in time. After the first full snapshot, subsequent snapshots are incremental (only changed blocks). Snapshots are stored in Cloud Storage and can be created in a different region than the source disk for DR purposes.
What is the RPO for Cloud SQL HA?
Cloud SQL HA with automatic failover provides an RPO of seconds (near-zero data loss) because it uses synchronous replication to a standby instance. Cross-region read replicas provide RPO of seconds to minutes using asynchronous replication. Failover takes approximately 30-120 seconds.
How do I protect GKE workloads?
Use Backup for GKE, which backs up Kubernetes resources (deployments, services, PVCs) and persistent volume data. You can back up entire namespaces or selected resources, restore to the same or different cluster, and schedule automated backups with retention policies.
How much does DR cost on GCP?
Costs vary by strategy. Persistent disk snapshots: $0.026/GB/month. Cloud SQL HA: approximately 2x standard instance cost. Spanner multi-region: 3x single-region cost. Backup and DR: per-protected-instance pricing. Cross-region data transfer: $0.01-$0.08/GB depending on regions.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.