OCI Disaster Recovery

Implement DR strategies on OCI from backup/restore to active-active, using Full Stack DR, cross-region replication, and Data Guard.

CloudToolStack Team | 26 min read | Published Mar 14, 2026

Prerequisites

  • Understanding of OCI networking and compute services
  • Familiarity with RTO/RPO concepts

Introduction to Disaster Recovery on OCI

Disaster recovery (DR) is the process of restoring critical IT systems and data after a catastrophic event such as a regional outage, natural disaster, or infrastructure failure. Oracle Cloud Infrastructure provides a comprehensive set of DR capabilities, from basic cross-region backups to fully automated failover with the Full Stack Disaster Recovery (FSDR) service. Understanding and implementing DR is essential for any production workload where downtime translates directly to revenue loss or regulatory non-compliance.

DR planning on OCI balances three factors: Recovery Time Objective (RTO), which defines the maximum acceptable downtime; Recovery Point Objective (RPO), which defines the maximum acceptable data loss; and the cost of maintaining the DR infrastructure. Different applications have different requirements: a static marketing website might tolerate hours of downtime, while a financial trading platform needs sub-minute failover with zero data loss.

This guide covers the complete spectrum of OCI DR strategies, from cold standby (lowest cost, highest RTO) to active-active multi-region (highest cost, near-zero RTO/RPO). You will learn how to configure cross-region replication for storage and databases, set up Full Stack DR for automated failover, design multi-region architectures, and conduct DR drills to validate your recovery procedures.

DR Terminology

  • RTO (Recovery Time Objective): Maximum acceptable time from disaster to restored service.
  • RPO (Recovery Point Objective): Maximum acceptable data loss, measured in time (for example, an RPO of 1 hour means you can lose up to 1 hour of data).
  • Primary region: The region running your production workload.
  • Standby region: The region where DR resources are maintained.
  • Failover: Switching production traffic from primary to standby.
  • Switchback: Returning production traffic from standby to primary after the disaster is resolved.

DR Strategy Tiers

OCI supports four standard DR tiers, each offering a different balance of cost, complexity, and recovery speed. Choose the tier based on your application's RTO/RPO requirements and your budget for standby infrastructure.

| Strategy | RTO | RPO | Standby Cost | Description |
|---|---|---|---|---|
| Backup & Restore | 24+ hours | 24 hours | Minimal (storage only) | Periodic backups replicated cross-region; rebuild on failure |
| Pilot Light | 1-4 hours | Minutes to hours | Low (DB replicas only) | Core data replicated; compute provisioned on failover |
| Warm Standby | 15-60 minutes | Minutes | Medium (scaled-down infra) | Scaled-down copy running in standby; scale up on failover |
| Active-Active | Near-zero | Near-zero | High (full duplicate) | Both regions serve traffic; DNS-based failover |
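As a rough illustration of how the tiers map to requirements, the sketch below picks the cheapest tier whose typical RTO and RPO fit a pair of targets. The function name `suggest_dr_tier` is hypothetical, the RTO cutoffs are read off the table, and the RPO cutoffs are assumptions chosen to make "minutes" and "minutes to hours" concrete. Treat it as a starting point for discussion, not a sizing tool.

```shell
# Suggest the cheapest DR tier whose typical RTO/RPO fits the targets.
# Arguments: RTO target and RPO target, both in minutes.
# The RTO cutoffs come from the tier table; the RPO cutoffs (1440, 60, 5)
# are assumptions that pin down "24 hours", "minutes to hours" and "minutes".
suggest_dr_tier() {
  local rto_min="$1" rpo_min="$2"
  if [ "$rto_min" -ge 1440 ] && [ "$rpo_min" -ge 1440 ]; then
    echo "backup-restore"
  elif [ "$rto_min" -ge 240 ] && [ "$rpo_min" -ge 60 ]; then
    echo "pilot-light"
  elif [ "$rto_min" -ge 60 ] && [ "$rpo_min" -ge 5 ]; then
    echo "warm-standby"
  else
    echo "active-active"
  fi
}

suggest_dr_tier 2880 1440   # prints "backup-restore"
suggest_dr_tier 240 60      # prints "pilot-light"
suggest_dr_tier 5 0         # prints "active-active"
```

Both targets must be met, so a relaxed RTO paired with a tight RPO still pushes the result toward a more expensive tier.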

Cross-Region Data Replication

The foundation of any DR strategy is ensuring your data is replicated to the standby region. OCI provides native cross-region replication for Object Storage, Block Volumes, Boot Volumes, and databases. Each service has its own replication mechanism with different RPO characteristics.

Object Storage Cross-Region Replication

bash
# Enable cross-region replication on a bucket
oci os replication create-replication-policy \
  --namespace <namespace> \
  --bucket-name "application-data" \
  --name "dr-replication-to-phoenix" \
  --destination-region "us-phoenix-1" \
  --destination-bucket "application-data-dr"

# List replication policies for a bucket
oci os replication list-replication-policies \
  --namespace <namespace> \
  --bucket-name "application-data" \
  --query 'data[].{"Name": name, "Destination": "destination-region-name", "Status": status}' \
  --output table

# Check replication status
oci os replication get-replication-policy \
  --namespace <namespace> \
  --bucket-name "application-data" \
  --replication-id <replication-policy-id>

# Cross-region copy for boot volume and block volume backups
oci bv boot-volume-backup copy \
  --boot-volume-backup-id <backup-ocid> \
  --destination-region "us-phoenix-1"

oci bv volume-backup copy \
  --volume-backup-id <backup-ocid> \
  --destination-region "us-phoenix-1"

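# Automate block volume backups with a backup policy
# (scheduled backups stay in the source region unless they are copied
# or the policy is configured with a destination region)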
oci bv volume-backup-policy create \
  --compartment-id <compartment-ocid> \
  --display-name "dr-backup-policy" \
  --schedules '[{
    "backupType": "INCREMENTAL",
    "period": "ONE_HOUR",
    "retentionSeconds": 604800,
    "timeZone": "UTC",
    "hourOfDay": 0,
    "offsetType": "STRUCTURED"
  }]'

Database Cross-Region Replication

bash
# Create a cross-region Data Guard standby for Autonomous Database.
# Run this against the standby region; the primary is referenced by --source-id.
oci db autonomous-database create-adb-cross-region-data-guard-details \
  --region "us-phoenix-1" \
  --compartment-id <standby-compartment-ocid> \
  --source-id <primary-adb-ocid> \
  --display-name "finance-db-standby" \
  --db-name "financedr" \
  --is-dedicated false

# For DB System (BaseDB), create a Data Guard association with an existing
# standby DB system in the peer region (the two VCNs must already be peered)
oci db data-guard-association create from-existing-db-system \
  --database-id <primary-database-ocid> \
  --database-admin-password "<admin-password>" \
  --protection-mode MAXIMUM_PERFORMANCE \
  --transport-type ASYNC \
  --peer-db-system-id <standby-db-system-ocid>

# Check Data Guard association status
oci db data-guard-association list \
  --database-id <primary-database-ocid> \
  --query 'data[].{"Role": role, "Peer Role": "peer-role", "State": "lifecycle-state", "Apply Lag": "apply-lag", "Transport": "transport-type"}' \
  --output table

# Manual failover (for testing or actual DR)
oci db data-guard-association failover \
  --database-id <standby-database-ocid> \
  --data-guard-association-id <association-ocid> \
  --database-admin-password "<admin-password>"

# Switchover (planned, no data loss)
oci db data-guard-association switchover \
  --database-id <primary-database-ocid> \
  --data-guard-association-id <association-ocid> \
  --database-admin-password "<admin-password>"

Data Guard Protection Modes

Choose your Data Guard protection mode based on RPO requirements. MAXIMUM PERFORMANCE (asynchronous redo transport) has minimal impact on primary performance but allows some data loss during an unplanned failover. MAXIMUM AVAILABILITY (synchronous redo transport) provides zero data loss under normal conditions, but the primary keeps running without that guarantee if the standby becomes unreachable. MAXIMUM PROTECTION (strict synchronous transport) guarantees zero data loss but halts the primary if the standby is unreachable. For cross-region DR, MAXIMUM PERFORMANCE is recommended because synchronous replication across regions adds a network round trip to every commit.
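With asynchronous transport, the practical question is whether the standby's apply lag stays inside your RPO. The sketch below shows one way to check that in a script. It assumes the lag is reported as a string such as `9 seconds`; the function name `lag_within_rpo` and the accepted units are illustrative assumptions, and any other format is rejected rather than guessed at.

```shell
# Return 0 when a Data Guard apply lag fits within an RPO given in seconds.
# Assumption: the lag looks like "<integer> <unit>", for example "9 seconds".
# Returns 1 when the lag exceeds the RPO and 2 when the string is not understood.
lag_within_rpo() {
  local lag="$1" rpo_seconds="$2" value unit seconds
  value=${lag%% *}   # text before the first space
  unit=${lag#* }     # text after the first space
  case "$value" in
    ''|*[!0-9]*) return 2 ;;
  esac
  case "$unit" in
    second|seconds) seconds=$value ;;
    minute|minutes) seconds=$((value * 60)) ;;
    hour|hours)     seconds=$((value * 3600)) ;;
    *) return 2 ;;
  esac
  [ "$seconds" -le "$rpo_seconds" ]
}

if lag_within_rpo "9 seconds" 60; then
  echo "apply lag is within the RPO"
else
  echo "apply lag exceeds the RPO or could not be parsed"
fi
```

In a readiness check, the first argument would come from the `apply-lag` field of the Data Guard association rather than a literal string.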

OCI Full Stack Disaster Recovery

OCI Full Stack Disaster Recovery (FSDR) is a managed service that orchestrates failover and switchback across your entire application stack, not just databases. FSDR coordinates compute, networking, load balancers, databases, Object Storage, and custom scripts into a single DR plan that you start with one click in the Console or a single CLI or API call.

FSDR uses the concept of DR Protection Groups that contain the resources in a region. You create a primary protection group and a standby protection group, then create a DR Plan that defines the steps to execute during failover and switchback. Plans support built-in steps for OCI resource operations and custom steps that run user-defined scripts.

bash
# Create a DR Protection Group for the primary region
oci disaster-recovery dr-protection-group create \
  --compartment-id <compartment-ocid> \
  --display-name "ecommerce-primary" \
  --log-location '{
    "namespace": "<namespace>",
    "bucket": "dr-logs"
  }' \
  --members '[
    {
      "memberType": "COMPUTE_INSTANCE",
      "memberId": "<web-server-ocid>",
      "isStartStopEnabled": true,
      "destinationCompartmentId": "<standby-compartment-ocid>",
      "destinationDedicatedVmHostId": null
    },
    {
      "memberType": "AUTONOMOUS_DATABASE",
      "memberId": "<adb-ocid>"
    },
    {
      "memberType": "LOAD_BALANCER",
      "memberId": "<lb-ocid>",
      "destinationLoadBalancerId": "<standby-lb-ocid>"
    },
    {
      "memberType": "NETWORK_LOAD_BALANCER",
      "memberId": "<nlb-ocid>",
      "destinationNetworkLoadBalancerId": "<standby-nlb-ocid>"
    }
  ]'

# Associate primary and standby protection groups
oci disaster-recovery dr-protection-group associate \
  --dr-protection-group-id <primary-pg-ocid> \
  --role PRIMARY \
  --peer-id <standby-pg-ocid> \
  --peer-region "us-phoenix-1"

Creating and Executing DR Plans

bash
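# Create a failover DR plan. FSDR plans are created in the protection group
# that currently holds the STANDBY role.
oci disaster-recovery dr-plan create \
  --display-name "ecommerce-failover-plan" \
  --dr-protection-group-id <standby-pg-ocid> \
  --type FAILOVER

# Create a switchover plan for planned role changes. Switching back later uses
# a switchover plan created in the other region once the roles have reversed.
oci disaster-recovery dr-plan create \
  --display-name "ecommerce-switchover-plan" \
  --dr-protection-group-id <standby-pg-ocid> \
  --type SWITCHOVER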

# View the auto-generated plan steps
oci disaster-recovery dr-plan get \
  --dr-plan-id <plan-ocid> \
  --query 'data."plan-groups"[].{"Group": "display-name", "Type": type, "Steps": steps[].{"Name": "display-name", "Type": type}}' \
  --output json

# Add a custom step (e.g., invalidate CDN cache after failover)
oci disaster-recovery dr-plan update \
  --dr-plan-id <plan-ocid> \
  --plan-groups '[
    {
      "id": "<existing-group-id>",
      "type": "USER_DEFINED",
      "displayName": "Post-Failover Tasks",
      "steps": [
        {
          "displayName": "Invalidate CDN Cache",
          "type": "RUN_LOCAL_SCRIPT_USER_DEFINED",
          "userDefinedStep": {
            "stepType": "RUN_LOCAL_SCRIPT",
            "runOnInstanceId": "<mgmt-instance-ocid>",
            "scriptCommand": "/opt/scripts/invalidate-cdn.sh"
          }
        },
        {
          "displayName": "Verify Application Health",
          "type": "RUN_LOCAL_SCRIPT_USER_DEFINED",
          "userDefinedStep": {
            "stepType": "RUN_LOCAL_SCRIPT",
            "runOnInstanceId": "<mgmt-instance-ocid>",
            "scriptCommand": "/opt/scripts/health-check.sh",
            "runAsUser": "opc"
          }
        }
      ]
    }
  ]'

# Run a failover precheck (validates the plan without performing the failover)
oci disaster-recovery dr-plan-execution create \
  --display-name "failover-precheck-2026-Q1" \
  --plan-id <plan-ocid> \
  --execution-options '{
    "planExecutionType": "FAILOVER_PRECHECK"
  }'

# Execute actual failover
oci disaster-recovery dr-plan-execution create \
  --display-name "failover-2026-03-14" \
  --plan-id <plan-ocid> \
  --execution-options '{
    "planExecutionType": "FAILOVER",
    "arePrechecksEnabled": true,
    "areWarningsIgnored": false
  }'

# Monitor execution progress
oci disaster-recovery dr-plan-execution get \
  --dr-plan-execution-id <execution-ocid> \
  --query 'data.{"Status": "lifecycle-state", "Started": "time-started", "Ended": "time-ended", "Groups": "group-executions"[].{"Name": "display-name", "Status": status}}' \
  --output json

DR Drills Are Essential

A DR plan that has never been tested is not a DR plan. OCI FSDR provides precheck execution that validates all plan steps without performing the actual failover. Run prechecks monthly and perform full DR drills quarterly. Document the results, including any failures and the time to complete each step. This practice reveals configuration drift, permission issues, and capacity constraints in the standby region before an actual disaster occurs.
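One lightweight way to document drill results is to record each step's duration and outcome, then compare the total against the RTO target. The sketch below assumes a hand-maintained input of `step-name seconds status` lines; that format, the step names, and the function name `summarize_drill` are all hypothetical, so adapt them to whatever your drill tooling actually records.

```shell
# Summarize a drill from "<step-name> <seconds> <status>" lines on stdin
# and compare the total duration against an RTO target given in minutes.
# A step counts as failed when its status is anything other than SUCCEEDED.
summarize_drill() {
  local rto_minutes="$1" total=0 failed=0 name secs status
  while read -r name secs status; do
    [ -n "$name" ] || continue              # skip blank lines
    total=$((total + secs))
    [ "$status" = "SUCCEEDED" ] || failed=$((failed + 1))
  done
  echo "total_seconds=$total failed_steps=$failed"
  if [ "$failed" -eq 0 ] && [ "$total" -le $((rto_minutes * 60)) ]; then
    echo "result=PASS"
  else
    echo "result=FAIL"
  fi
}

# Example drill record checked against a 30-minute RTO (step names are made up)
summarize_drill 30 <<'EOF'
stop-primary-compute 120 SUCCEEDED
failover-database 300 SUCCEEDED
start-standby-compute 240 SUCCEEDED
update-dns 60 SUCCEEDED
EOF
```

Keeping these records from drill to drill makes it easy to spot a step whose duration is creeping up before it pushes the total past the RTO.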

Multi-Region Architecture Design

For applications requiring the lowest possible RTO and RPO, an active-active multi-region architecture runs the full application stack in two or more regions simultaneously. Traffic is distributed across regions using DNS-based load balancing (OCI Traffic Management), and each region has its own database instance with cross-region replication.
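To make DNS-based steering concrete, a failover-style policy can be pictured as a small pipeline: drop disabled answers, drop unhealthy ones, order the rest by priority, and serve one. The sketch below simulates that selection locally. It illustrates the idea and is not OCI's implementation; the input format and the function name `pick_answer` are made up for this example.

```shell
# Pick the answer a failover-style steering policy would serve.
# Input lines on stdin: "<name> <priority> <disabled:true|false> <healthy:true|false>"
# Prints the chosen answer name; returns non-zero when no answer is eligible.
pick_answer() {
  local name prio disabled healthy best_name="" best_prio=""
  while read -r name prio disabled healthy; do
    [ "$disabled" = "false" ] || continue    # filter step: drop disabled answers
    [ "$healthy" = "true" ] || continue      # health step: drop unhealthy answers
    if [ -z "$best_prio" ] || [ "$prio" -lt "$best_prio" ]; then
      best_name=$name                        # priority step: lowest value wins
      best_prio=$prio
    fi
  done
  [ -n "$best_name" ] && echo "$best_name"   # limit step: serve a single answer
}

# Primary is failing its health check, so the standby answer is served
pick_answer <<'EOF'
primary 1 false false
standby 2 false true
EOF
```

When both regions are healthy the same function returns `primary`, which is why a failover policy behaves as active-passive from the client's point of view.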

bash
# Set up DNS-based traffic management for multi-region
# Create a health check for the primary region endpoint
oci health-checks http-monitor create \
  --compartment-id <compartment-ocid> \
  --display-name "primary-health-check" \
  --targets '["primary-lb.us-ashburn-1.example.com"]' \
  --protocol HTTPS \
  --port 443 \
  --path "/health" \
  --interval-in-seconds 30 \
  --timeout-in-seconds 10 \
  --is-enabled true

# Create a health check for the standby region
oci health-checks http-monitor create \
  --compartment-id <compartment-ocid> \
  --display-name "standby-health-check" \
  --targets '["standby-lb.us-phoenix-1.example.com"]' \
  --protocol HTTPS \
  --port 443 \
  --path "/health" \
  --interval-in-seconds 30 \
  --timeout-in-seconds 10 \
  --is-enabled true

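# Create a traffic management steering policy for failover
# (FAILOVER serves the primary while it is healthy; to spread traffic across
# both regions in an active-active design, use the LOAD_BALANCE template)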
oci dns steering-policy create \
  --compartment-id <compartment-ocid> \
  --display-name "app-dr-failover" \
  --template FAILOVER \
  --ttl 60 \
  --answers '[
    {"name": "primary", "rtype": "A", "rdata": "<primary-lb-ip>", "pool": "primary", "isDisabled": false},
    {"name": "standby", "rtype": "A", "rdata": "<standby-lb-ip>", "pool": "standby", "isDisabled": false}
  ]' \
  --rules '[
    {
      "ruleType": "FILTER",
      "defaultAnswerData": [
        {"answerCondition": "answer.isDisabled != true", "shouldKeep": true}
      ]
    },
    {
      "ruleType": "HEALTH"
    },
    {
      "ruleType": "PRIORITY",
      "defaultAnswerData": [
        {"answerCondition": "answer.pool == \u0027primary\u0027", "value": 1},
        {"answerCondition": "answer.pool == \u0027standby\u0027", "value": 2}
      ]
    },
    {
      "ruleType": "LIMIT",
      "defaultCount": 1
    }
  ]' \
  --health-check-monitor-id <primary-health-check-ocid>

DR for Specific OCI Services

Different OCI services have different DR capabilities and requirements. Here is a reference for the most commonly used services and their DR mechanisms.

| Service | DR Mechanism | RPO | Configuration |
|---|---|---|---|
| Autonomous DB | Cross-region Data Guard | Near-zero (async) | Automatic standby creation |
| DB System | Data Guard | Configurable | Manual association setup |
| Object Storage | Cross-region replication | Minutes | Replication policy per bucket |
| Block Volume | Cross-region backup copy | Backup frequency | Scheduled backup policy |
| Compute | Custom image copy | Image age | Copy images to standby region |
| OKE | Multi-cluster deployment | Application-dependent | Separate clusters per region |
| File Storage | Cross-region replication | Minutes | Replication target in standby |
| Vault | Cross-region replication | Near-instant | Virtual private vault replication |

Testing and Validating Your DR Plan

A comprehensive DR validation program includes multiple types of tests at different frequencies. Each test type validates different aspects of your DR readiness and involves different levels of risk and effort.

bash
#!/bin/bash
# dr-validation-script.sh - Automated DR readiness checks

# Replace these placeholder values before running
NAMESPACE="<namespace>"
PRIMARY_DB_OCID="<primary-db-ocid>"
COMPARTMENT_OCID="<compartment-ocid>"
STANDBY_COMPARTMENT_OCID="<standby-compartment-ocid>"
STANDBY_REGION="us-phoenix-1"

# Cutoff for "recent" backups, computed at run time (GNU date syntax)
CUTOFF="$(date -u -d '24 hours ago' '+%Y-%m-%dT%H:%M:%S')"

echo "=== OCI DR Readiness Report ==="
echo "Date: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo ""

# Check Data Guard status for the primary database
echo "--- Database Data Guard Status ---"
oci db data-guard-association list \
  --database-id "$PRIMARY_DB_OCID" \
  --query 'data[].{"Role": role, "State": "lifecycle-state", "Apply Lag": "apply-lag", "Transport": "transport-type"}' \
  --output table

# Check Object Storage replication status
echo ""
echo "--- Object Storage Replication ---"
for bucket in "app-data" "user-uploads" "config-backups"; do
  echo "Bucket: $bucket"
  oci os replication list-replication-policies \
    --namespace "$NAMESPACE" \
    --bucket-name "$bucket" \
    --query 'data[].{"Destination": "destination-region-name", "Status": status}' \
    --output table
done

# Check Block Volume backup freshness (CLI output keys are kebab-case)
echo ""
echo "--- Block Volume Backups (last 24h) ---"
oci bv volume-backup list \
  --compartment-id "$COMPARTMENT_OCID" \
  --query "data[?\"time-created\" >= '${CUTOFF}'].{\"Name\": \"display-name\", \"State\": \"lifecycle-state\", \"Created\": \"time-created\", \"Size (GB)\": \"size-in-gbs\"}" \
  --output table

# Verify standby region resources exist
echo ""
echo "--- Standby Region Resources ---"
oci compute instance list \
  --compartment-id "$STANDBY_COMPARTMENT_OCID" \
  --region "$STANDBY_REGION" \
  --query 'data[].{"Name": "display-name", "State": "lifecycle-state", "Shape": shape}' \
  --output table

# Check FSDR protection group status
echo ""
echo "--- DR Protection Groups ---"
oci disaster-recovery dr-protection-group list \
  --compartment-id "$COMPARTMENT_OCID" \
  --query 'data.items[].{"Name": "display-name", "Role": role, "State": "lifecycle-state"}' \
  --output table

# Check DNS steering policy state
echo ""
echo "--- DNS Traffic Management ---"
oci dns steering-policy list \
  --compartment-id "$COMPARTMENT_OCID" \
  --query 'data[].{"Name": "display-name", "State": "lifecycle-state", "Template": template}' \
  --output table

echo ""
echo "=== DR Readiness Check Complete ==="

Common DR Pitfalls

The most common reasons DR failovers fail in practice: (1) IAM policies do not cover the standby compartments or are out of date, (2) networking configuration (VCN peering, security lists, NSGs) was not replicated, (3) application secrets and certificates in the standby region have expired, (4) compute shapes used in primary are not available in the standby region, (5) capacity reservations were not created in the standby region. Validate all of these during your quarterly DR drills.
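Pitfall (4) is easy to check ahead of time by comparing the shapes you run in primary with the shapes offered in the standby region. The sketch below does the comparison on two whitespace-separated lists; the function name `missing_in_standby` is hypothetical, and how you produce the lists (for example from `oci compute shape list` in each region) is left as an assumption.

```shell
# Print every shape that is in the primary list but not in the standby list.
# Both arguments are whitespace-separated lists of shape names.
missing_in_standby() {
  local primary="$1" standby shape
  standby=$(echo $2)          # unquoted on purpose: collapse newlines and tabs to spaces
  for shape in $primary; do
    case " $standby " in
      *" $shape "*) ;;        # shape is offered in standby
      *) echo "$shape" ;;     # shape would block a failover
    esac
  done
}

# Example with literal lists; prints "BM.Standard.E4.128"
missing_in_standby \
  "VM.Standard.E4.Flex VM.Standard3.Flex BM.Standard.E4.128" \
  "VM.Standard.E4.Flex VM.Standard3.Flex"
```

An empty result means every primary shape has a match in standby. It says nothing about available capacity, which still needs a reservation or a drill to confirm.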

Cost Optimization for DR

DR infrastructure represents a cost that provides value only during a disaster. Optimizing this cost without compromising recovery capability requires careful architecture decisions. Use smaller compute shapes in the standby region and scale up during failover. Use preemptible instances for non-critical DR testing. Leverage OCI's Always Free resources where possible for basic monitoring and management in the standby region.

For the pilot light strategy, keep only database replicas and minimal management infrastructure running in the standby region. Use Terraform or Resource Manager stacks to provision the full compute and networking stack on demand during failover. This approach costs only the database replication and Object Storage fees, which are a fraction of running the full stack.

For warm standby, run a scaled-down version of your application (for example, one instance instead of four) that handles the standby database's read-only workload. This validates that the application works correctly in the standby region while keeping compute costs low. Scale up to production capacity only during failover using instance pools or OKE autoscaling.

Document your DR procedures, train your operations team, and keep runbooks updated as your application architecture evolves. An automated DR plan that nobody understands is almost as dangerous as no DR plan at all.


Key Takeaways

  1. OCI supports four DR tiers: backup/restore, pilot light, warm standby, and active-active.
  2. Full Stack DR orchestrates failover across compute, networking, databases, and custom scripts.
  3. Cross-region replication is available for Object Storage, Block Volumes, and databases.
  4. Regular DR drills are essential to validate recovery procedures and detect configuration drift.

Frequently Asked Questions

What is OCI Full Stack Disaster Recovery?
Full Stack DR is a managed service that orchestrates failover and switchback across your entire application stack. It coordinates compute, networking, load balancers, databases, Object Storage, and custom scripts into a single DR plan. Plans support built-in steps for OCI operations and custom scripts for application-specific tasks.
How does Data Guard work for cross-region DR?
Data Guard maintains a synchronized standby database in a different region by shipping redo logs. MAXIMUM PERFORMANCE mode (async) minimizes impact on the primary but can lose the most recent transactions in an unplanned failover. MAXIMUM AVAILABILITY mode provides zero data loss under normal conditions. Cross-region failover can be orchestrated with FSDR or run manually via the CLI or Console.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.