
Monitoring & Observability Comparison

Compare monitoring across AWS, Azure, and GCP: CloudWatch vs Azure Monitor vs Cloud Operations, plus OpenTelemetry and third-party platforms.

CloudToolStack Team · 25 min read · Published Feb 22, 2026

Prerequisites

  • Basic understanding of observability concepts (metrics, logs, traces)
  • Experience with at least one cloud monitoring tool
  • Familiarity with distributed systems

Multi-Cloud Observability Overview

Observability, the ability to understand the internal state of your systems from their external outputs, is the foundation of operating reliable cloud infrastructure. Every major cloud provider offers a native observability stack that covers metrics, logs, and traces. AWS provides CloudWatch and X-Ray. Azure offers Azure Monitor and Application Insights. Google Cloud delivers Cloud Operations Suite (formerly Stackdriver). Each is deeply integrated with its provider's services, but each creates a silo that makes multi-cloud visibility challenging.

For organizations running workloads across multiple clouds, or those who want vendor-neutral tooling, third-party platforms like Datadog, New Relic, and Grafana Cloud provide a unified observability layer. OpenTelemetry has emerged as the open standard for telemetry collection, enabling portable instrumentation that works with any backend.

This guide compares native observability services across all three major providers, evaluates third-party alternatives, and provides practical guidance for building a unified observability strategy. We cover metrics collection, log aggregation, distributed tracing, alerting, and cost optimization for each approach.

The Three Pillars Plus More

Traditional observability focuses on three pillars: metrics, logs, and traces. Modern observability extends this with profiling (continuous profiling of CPU, memory, and allocations), real user monitoring (RUM), synthetic monitoring, and error tracking. All three cloud providers and most third-party platforms now cover these extended pillars. This guide primarily focuses on the core three pillars but touches on extended capabilities where they differentiate providers.

AWS CloudWatch & X-Ray

Amazon CloudWatch is the cornerstone of AWS observability. It collects metrics from over 70 AWS services automatically, stores and queries logs via CloudWatch Logs, and provides dashboards, alarms, and anomaly detection. CloudWatch Metrics supports custom metrics, high-resolution metrics (1-second granularity), and Metrics Insights for SQL-like querying across metric namespaces.

AWS X-Ray provides distributed tracing for applications running on AWS. It traces requests as they flow through API Gateway, Lambda, ECS, EKS, and downstream services. X-Ray integrates with the AWS Distro for OpenTelemetry (ADOT), allowing you to instrument applications with OpenTelemetry SDKs and send traces to X-Ray.

CloudWatch Key Capabilities

  • CloudWatch Logs Insights: Purpose-built pipe-syntax query language for log analysis with visualization support
  • CloudWatch Metrics Insights: Query metrics across namespaces using SQL syntax
  • CloudWatch Anomaly Detection: ML-based anomaly detection on metrics using bands
  • CloudWatch Synthetics: Canary scripts that monitor endpoints and APIs on a schedule
  • CloudWatch RUM: Real user monitoring for web applications with Core Web Vitals
  • CloudWatch Application Signals: APM for applications instrumented with OpenTelemetry
  • Amazon Managed Grafana: Fully managed Grafana with native CloudWatch data source
bash
# Query CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /ecs/production/app \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message, @logStream
    | filter @message like /ERROR/
    | stats count(*) as errorCount by bin(5m)
    | sort errorCount desc
    | limit 100
  '

# Create a CloudWatch alarm on a custom metric
aws cloudwatch put-metric-alarm \
  --alarm-name high-error-rate \
  --metric-name 5xxErrors \
  --namespace Custom/MyApp \
  --statistic Sum \
  --period 300 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --treat-missing-data notBreaching

# Create a CloudWatch Synthetics canary
aws synthetics create-canary \
  --name api-health-check \
  --artifact-s3-location s3://canary-artifacts/api-health/ \
  --execution-role-arn arn:aws:iam::123456789012:role/canary-role \
  --schedule "Expression=rate(5 minutes)" \
  --code "Handler=apiCanary.handler,S3Bucket=canary-code,S3Key=canary.zip" \
  --runtime-version syn-nodejs-puppeteer-7.0

Azure Monitor & Application Insights

Azure Monitor is a comprehensive monitoring platform that collects, analyzes, and acts on telemetry from Azure and hybrid environments. It encompasses several sub-services: Application Insights (APM and distributed tracing), Log Analytics (log query and storage), Azure Monitor Metrics (time-series database), Azure Monitor Alerts, and Azure Workbooks (interactive reporting). The Kusto Query Language (KQL) powers log analysis and is one of the most powerful query languages in the observability space.

Application Insights provides automatic instrumentation for .NET, Java, Node.js, and Python applications. It captures request traces, dependencies, exceptions, and performance counters with minimal code changes. The Application Map feature visualizes service dependencies and highlights performance bottlenecks.

Azure Monitor Key Capabilities

  • Log Analytics workspaces: Centralized log storage with KQL-based querying and 730-day retention
  • Application Insights: APM with auto-instrumentation, smart detection, and application map
  • Azure Managed Grafana: Fully managed Grafana instance with Azure AD integration
  • Azure Managed Prometheus: Prometheus-compatible metrics service for Kubernetes workloads
  • Change Analysis: Detects infrastructure and configuration changes correlated with incidents
  • Azure Workbooks: Interactive, parameterized reports combining metrics, logs, and text
  • Availability tests: Multi-location ping and URL tests with SSL certificate monitoring
bash
# Query Application Insights logs using KQL via Azure CLI
az monitor app-insights query \
  --app my-app-insights \
  --resource-group rg-monitoring \
  --analytics-query '
    requests
    | where timestamp > ago(1h)
    | where toint(resultCode) >= 500
    | summarize errorCount = count() by bin(timestamp, 5m), operation_Name
    | order by timestamp desc
  '

# Create an Azure Monitor alert rule
az monitor metrics alert create \
  --name high-response-time \
  --resource-group rg-monitoring \
  --scopes /subscriptions/<sub-id>/resourceGroups/rg-app/providers/Microsoft.Web/sites/myapp \
  --condition "avg requests/duration > 2000" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action-group ops-team-ag

# Create a Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group rg-monitoring \
  --workspace-name central-logs \
  --location eastus \
  --retention-in-days 90

KQL Is a Superpower

The Kusto Query Language (KQL) used by Azure Log Analytics is exceptionally powerful for log analysis. It supports joins, time-series analysis, machine learning functions, rendering charts, and external data enrichment. If your team invests in learning KQL, it pays dividends across Azure Monitor, Application Insights, Microsoft Sentinel (SIEM), and Azure Data Explorer. KQL is arguably the strongest log query language among the three providers.

GCP Cloud Operations Suite

Google Cloud Operations Suite (formerly Stackdriver) provides integrated monitoring, logging, tracing, profiling, and error reporting for Google Cloud workloads. Cloud Monitoring collects metrics from GCP services and supports custom metrics, uptime checks, and dashboard creation. Cloud Logging is a fully managed log storage and analysis service with a powerful query syntax. Cloud Trace provides distributed tracing that is tightly integrated with GCP services.

A unique strength of GCP's observability offering is Cloud Profiler, a continuous profiling service that captures CPU, memory, and heap profiles from production applications with minimal overhead (less than 0.5%). This enables production debugging without reproducing issues in staging environments.

Cloud Operations Key Capabilities

  • Cloud Monitoring: Metrics collection with MQL (Monitoring Query Language) and PromQL support
  • Cloud Logging: Centralized logging with log-based metrics, sinks to BigQuery/Pub/Sub, and advanced filters
  • Cloud Trace: Distributed tracing with automatic instrumentation for GCP services
  • Cloud Profiler: Continuous production profiling with less than 0.5% overhead
  • Error Reporting: Automatic grouping and tracking of application errors across services
  • Managed Prometheus: Google-managed Prometheus with global query across clusters and projects
  • Service Monitoring: SLO-based monitoring with error budget tracking
bash
# Query Cloud Logging with advanced filter
gcloud logging read '
  resource.type="k8s_container"
  AND resource.labels.cluster_name="prod-cluster"
  AND severity>=ERROR
  AND timestamp>="2024-01-15T00:00:00Z"
' --limit 100 --format json

# Create a log-based metric
gcloud logging metrics create error_count \
  --description="Count of error log entries" \
  --log-filter='severity>=ERROR AND resource.type="k8s_container"'

# Create an uptime check
gcloud monitoring uptime create \
  --display-name="API Health Check" \
  --resource-type=uptime-url \
  --hostname=api.example.com \
  --path=/health \
  --protocol=HTTPS \
  --period=60s \
  --timeout=10s \
  --regions=USA,EUROPE,ASIA_PACIFIC

# Create an alerting policy
gcloud monitoring policies create \
  --display-name="High Error Rate" \
  --condition-display-name="Error rate > 5%" \
  --condition-filter='metric.type="logging.googleapis.com/user/error_count" AND resource.type="k8s_container"' \
  --condition-threshold-value=50 \
  --condition-threshold-duration=300s \
  --notification-channels=projects/my-project/notificationChannels/12345

Feature-by-Feature Comparison

The following table compares the native observability services across all three providers. Each excels in different areas: AWS has the broadest service integration, Azure has the most powerful query language, and GCP has the strongest Kubernetes and SRE-native tooling.

| Feature | AWS CloudWatch / X-Ray | Azure Monitor / App Insights | GCP Cloud Operations |
| --- | --- | --- | --- |
| Metrics storage | CloudWatch Metrics (15-month retention) | Azure Monitor Metrics (93-day retention) | Cloud Monitoring (13-month retention) |
| Log query language | CloudWatch Logs Insights (pipe syntax) | KQL (Kusto Query Language) | Cloud Logging filter + BigQuery SQL |
| Distributed tracing | X-Ray | Application Insights (distributed trace) | Cloud Trace |
| APM | Application Signals (new) | Application Insights (mature) | Cloud Trace + Profiler |
| Prometheus support | Amazon Managed Prometheus (AMP) | Azure Managed Prometheus | Google Managed Prometheus |
| Grafana support | Amazon Managed Grafana | Azure Managed Grafana | Via Grafana Cloud or self-hosted |
| Continuous profiling | CodeGuru Profiler | App Insights Profiler (.NET) | Cloud Profiler (all languages) |
| SLO monitoring | CloudWatch ServiceLevelObjective | Application Insights SLA reports | Service Monitoring (native SLO) |
| OpenTelemetry support | ADOT (AWS Distro for OTel) | Azure Monitor OTel Exporter | Native OTel collector integration |
| Log retention (max) | Indefinite (pay per GB stored) | 730 days (or archive to storage) | 3,650 days (or export to BigQuery) |

OpenTelemetry for Multi-Cloud

OpenTelemetry (OTel) is the CNCF project that provides vendor-neutral APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (metrics, logs, and traces). It has become the industry standard for instrumentation and is the single most important technology for achieving multi-cloud observability portability.

By instrumenting your applications with OpenTelemetry SDKs, you decouple telemetry generation from the backend that stores and analyzes it. You can send the same telemetry data to CloudWatch, Azure Monitor, Cloud Operations, Datadog, or any other OTLP-compatible backend simply by changing the exporter configuration. This flexibility is invaluable for multi-cloud organizations.

otel-collector-config.yaml
# OpenTelemetry Collector configuration for multi-cloud
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus metrics from Kubernetes pods
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add cloud-specific resource attributes
  resourcedetection:
    detectors: [env, system, ec2, azure, gcp]
    timeout: 5s

  # Filter out noisy health check spans
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.target
            value: "/(health|ready|live)"

exporters:
  # AWS CloudWatch / X-Ray
  awsxray:
    region: us-east-1
  awsemf:
    region: us-east-1
    namespace: MyApp

  # CloudWatch Logs (awsemf exports metrics only)
  awscloudwatchlogs:
    region: us-east-1
    log_group_name: /otel/app
    log_stream_name: collector

  # Azure Monitor
  azuremonitor:
    connection_string: InstrumentationKey=xxx;IngestionEndpoint=https://eastus-1.in.applicationinsights.azure.com/

  # GCP Cloud Operations
  googlecloud:
    project: my-gcp-project

  # Optional: Third-party backend
  otlp/datadog:
    endpoint: https://api.datadoghq.com:4317
    headers:
      dd-api-key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resourcedetection, filter]
      exporters: [awsxray, azuremonitor, googlecloud]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, resourcedetection]
      exporters: [awsemf, azuremonitor, googlecloud]
    logs:
      receivers: [otlp]
      processors: [batch, resourcedetection]
      exporters: [awscloudwatchlogs, azuremonitor, googlecloud]

ADOT, Azure OTel, and GCP OTel Distributions

Each cloud provider offers its own distribution of the OpenTelemetry Collector: AWS Distro for OpenTelemetry (ADOT), the Azure Monitor OpenTelemetry Distro, and Google Cloud's Ops Agent (which embeds an OTel collector). These distributions come preconfigured with provider-specific exporters and receivers. For multi-cloud deployments, use the upstream OpenTelemetry Collector with all three exporters configured, as shown above.

Third-Party Observability Platforms

Third-party observability platforms provide a unified view across all cloud providers, on-premises infrastructure, and SaaS applications. They eliminate the need to context-switch between provider-specific consoles and typically offer more advanced analytics, correlation, and incident management features than native tools.

Platform Comparison

| Platform | Strengths | Pricing Model | Multi-Cloud Support |
| --- | --- | --- | --- |
| Datadog | Broadest integration library (750+), unified platform, strong APM | Per host + per GB ingested (can be expensive at scale) | Excellent (native integrations for all 3 clouds) |
| New Relic | Generous free tier (100 GB/mo), full-stack observability, AI-powered analysis | Per user + per GB ingested (above free tier) | Excellent (200+ cloud integrations) |
| Grafana Cloud | Open-source ecosystem (Grafana, Loki, Tempo, Mimir), Prometheus-native | Per active metric series + per GB logs/traces | Excellent (uses open standards like Prometheus, OTel) |
| Splunk Observability | Enterprise-grade, strong log analytics, acquired SignalFx for APM | Per host + per GB ingested | Good (enterprise focus with cloud integrations) |
| Elastic Observability | Unified search across logs/metrics/traces, ELK stack ecosystem | Per resource unit (compute + storage) | Good (agent-based collection from any cloud) |

When to Use Third-Party vs. Native

Use native observability tools when you operate primarily within a single cloud provider and want to minimize cost and complexity. The native tools are free or low-cost for basic usage, deeply integrated with provider services, and require no additional agent deployment.

Use third-party platforms when you operate across multiple clouds, need a single pane of glass, require advanced analytics (ML-based anomaly detection, correlation across signals), or want to avoid vendor lock-in on your observability layer. Third-party platforms also tend to have better collaboration features (shared dashboards, annotations, incident timelines) for large teams.

Third-Party Cost Estimation

Third-party observability costs can grow quickly. The following estimates are for a mid-size deployment with 50 hosts, 500 GB logs/month, 10 million trace spans/month, and 5 users:

| Platform | Estimated Monthly Cost | Notes |
| --- | --- | --- |
| Datadog (Pro) | $2,500–$4,000 | $23/host (infra) + $0.10/GB logs + $1.70/100K spans |
| New Relic | $1,500–$2,500 | $0.35/GB ingested (above 100 GB free) + $49/user |
| Grafana Cloud (Pro) | $1,000–$2,000 | $8/1K active series + $0.50/GB logs + $5/trace span/mo |
| Splunk Observability | $3,000–$5,000 | Enterprise pricing per host + data volume |
| Elastic Cloud | $1,200–$2,000 | Per deployment size (compute + storage units) |

Observability Cost Can Exceed Infrastructure Cost

For small and mid-size deployments, it is not uncommon for third-party observability costs to approach or even exceed the cost of the infrastructure being monitored. Before committing to a third-party platform, estimate your data volume carefully and negotiate annual contracts for volume discounts. Consider a hybrid approach: use native tools for high-volume, low-value telemetry (e.g., infrastructure metrics) and a third-party platform for application-level observability (APM, traces, business metrics).

Centralized Logging Strategies

In multi-cloud environments, centralized logging is essential for cross-service correlation, compliance auditing, and incident investigation. There are three primary strategies for centralizing logs across providers:

Strategy 1: Third-Party Log Aggregator

Ship logs from all clouds to a single third-party platform (Datadog, Splunk, Elastic, Grafana Loki). This provides the simplest operational model with a single query interface and unified alerting. The downside is cost: ingestion-based pricing at third-party platforms can be expensive at high volume.

Strategy 2: Cloud-Native with Cross-Cloud Export

Use each provider's native logging service but export logs to a central store for cross-cloud analysis. For example, export CloudWatch Logs to S3, Azure Diagnostic Logs to Blob Storage, and Cloud Logging to Cloud Storage or BigQuery, then query them with a unified tool like Athena, Azure Data Explorer, or BigQuery.

Strategy 3: OpenTelemetry-Based Collection

Deploy the OpenTelemetry Collector on all workloads and configure it to send logs to both the native provider service (for real-time debugging) and a central backend (for cross-cloud analysis). This dual-shipping approach provides the best of both worlds but doubles log storage costs.

bash
# AWS: Create a CloudWatch Logs subscription filter to ship logs to S3 via Firehose
aws logs put-subscription-filter \
  --log-group-name /ecs/production/app \
  --filter-name ship-to-firehose \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:us-east-1:123456789012:deliverystream/logs-to-s3

# Azure: Create a diagnostic setting to export logs to Event Hub (for third-party ingestion)
az monitor diagnostic-settings create \
  --name export-logs \
  --resource /subscriptions/<sub>/resourceGroups/rg-app/providers/Microsoft.Web/sites/myapp \
  --event-hub-rule /subscriptions/<sub>/resourceGroups/rg-shared/providers/Microsoft.EventHub/namespaces/log-hub/authorizationRules/send \
  --logs '[{"category":"AppServiceHTTPLogs","enabled":true},{"category":"AppServiceConsoleLogs","enabled":true}]'

# GCP: Create a log sink to export to BigQuery for long-term analysis
gcloud logging sinks create bq-export-sink \
  bigquery.googleapis.com/projects/my-project/datasets/centralized_logs \
  --log-filter='resource.type="k8s_container" AND severity>=WARNING'

Log Volume and Cost Control

Logging costs can spiral quickly in multi-cloud environments. Implement log sampling for high-volume debug logs, use log-level filtering to exclude verbose entries from centralized stores, and set retention policies that match compliance requirements (not longer). A common pattern: retain info-level logs for 30 days, warning-level for 90 days, and error-level for 1 year. Use structured logging (JSON) to enable efficient querying and reduce the need for full-text search.
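The sampling-plus-retention pattern above can be sketched in a few lines. This is a minimal illustration, not a library API; the sample rates, retention tiers, and field names are assumptions chosen to mirror the pattern described:

```python
import json
import random

# Illustrative retention tiers from the pattern above: info 30d, warning 90d, error 1y.
RETENTION_DAYS = {"INFO": 30, "WARNING": 90, "ERROR": 365}

# Illustrative sample rates for shipping to the centralized store.
SAMPLE_RATES = {"DEBUG": 0.01, "INFO": 0.1, "WARNING": 1.0, "ERROR": 1.0}

def should_forward(record, rng=random.random):
    """Keep all warnings/errors; sample high-volume debug and info logs."""
    rate = SAMPLE_RATES.get(record.get("severity", "INFO"), 1.0)
    return rng() < rate

def make_record(severity, message, service):
    """Emit a structured (JSON) log line so the central store can filter cheaply."""
    return json.dumps({
        "severity": severity,
        "message": message,
        "service_name": service,
        "retention_days": RETENTION_DAYS.get(severity, 30),
    })

print(make_record("ERROR", "upstream timeout", "checkout-api"))
```

In production this logic usually lives in the collection agent (e.g., an OTel Collector sampling processor) rather than in application code, but the decision boundary is the same.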

Alerting & Incident Management

Effective alerting requires more than threshold-based rules. Modern observability platforms support composite alerts, anomaly detection, and SLO-based alerting (alert when error budgets are burning too fast rather than on raw metric thresholds). Each cloud provider's alerting system has different capabilities:

| Capability | AWS CloudWatch | Azure Monitor | GCP Cloud Monitoring |
| --- | --- | --- | --- |
| Metric alerts | Static & anomaly detection | Static, dynamic, & multi-resource | Static & MQL-based |
| Log alerts | Metric filters + alarms | Log alert rules (KQL-based) | Log-based metrics + alerting |
| Composite alerts | Composite alarms (AND/OR) | Alert processing rules | Alert policies with multiple conditions |
| SLO-based alerting | ServiceLevelObjective resource | Limited (custom KQL queries) | Native SLO monitoring with burn rate |
| Notification channels | SNS (email, SMS, Lambda, Slack) | Action groups (email, SMS, webhook, ITSM) | Notification channels (email, SMS, Slack, PagerDuty, webhook) |
| Auto-remediation | Lambda via SNS or EventBridge | Logic Apps / Azure Functions | Cloud Functions via Pub/Sub |
| Incident management | AWS Incident Manager | Azure Monitor ITSM connector | Google Cloud IRM (preview) |
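Whatever the provider calls it, a composite alert reduces to boolean logic over metric queries. A minimal conceptual sketch (the thresholds are illustrative, not provider defaults):

```python
def composite_alarm(error_rate, p99_latency_ms, availability):
    """Fire when errors AND latency breach together, OR availability breaches alone."""
    errors_and_latency = error_rate > 0.05 and p99_latency_ms > 2000
    availability_breach = availability < 0.995
    return errors_and_latency or availability_breach

# Elevated errors alone do not page; combined with high latency they do.
print(composite_alarm(0.08, 1500, 0.999))  # False
print(composite_alarm(0.08, 2500, 0.999))  # True
```

Requiring two correlated signals before paging is the main value of composite alerts: it suppresses pages for transient single-signal spikes.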

Multi-Cloud Alerting Strategy

For multi-cloud environments, centralize alerting through one of these approaches:

  • Third-party observability platform: Use Datadog, New Relic, or Grafana Cloud as the central alerting hub. Route all cloud-native alerts to the platform via webhooks or native integrations.
  • PagerDuty / Opsgenie: Use a dedicated incident management platform to aggregate alerts from all providers and manage on-call schedules, escalation policies, and runbooks.
  • Grafana Alerting: Use Grafana as a unified dashboard and alerting layer with data sources for CloudWatch, Azure Monitor, and Cloud Monitoring.

Alerting Configuration Example

The following example shows how to configure a Grafana alert rule that queries metrics from all three cloud providers simultaneously, enabling a single alert definition that covers your entire multi-cloud deployment:

grafana-multi-cloud-alert.yaml
# Grafana provisioned alert rule for multi-cloud error rate monitoring
apiVersion: 1
groups:
  - orgId: 1
    name: multi-cloud-api-health
    folder: Production Alerts
    interval: 1m
    rules:
      - uid: multi-cloud-error-rate
        title: "API Error Rate > 5% (Any Cloud)"
        condition: C
        data:
          # AWS CloudWatch data source
          - refId: A
            datasourceUid: cloudwatch-ds
            model:
              namespace: AWS/ApplicationELB
              metricName: HTTPCode_Target_5XX_Count
              statistic: Sum
              period: "300"
              dimensions:
                LoadBalancer: ["app/prod-alb/abc123"]
          # Azure Monitor data source
          - refId: B
            datasourceUid: azuremonitor-ds
            model:
              azureMonitor:
                resourceGroup: rg-app
                metricDefinition: Microsoft.Web/sites
                metricName: Http5xx
                timeGrain: PT5M
                aggregation: Total
          # GCP Cloud Monitoring data source
          - refId: C
            datasourceUid: stackdriver-ds
            model:
              metricType: loadbalancing.googleapis.com/https/request_count
              filters:
                - response_code_class
                - "500"
        noDataState: NoData
        execErrState: Error
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API error rate exceeds 5% on one or more cloud providers"
          runbook_url: "https://wiki.example.com/runbooks/api-error-rate"

SLO Monitoring Across Clouds

Service Level Objectives (SLOs) provide a framework for measuring reliability that is independent of the underlying infrastructure. Define SLOs based on user-facing metrics (availability, latency, throughput) and track error budgets across all cloud providers. GCP Cloud Monitoring has the most mature native SLO support with burn rate alerting. For multi-cloud SLO tracking, consider:

  • Nobl9: A dedicated SLO platform that integrates with all three cloud providers and third-party observability tools
  • Grafana SLO: Part of Grafana Cloud, provides SLO tracking with multi-data-source support
  • Sloth: Open-source SLO generator for Prometheus that works with any managed Prometheus service
  • Custom implementation: Use OpenTelemetry metrics with custom SLI calculations exported to a unified dashboard
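Whichever tool computes it, burn-rate alerting comes down to one ratio: how fast the error budget is being consumed relative to the SLO window. A minimal sketch:

```python
def burn_rate(errors, requests, slo_target):
    """Observed error rate divided by the error budget the SLO allows.

    1.0 means the budget is consumed exactly over the SLO window;
    well above 1.0 means it will be exhausted early and is worth paging on.
    """
    error_budget = 1.0 - slo_target       # 99.9% SLO -> 0.1% budget
    observed_error_rate = errors / requests
    return observed_error_rate / error_budget

# 99.9% availability SLO; 600 errors in 100,000 requests this window.
print(round(burn_rate(600, 100_000, 0.999), 1))  # 6.0 -> fast burn
```

Production implementations typically evaluate this over multiple windows (a short window for fast burns, a long window for slow burns) to balance detection speed against noise.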

Best Practices & Unified Observability

Building a unified observability strategy across multiple clouds requires intentional architecture decisions. The following best practices apply regardless of which tools you choose:

Instrumentation Standards

  • Adopt OpenTelemetry: Use OpenTelemetry SDKs for application instrumentation across all services. This ensures portable telemetry that works with any backend.
  • Structured logging: Use JSON-formatted logs with consistent field names (e.g., trace_id, span_id, service_name, environment) across all services and clouds.
  • Consistent naming: Establish naming conventions for metrics, log groups, and traces that include the cloud provider, environment, and service name (e.g., aws.prod.api-gateway).
  • Correlation IDs: Propagate trace context (W3C Trace Context headers) across all service boundaries, including cross-cloud calls.
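The W3C Trace Context header mentioned above is simple enough to construct by hand, which helps when debugging propagation across cloud boundaries. A sketch of the format (in practice the OpenTelemetry SDK manages this for you):

```python
import secrets

def make_traceparent():
    """New W3C traceparent: version 00, 16-byte trace-id, 8-byte span-id, sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(parent):
    """Propagate across a service boundary: same trace-id, fresh span-id."""
    version, trace_id, _span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = make_traceparent()
print(header)                     # e.g. 00-<32 hex>-<16 hex>-01
print(child_traceparent(header))  # same trace-id, new span-id
```

Because the trace-id survives every hop, a request that starts on AWS and fans out to Azure and GCP can still be stitched into one trace by any backend that receives the spans.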

Operational Guidelines

  • SLO-driven alerting: Define SLOs for each service and alert on error budget burn rate rather than raw thresholds. This reduces alert noise and focuses on user impact.
  • Dashboard hierarchy: Create a three-level dashboard hierarchy: executive (business KPIs), service (per-service health), and debug (detailed metrics and logs for incident investigation).
  • Cost governance: Monitor observability spend across all providers monthly. Set up alerts when log ingestion or metric cardinality exceeds thresholds. Use sampling for high-volume, low-value telemetry.
  • Runbooks: Attach runbooks to every alert. Include cross-cloud investigation procedures that reference the correct console, CLI commands, and log groups for each provider.
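Metric cardinality, the usual driver of runaway metrics bills, is the product of each label's distinct values, so it is worth estimating before adding a label. A quick illustration (the label names and counts are made up):

```python
def series_count(label_cardinalities):
    """Active time series for one metric = product of its label cardinalities."""
    total = 1
    for n in label_cardinalities.values():
        total *= n
    return total

labels = {"region": 3, "service": 20, "endpoint": 50, "status_code": 10}
print(series_count(labels))  # 30000 series from a single metric name
```

Adding one more label with even 100 values would multiply that to 3 million series, which is why high-cardinality identifiers (user IDs, request IDs) belong in logs and traces, not metric labels.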

Start with OpenTelemetry, Decide on Backend Later

If you are starting a new multi-cloud project, instrument everything with OpenTelemetry from day one. Send telemetry to each provider's native service initially (it is free or low-cost). When you need cross-cloud visibility, add a third-party exporter to your OTel Collector configuration without changing any application code. This approach gives you maximum flexibility with minimal upfront investment.


Key Takeaways

  1. All three providers offer integrated metrics, logging, and tracing, but with different architectures.
  2. AWS CloudWatch is the most tightly integrated, with the broadest AWS service coverage.
  3. Azure Monitor with KQL provides the most powerful query language for log analytics.
  4. GCP Cloud Operations offers the best integration with open-source tools and Prometheus.
  5. OpenTelemetry provides vendor-neutral instrumentation that works across all providers and third-party platforms.
  6. Third-party platforms (Datadog, New Relic, Grafana Cloud) offer unified dashboards across all clouds.

Frequently Asked Questions

Should I use native monitoring or a third-party tool?
For single-cloud environments, native tools are usually sufficient and cheaper. For multi-cloud or hybrid environments, third-party tools (Datadog, Grafana Cloud, New Relic) provide unified dashboards and correlation across providers. OpenTelemetry as an instrumentation layer gives you flexibility to switch backends.
What is OpenTelemetry?
OpenTelemetry (OTel) is a CNCF open-source observability framework providing APIs, SDKs, and the Collector for generating, collecting, and exporting telemetry data (metrics, logs, traces). It is vendor-neutral and supported by all three cloud providers and major observability platforms.
How does cost compare across providers?
GCP offers 50 GB/month free log ingestion. AWS CloudWatch charges $0.50/GB for log ingestion with 5 GB free. Azure charges $2.76/GB with the first 5 GB free. Custom metrics: AWS $0.30/metric, Azure $0.258/metric, GCP first 150 MB free. Third-party tools are typically more expensive but provide more features.
Which has the best alerting system?
All three have capable alerting. Azure Monitor with action groups provides the most flexible routing. AWS CloudWatch has composite alarms for complex conditions. GCP has SLO-based alerting natively. For advanced alerting across clouds, PagerDuty or Opsgenie integrate with all three.
Can I centralize logs from multiple clouds?
Yes. Options include: (1) OpenTelemetry Collector routing to a single backend, (2) Cloud-to-cloud log forwarding (e.g., AWS to GCP via Pub/Sub), (3) Third-party platforms (Datadog, Splunk, Elastic) that have integrations for all providers, (4) Self-hosted Grafana Loki or Elasticsearch.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.