Building an Observability Stack: CloudWatch vs Azure Monitor vs Cloud Ops vs OCI Logging

The Three Pillars of Observability

Observability is the ability to understand the internal state of your system by examining its outputs: metrics, logs, and traces. Metrics tell you that something is wrong (CPU is at 95 percent, error rate is 5x normal). Logs tell you what went wrong (a specific error message, a stack trace, a request that failed validation). Traces tell you where it went wrong in a distributed system (the request was slow because the payment service took 3 seconds to respond to the order service). Together, these three pillars give you the information needed to detect, diagnose, and resolve production issues.

Every major cloud provider offers native observability tooling: AWS CloudWatch, Azure Monitor, GCP Cloud Operations (formerly Stackdriver), and OCI Logging and Monitoring. These tools are deeply integrated with their respective platforms, require minimal setup for basic functionality, and are the most cost-effective option for teams that operate exclusively on one cloud. But they differ significantly in capability, user experience, query languages, alerting sophistication, and cost. This article compares the native observability stacks across all four providers to help you understand what each offers and where the gaps are.

Metrics: Collection, Storage, and Querying

Metrics are numerical measurements collected at regular intervals: CPU utilization, request count, error rate, latency percentiles, queue depth. Every cloud service automatically emits metrics to its native monitoring platform, and you can publish custom metrics from your applications.

AWS CloudWatch collects metrics from over 90 AWS services automatically at 5-minute granularity for free (1-minute granularity with detailed monitoring at $0.30/metric/month for the first 10,000 metrics). Custom metrics cost $0.30/metric/month. CloudWatch metrics are queried using the CloudWatch Metrics Insights query language or the simpler metric math expressions. Dashboard widgets visualize metrics with graphs, numbers, and gauges. CloudWatch supports composite alarms that combine multiple alarms with AND/OR logic, and anomaly detection alarms that use machine learning to detect unusual patterns without setting static thresholds.

Azure Monitor collects platform metrics from all Azure services at 1-minute granularity for free, with 93 days of retention. Custom metrics are sent through the Application Insights SDK or the Azure Monitor OpenTelemetry distro. Azure Monitor uses Kusto Query Language (KQL) for querying metrics stored in Log Analytics workspaces. KQL is a powerful query language with support for aggregation, time-series analysis, joins, and rendering functions. Azure Workbooks provide rich, interactive dashboards that combine metrics, logs, and text in parameterized templates.

GCP Cloud Monitoring collects metrics from all GCP services automatically. Custom metrics are sent through the Cloud Monitoring API or OpenTelemetry SDK. GCP supports Monitoring Query Language (MQL) and PromQL for querying metrics, and it can ingest Prometheus metrics natively through Google Cloud Managed Service for Prometheus. This Prometheus compatibility is a significant advantage for teams already using Prometheus, as it eliminates the need to run and manage a Prometheus server. Cloud Monitoring dashboards support charts, alerts, and SLO tracking.

OCI Monitoring collects metrics from all OCI services with 1-minute granularity. Metrics are retained for 90 days for raw data points and up to 3 years for hourly aggregates. Custom metrics can be published through the Monitoring API. OCI uses MQL (Monitoring Query Language) for querying, which supports basic aggregation and filtering. The OCI Console provides pre-built service dashboards and custom dashboard creation. OCI's metrics capabilities are functional but less sophisticated than the other three providers in terms of query language and visualization.


Logging: Ingestion, Search, and Analysis

Logs are the most voluminous observability data type and often the most expensive to store and query. A moderately busy application can generate gigabytes of logs per day, and without proper management, log storage costs can exceed compute costs within months.

CloudWatch Logs ingests logs from EC2 instances (via the CloudWatch agent), Lambda functions (automatically), ECS containers (via awslogs driver), and any application using the CloudWatch Logs API. Ingestion costs $0.50 per GB. Storage costs $0.03 per GB per month. Log Insights provides a purpose-built query language for searching and analyzing logs with support for aggregation, filtering, pattern detection, and visualization. Log Insights queries are charged at $0.005 per GB of data scanned, which is reasonable but can add up for frequent ad-hoc queries across large log volumes.

Azure Log Analytics, part of Azure Monitor, is one of the most powerful cloud-native log analysis platforms. It uses KQL, which is arguably the most capable log query language among the four providers. KQL supports joins, regular expressions, time-series analysis, machine learning operators, and rendering. Log ingestion costs approximately $2.76 per GB (with commitment tiers reducing this to $1.96/GB at 200 GB/day). Data retention is free for 31 days, with paid retention up to 730 days. The cost per GB is higher than CloudWatch, but KQL's power reduces the time and effort needed to extract insights. Azure also offers Basic Logs at a reduced ingestion rate of $0.50/GB, with limited query capabilities, for high-volume, low-value logs such as verbose debug output.

GCP Cloud Logging automatically ingests logs from all GCP services. The first 50 GB per project per month is free, and additional ingestion costs $0.50 per GB. Logs can be queried using the Logs Explorer with a filter-based query syntax, or exported to BigQuery for SQL-based analysis. The BigQuery integration is GCP's strongest differentiator for log analysis: exporting logs to BigQuery lets you run complex analytical queries, join log data with other datasets, and retain logs indefinitely at BigQuery storage rates ($0.01-$0.02/GB/month for long-term storage). For organizations that need sophisticated log analytics, the GCP Logging plus BigQuery combination is difficult to beat.

OCI Logging collects logs from OCI services and custom sources. Logs are ingested into the Logging service and can be searched using the OCI Console or exported to Object Storage for long-term retention. OCI Logging Analytics provides more advanced log analysis with machine learning-based pattern detection, anomaly identification, and correlation. Logging Analytics ingestion is priced at approximately $0.60 per GB. The tool is functional but has a smaller community and fewer integrations than the other three providers.
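To make the ingestion prices above easier to compare, here is a minimal Python sketch that estimates monthly ingestion cost for a given daily log volume. The prices and the GCP free allowance are the figures quoted in this article, not live pricing. The function name, the 30-day month, and the single-project assumption for the GCP free tier are illustrative choices. Storage, retention, and query charges are deliberately left out.

```python
# Per-GB ingestion prices and free allowances as quoted in this article.
# Illustrative inputs only; check each provider's current price list.
INGESTION_PRICING = {
    "CloudWatch Logs": {"per_gb": 0.50, "free_gb": 0.0},
    "Azure Log Analytics": {"per_gb": 2.76, "free_gb": 0.0},
    "Azure Basic Logs": {"per_gb": 0.50, "free_gb": 0.0},
    "GCP Cloud Logging": {"per_gb": 0.50, "free_gb": 50.0},  # assumes one project
    "OCI Logging Analytics": {"per_gb": 0.60, "free_gb": 0.0},
}


def monthly_ingestion_cost(gb_per_day, days=30):
    """Return estimated ingestion cost per service for one month.

    Only ingestion is modeled; storage and query charges are ignored.
    """
    total_gb = gb_per_day * days
    costs = {}
    for name, price in INGESTION_PRICING.items():
        # Subtract any free monthly allowance before applying the per-GB rate.
        billable_gb = max(0.0, total_gb - price["free_gb"])
        costs[name] = round(billable_gb * price["per_gb"], 2)
    return costs


if __name__ == "__main__":
    for name, cost in monthly_ingestion_cost(10).items():
        print(f"{name:<24} ${cost:,.2f}")
```

At low volumes the GCP free allowance dominates the result; at high volumes the per-GB rate does, which is why the Azure standard tier stands out in this kind of estimate.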

Log cost control

Set explicit retention periods on all log groups, and never leave a group set to "never expire." Export logs to object storage for long-term retention at a fraction of the cost. Use log levels effectively: reduce verbose DEBUG logging in production to INFO or WARN. Sample high-volume logs rather than ingesting every request. These practices typically reduce log costs by 60-80%.
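The level and sampling advice can be sketched with Python's standard logging module. This is one possible approach, not a provider feature: the SamplingFilter class, the "checkout" logger name, and the 10 percent rate are hypothetical. Records at WARNING and above are always kept so that sampling never hides errors.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every record at or above `always_keep`; sample the rest."""

    def __init__(self, sample_rate, always_keep=logging.WARNING, rng=None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self.always_keep = always_keep
        self._rng = rng or random.Random()

    def filter(self, record):
        # Warnings and errors always pass through.
        if record.levelno >= self.always_keep:
            return True
        # Lower-severity records pass with probability sample_rate.
        return self._rng.random() < self.sample_rate


if __name__ == "__main__":
    logger = logging.getLogger("checkout")  # hypothetical logger name
    logger.setLevel(logging.INFO)           # drops DEBUG in production
    handler = logging.StreamHandler()
    handler.addFilter(SamplingFilter(sample_rate=0.1))
    logger.addHandler(handler)
    for i in range(100):
        logger.info("request %d served", i)  # roughly 10% are emitted
    logger.warning("payment retry")          # always emitted
```

Attaching the filter to the handler rather than the logger keeps the sampling decision next to the expensive step, which is shipping the record to the logging backend.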

Distributed Tracing

Distributed tracing follows a request as it travels through multiple services, recording the time spent in each service and the relationships between service calls. Traces are essential for debugging latency issues, understanding service dependencies, and identifying performance bottlenecks in microservices architectures.
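To illustrate what a trace records, the following Python sketch models spans as simple records with a parent link and computes each service's self time, meaning its duration minus the time spent in child spans. The Span class, the service names, and the timings are hypothetical and are not tied to any provider's trace format. The calculation assumes the children of a span do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_by_service(spans):
    """Return time each service spent on its own work, excluding child spans.

    Assumes sibling spans do not overlap (no parallel child calls).
    """
    child_time = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = (
                child_time.get(span.parent_id, 0) + span.duration_ms
            )
    totals = {}
    for span in spans:
        own = span.duration_ms - child_time.get(span.span_id, 0)
        totals[span.service] = totals.get(span.service, 0) + own
    return totals


# Hypothetical trace: gateway -> order service -> payment service.
TRACE = [
    Span("a", None, "api-gateway", 0, 3200),
    Span("b", "a", "order-service", 50, 3150),
    Span("c", "b", "payment-service", 100, 3100),
]

if __name__ == "__main__":
    totals = self_time_by_service(TRACE)
    print(totals, "bottleneck:", max(totals, key=totals.get))
```

In this made-up trace, nearly all of the request's latency is self time in the payment service, which is the kind of conclusion a service map or trace waterfall makes visible.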

AWS X-Ray provides distributed tracing across Lambda, API Gateway, ECS, EKS, EC2, and other AWS services. X-Ray traces are recorded as segments and subsegments, and the X-Ray console provides a service map showing service dependencies and latency. X-Ray supports sampling rules to control the volume of traces captured. Tracing costs $5.00 per million traces recorded and $0.50 per million traces retrieved. AWS also supports the OpenTelemetry standard through the AWS Distro for OpenTelemetry (ADOT), which provides a vendor-neutral instrumentation approach.
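X-Ray sampling rules combine a reservoir (a number of requests per second that are always traced) with a fixed rate applied to requests beyond the reservoir. The Python sketch below is a simplified local model of that idea, not the X-Ray SDK or its API. The ReservoirSampler class and its parameters are hypothetical, and the per-second bookkeeping is an assumption made for illustration.

```python
import random


class ReservoirSampler:
    """Trace up to `reservoir` requests per second, then `fixed_rate` of the rest."""

    def __init__(self, reservoir, fixed_rate, rng=None):
        self.reservoir = reservoir
        self.fixed_rate = fixed_rate
        self._rng = rng or random.Random()
        self._second = None  # wall-clock second currently being counted
        self._used = 0       # reservoir slots consumed in that second

    def should_sample(self, now):
        """Decide whether to trace a request arriving at time `now` (seconds)."""
        second = int(now)
        if second != self._second:
            # A new second has started: refill the reservoir.
            self._second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to probabilistic sampling.
        return self._rng.random() < self.fixed_rate


if __name__ == "__main__":
    sampler = ReservoirSampler(reservoir=1, fixed_rate=0.05)
    # 1,000 requests spread evenly over 10 seconds.
    decisions = [sampler.should_sample(i / 100) for i in range(1000)]
    print("sampled", sum(decisions), "of", len(decisions))
```

The reservoir guarantees that quiet services still produce some traces, while the fixed rate keeps the recorded volume, and therefore the per-trace charge, roughly proportional to traffic.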

Azure Application Insights provides distributed tracing as part of the broader Application Insights APM offering. Application Insights automatically instruments .NET, Java, Node.js, and Python applications to capture traces, and supports OpenTelemetry for vendor-neutral instrumentation. The Application Map visualizes service dependencies and highlights performance issues. Application Insights is priced based on data ingestion into Log Analytics at $2.76/GB, and traces contribute to this volume. For high-traffic applications, use adaptive sampling to reduce trace volume while maintaining statistical accuracy.

GCP Cloud Trace provides distributed tracing with automatic instrumentation for Google Cloud services. Cloud Trace supports OpenTelemetry and Zipkin-compatible trace formats. The first 2.5 million trace spans per month are free, with additional spans at $0.20 per million. Cloud Trace integrates with Cloud Monitoring for trace-based alerting and with Cloud Logging for correlating traces with log entries. The integration between Trace, Logging, and Monitoring in GCP is particularly well-done, allowing you to navigate from a trace span to the corresponding log entries and metrics with a single click.

OCI Application Performance Monitoring (APM) provides distributed tracing across OCI services and custom applications. APM supports OpenTelemetry for instrumentation. The Always Free tier includes 1,000 tracing events per hour. Paid tracing costs approximately $0.65 per 100 trace spans. APM also includes synthetic monitoring for proactive availability testing and browser monitoring for client-side performance tracking.


Alerting and Incident Response

Effective alerting is the bridge between observability data and human action. Good alerts are actionable (the recipient knows what to investigate), contextual (the alert includes enough information to begin diagnosis), and appropriately urgent (critical issues wake people up; informational issues create tickets). Bad alerts are noisy, vague, and fire too frequently, leading to alert fatigue and missed critical incidents.

CloudWatch Alarms support static thresholds, anomaly detection, composite alarms, and metric math expressions. Alarms can trigger SNS notifications, Lambda functions, Auto Scaling actions, and EC2 actions (stop, terminate, reboot). For incident management, CloudWatch integrates with AWS Systems Manager Incident Manager, which provides runbook automation, escalation policies, and post-incident analysis. CloudWatch alarms cost $0.10 per alarm per month (standard) or $0.30 per alarm per month (high resolution).

Azure Monitor Alerts support metric alerts, log alerts, activity log alerts, and smart detection alerts. Azure Action Groups define notification channels (email, SMS, webhook, Azure Function, Logic App, ITSM connector) and can be shared across multiple alerts. Azure Monitor also integrates with Microsoft Sentinel (formerly Azure Sentinel) for SIEM and SOAR capabilities. Alert rules cost approximately $0.10 per monitored signal per month for metric alerts.

GCP Cloud Monitoring alerting supports metric threshold alerts, absence conditions (alert when a metric stops being reported), and forecasting-based alerts (alert when a metric is projected to exceed a threshold). Notification channels include email, SMS, PagerDuty, Slack, webhooks, and Pub/Sub. GCP's SLO-based alerting is a standout feature: you define Service Level Objectives (e.g., 99.9% availability, P99 latency under 200ms), and GCP alerts when error budget burn rate exceeds sustainable levels.
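Error budget burn rate can be illustrated with a short calculation. The Python sketch below uses one common formulation: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1.0 consumes the budget exactly over the SLO period. The function names, the two-window check, and the threshold value are assumptions for illustration and are not a description of how GCP evaluates its alert policies.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means the budget lasts exactly the SLO period; 10.0 means
    it would be exhausted in one tenth of that period.
    """
    if total_events == 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (bad_events / total_events) / error_budget


def should_alert(short_window_rate, long_window_rate, threshold):
    """Alert only when both windows exceed the threshold.

    Requiring a short and a long window to agree filters out brief
    spikes while still reacting quickly to sustained burn.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts for a 99.9% availability SLO.
    short = burn_rate(bad_events=150, total_events=10_000, slo_target=0.999)
    long_ = burn_rate(bad_events=1_500, total_events=100_000, slo_target=0.999)
    print(short, long_, should_alert(short, long_, threshold=10.0))
```

Alerting on burn rate rather than on raw error rate ties the page to the reliability target: the same 1 percent error rate is an emergency for a 99.9% SLO and routine for a 95% one.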

OCI Monitoring Alarms support static thresholds on any metric with configurable evaluation periods and notification topics. Alarms trigger OCI Notifications, which can send to email, Slack, PagerDuty, webhooks, and OCI Functions. OCI's alerting is straightforward and functional but lacks advanced features such as anomaly detection and SLO-based alerting, which are available on the other platforms.

Dashboards and Visualization

Dashboards provide at-a-glance visibility into system health. Each provider offers dashboard capabilities, but the sophistication and usability vary significantly.

CloudWatch Dashboards support line charts, stacked areas, numbers, gauges, text widgets, log insights results, and alarm status. Dashboards can include metrics from multiple accounts and regions in a single view. Cross-account dashboards are available through CloudWatch cross-account observability. Dashboards cost $3.00 per month per dashboard.

Azure Workbooks are the most powerful native dashboard tool among the four providers. They support parameterized templates, conditional rendering, interactive elements (clicking a row in a table filters related charts), and integration with metrics, logs, and external data sources. Azure Dashboards provide a simpler pinboard-style view with tiles. Both are included at no additional cost.

GCP Cloud Monitoring dashboards support line charts, stacked areas, heatmaps, tables, scorecards, and SLO status widgets. Dashboards can include metrics from multiple projects. The SLO and error budget widgets are unique to GCP and provide immediate visibility into service reliability. Dashboards are free.

OCI dashboards are available through the OCI Console and provide basic visualization with charts and tables for metrics and alarms. For more sophisticated visualization, many OCI users export data to Grafana, which is available as an OCI-managed service.

OpenTelemetry and Vendor Neutrality

OpenTelemetry (OTel) is the CNCF standard for observability instrumentation, providing vendor-neutral APIs and SDKs for metrics, logs, and traces. All four cloud providers support OpenTelemetry to varying degrees, and adopting OTel as your instrumentation layer provides portability between observability backends.

AWS supports OTel through the AWS Distro for OpenTelemetry (ADOT), which is a distribution of the OpenTelemetry Collector and SDKs with AWS-specific exporters. ADOT can send traces to X-Ray, metrics to CloudWatch, and logs to CloudWatch Logs. Azure supports OTel through the Azure Monitor OpenTelemetry distro for .NET, Java, JavaScript, and Python. GCP provides the strongest OTel support, with native ingestion of OTel metrics, traces, and logs without requiring a proprietary SDK. OCI supports OTel through the APM service's trace and metric ingestion endpoints.

For teams that want to avoid cloud-specific instrumentation, instrument your applications using OpenTelemetry SDKs and use the OpenTelemetry Collector to export data to your cloud provider's native tools. This approach provides vendor neutrality at the application layer while leveraging the deep integrations and cost advantages of native observability tools.

Choosing Your Observability Strategy

For single-cloud deployments, use the native observability tools. They offer the deepest integration, the lowest latency for data availability, and the most cost-effective pricing for standard use cases. CloudWatch is solid for AWS-only shops. Azure Monitor with Application Insights provides the richest feature set. GCP Cloud Operations has the best trace-log-metric correlation and Prometheus compatibility. OCI Logging and Monitoring covers the basics well.

For multi-cloud or hybrid environments, consider third-party observability platforms (Datadog, Grafana Cloud, Elastic Observability, New Relic) that provide a unified view across all your cloud environments. The additional cost is justified by the operational simplicity of a single observability platform rather than managing separate tools for each cloud.

Regardless of your tooling choice, invest in structured logging (JSON format with consistent field names), distributed tracing (OpenTelemetry or provider SDK), and SLO-based alerting (define reliability targets and alert on error budget consumption). These practices improve your ability to detect and resolve incidents whichever observability tools you use.
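As a sketch of what structured logging with consistent field names can look like, the following Python example formats each log record as one JSON object using only the standard library. The JsonFormatter class, the field names, and the "context" convention for passing extra fields are hypothetical choices, not a standard required by any of the four providers.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {}
        # Optional per-call fields, passed as extra={"context": {...}}.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            entry.update(context)
        # Standard fields are written last so context cannot overwrite them.
        entry.update({
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "payment declined",
        extra={"context": {"trace_id": "abc123", "order_id": 42}},
    )
```

Including a trace identifier in every entry is what makes it possible to move from a trace span to the matching log lines, whichever backend stores them.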

Start with these three

Enable native metrics collection on all cloud services (free on all providers). Set up structured JSON logging with 30-day retention. Implement distributed tracing on your most critical request path. These three actions give you baseline observability coverage. You can expand and optimize from there.
