
CloudWatch & Observability Guide

Master AWS CloudWatch metrics, alarms, Logs Insights, X-Ray tracing, dashboards, and cross-account observability.

CloudToolStack Team · 24 min read · Published Feb 22, 2026

Prerequisites

  • Basic understanding of AWS services (EC2, Lambda, S3)
  • Familiarity with JSON and AWS CLI
  • AWS account with CloudWatch access

Why Observability Matters on AWS

Observability is more than monitoring; it is the ability to understand the internal state of your systems by examining their external outputs. While monitoring tells you when something is broken, observability tells you why it broke and helps you discover problems you did not know to look for. On AWS, observability is built around three pillars: metrics (CloudWatch Metrics), logs (CloudWatch Logs), and traces (AWS X-Ray). Together, they give you the full picture of how your applications behave in production.

Modern cloud architectures, including microservices, serverless functions, and event-driven pipelines, are inherently distributed. A single user request might traverse an API Gateway, invoke a Lambda function, query DynamoDB, publish to SNS, and trigger another Lambda downstream. Without proper observability, debugging a slow response or a failed request becomes a guessing game across dozens of services.

AWS provides a comprehensive observability stack natively through CloudWatch, X-Ray, and related services. While third-party tools like Datadog, New Relic, and Grafana Cloud offer additional capabilities, CloudWatch is deeply integrated with every AWS service and often the most cost-effective starting point. This guide covers how to leverage the full CloudWatch ecosystem, from basic metrics to advanced anomaly detection and cross-account observability.

The Three Pillars of Observability

Metrics are numeric measurements collected over time (e.g., CPU utilization, request count, error rate). Logs are immutable, timestamped records of discrete events (e.g., application errors, access logs, audit trails). Traces follow a single request as it traverses multiple services, showing latency at each hop. Effective observability requires all three pillars working together: metrics tell you something is wrong, logs tell you what happened, and traces tell you where the bottleneck is.

CloudWatch Metrics & Custom Metrics

CloudWatch Metrics are the backbone of AWS monitoring. Every AWS service automatically publishes metrics to CloudWatch. EC2 instances report CPU utilization, Lambda functions report invocation count and duration, RDS instances report read/write IOPS, and so on. These are called vended metrics or default metrics, and they are available at no extra charge (with standard resolution of 1-minute or 5-minute intervals depending on the service).

Each metric is identified by a namespace (e.g., AWS/EC2), a metric name (e.g., CPUUtilization), and zero or more dimensions (e.g., InstanceId=i-1234567890abcdef0). Dimensions act as filters: they let you slice a metric by specific resource, region, or custom attribute. Understanding this namespace/metric/dimension model is essential for building effective dashboards and alarms.

Publishing Custom Metrics

While AWS services publish hundreds of metrics automatically, your applications often need to track business-specific metrics: orders processed per minute, payment failures, cache hit ratios, or queue depth trends. CloudWatch custom metrics let you publish any numeric data point using the PutMetricData API.

bash
# Publish a custom metric via the AWS CLI
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "OrdersProcessed" \
  --value 42 \
  --unit Count \
  --dimensions Environment=Production,Service=OrderProcessor

# Publish a custom metric with high-resolution (1-second granularity)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "PaymentLatencyMs" \
  --value 156.3 \
  --unit Milliseconds \
  --storage-resolution 1 \
  --dimensions Environment=Production,Service=PaymentGateway

Embedded Metrics Format (EMF)

The most efficient way to publish custom metrics from Lambda, ECS, or EC2 is through the CloudWatch Embedded Metrics Format. Instead of making separate API calls to PutMetricData, you embed metric definitions inside structured log events. CloudWatch automatically extracts the metrics without additional API calls or cost for the extraction itself.

emf-example.js
const { createMetricsLogger, Unit } = require("aws-embedded-metrics");

exports.handler = async (event) => {
  const metrics = createMetricsLogger();

  // Set dimensions for the metric
  metrics.setDimensions({ Service: "OrderProcessor", Environment: "Production" });

  // Record custom metrics
  metrics.putMetric("OrdersProcessed", 1, Unit.Count);
  metrics.putMetric("ProcessingLatency", 156.3, Unit.Milliseconds);
  metrics.putMetric("OrderValue", 49.99, Unit.None);

  // Add searchable properties (not dimensions)
  metrics.setProperty("OrderId", "ord-12345");
  metrics.setProperty("CustomerId", "cust-67890");

  // Flush metrics - they are emitted as structured log events
  await metrics.flush();

  return { statusCode: 200, body: "Order processed" };
};

EMF vs PutMetricData

Embedded Metrics Format is strongly preferred over direct PutMetricData API calls in Lambda functions. EMF has zero additional latency (metrics are extracted asynchronously from logs), while PutMetricData adds API call latency to your function execution. EMF also lets you include high-cardinality properties (like OrderId) as searchable log fields without creating expensive high-cardinality dimensions.
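
Under the hood, each flush writes one structured log event that CloudWatch parses for metrics. Roughly what the emitted EMF payload for the handler above looks like is sketched below; the timestamp is illustrative, and the namespace shown is the library's default when none is configured:

```json
{
  "_aws": {
    "Timestamp": 1739836800000,
    "CloudWatchMetrics": [
      {
        "Namespace": "aws-embedded-metrics",
        "Dimensions": [["Service", "Environment"]],
        "Metrics": [
          {"Name": "OrdersProcessed", "Unit": "Count"},
          {"Name": "ProcessingLatency", "Unit": "Milliseconds"},
          {"Name": "OrderValue", "Unit": "None"}
        ]
      }
    ]
  },
  "Service": "OrderProcessor",
  "Environment": "Production",
  "OrdersProcessed": 1,
  "ProcessingLatency": 156.3,
  "OrderValue": 49.99,
  "OrderId": "ord-12345",
  "CustomerId": "cust-67890"
}
```

Note how OrderId and CustomerId appear only as top-level log fields, not in the Dimensions array: they are searchable in Logs Insights but do not generate billable metrics.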

Metric Resolution and Retention

Resolution | Granularity | Retention | Cost
Standard | 60 seconds | 15 months (aggregated) | $0.30 per metric per month
High-resolution | 1 second | 3 hours at 1s, then aggregated | $0.30 per metric per month + higher alarm cost
Vended (AWS default) | 1 or 5 minutes | 15 months (aggregated) | Free (included with service)

CloudWatch retains metric data according to a rolling aggregation schedule: 1-second data points are available for 3 hours, 60-second data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (15 months). Data is not deleted; it is aggregated into coarser granularity over time.

CloudWatch Alarms & Composite Alarms

CloudWatch Alarms watch a single metric over a specified time period and trigger actions when the metric crosses a threshold. Alarms are the bridge between observability and automated response: they can send notifications via SNS, trigger Auto Scaling policies, execute SSM Automation runbooks, or invoke Lambda functions.

Each alarm has three states: OK (the metric is within the threshold), ALARM (the metric has breached the threshold), and INSUFFICIENT_DATA (not enough data points to evaluate the alarm). You configure the evaluation period (how many consecutive data points to evaluate), the datapoints to alarm (how many of those periods must breach the threshold), and the comparison operator.

cloudwatch-alarm.yaml
# CloudFormation - High CPU alarm with SNS notification
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: high-cpu-production-web
      AlarmDescription: "CPU utilization exceeds 80% for 5 minutes"
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref WebServerASG
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      DatapointsToAlarm: 3
      Threshold: 80
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: missing
      AlarmActions:
        - !Ref AlertSNSTopic
        - !Ref ScaleUpPolicy
      OKActions:
        - !Ref AlertSNSTopic

  # Composite alarm - only fires if BOTH CPU and memory are high
  HighResourceAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: high-resource-utilization
      AlarmRule: >-
        ALARM("high-cpu-production-web")
        AND
        ALARM("high-memory-production-web")
      AlarmActions:
        - !Ref CriticalAlertSNSTopic
      AlarmDescription: "Both CPU and memory are elevated - likely resource exhaustion"

Composite Alarms

Composite alarms combine multiple alarms using boolean logic (AND, OR, NOT) to reduce alert noise. Instead of getting paged for every individual metric alarm, you can create a composite alarm that only fires when a meaningful combination of conditions is true. For example, a single high CPU alarm might be a normal traffic spike, but high CPU combined with high memory and elevated error rates likely indicates a real problem that needs human attention.

Composite alarms support action suppression: you designate another alarm as a suppressor, and while it is in ALARM state (for example, during a maintenance window) the composite alarm's actions are held back, with configurable wait and extension periods around the suppressor's transitions. You can also use composite alarms hierarchically: a top-level "service health" alarm can combine lower-level composite alarms for different subsystems.
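
In CloudFormation, the suppression mechanism is expressed with the ActionsSuppressor properties: while the designated suppressor alarm is in ALARM state, the composite alarm's actions are held back. A hedged sketch (alarm and topic names are illustrative):

```yaml
ServiceHealthAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: service-health
    AlarmRule: >-
      ALARM("high-cpu-production-web")
      AND ALARM("high-error-rate-production-web")
    # Hold back actions while the maintenance-window alarm is in ALARM
    ActionsSuppressor: maintenance-window-alarm
    ActionsSuppressorWaitPeriod: 120        # seconds to wait for the suppressor to enter ALARM
    ActionsSuppressorExtensionPeriod: 180   # seconds actions stay suppressed after it clears
    AlarmActions:
      - !Ref CriticalAlertSNSTopic
```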

Alarm Cost Awareness

Standard-resolution alarms cost $0.10/alarm/month, high-resolution alarms (evaluating at 10-second periods) cost $0.30/alarm/month, and composite alarms cost $0.50/alarm/month. In large environments with thousands of resources, alarm costs can add up. Use metric math to combine related metrics into a single alarm where possible, and use composite alarms to cut notification noise from the alarms you do keep.

CloudWatch Logs & Logs Insights

CloudWatch Logs is the centralized log management service on AWS. Logs from Lambda functions, ECS containers, EC2 instances (via CloudWatch Agent), API Gateway, VPC Flow Logs, and dozens of other services flow into CloudWatch Logs automatically or with minimal configuration. Logs are organized into log groups (typically one per application or service) and log streams (typically one per instance, container, or function invocation).

Each log event has a timestamp and a message. The message can be unstructured text, but structured JSON logging is strongly recommended because it enables powerful querying with Logs Insights and allows CloudWatch to extract Embedded Metrics Format data automatically.

Structured Logging Best Practices

structured-logging.py
import json
import logging
import os

# Configure structured JSON logging for Lambda
logger = logging.getLogger()
logger.setLevel(logging.INFO)

class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": os.environ.get("SERVICE_NAME", "unknown"),
            "function_name": os.environ.get("AWS_LAMBDA_FUNCTION_NAME", "local"),
            "request_id": getattr(record, "request_id", "N/A"),
            "trace_id": os.environ.get("_X_AMZN_TRACE_ID", "N/A"),
        }
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.handlers = [handler]

def lambda_handler(event, context):
    logger.info("Processing order",
                extra={"request_id": context.aws_request_id})

    try:
        order_id = event.get("order_id")
        logger.info(f"Order {order_id} processing started",
                    extra={"request_id": context.aws_request_id})
        # ... process order ...
        logger.info(f"Order {order_id} completed successfully",
                    extra={"request_id": context.aws_request_id})
    except Exception as e:
        logger.error(f"Order processing failed: {str(e)}",
                     extra={"request_id": context.aws_request_id},
                     exc_info=True)
        raise

CloudWatch Logs Insights Queries

Logs Insights is a powerful, purpose-built query language for searching and analyzing log data at scale. It can scan gigabytes of log data in seconds, supports aggregations, filters, and pattern matching, and can query across multiple log groups simultaneously. Logs Insights is billed per GB of data scanned, making it cost-effective for ad-hoc queries compared to running persistent analytics infrastructure.

logs-insights-queries.sql
# Find the 25 most recent errors with stack traces
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 25

# Calculate error rate by service over time
filter level = "ERROR"
| stats count(*) as error_count by bin(5m) as time_bucket, service
| sort time_bucket desc

# Find slowest Lambda invocations
filter @type = "REPORT"
| stats max(@duration) as max_duration,
        avg(@duration) as avg_duration,
        pct(@duration, 99) as p99_duration
  by bin(1h)
| sort max_duration desc

# Identify cold starts and their impact
filter @type = "REPORT"
| fields @duration, @initDuration, @billedDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as cold_starts,
        avg(@initDuration) as avg_init_ms,
        max(@initDuration) as max_init_ms,
        avg(@duration) as avg_duration_ms
  by bin(1h)

# Parse and analyze JSON-structured application logs
parse @message '{"timestamp":*,"level":"*","message":"*","service":"*","order_id":"*"}'
  as timestamp, level, msg, service, order_id
| filter level = "ERROR"
| stats count(*) as errors by service
| sort errors desc

# Find the most common error messages (pattern analysis)
filter @message like /ERROR/
| pattern @message
| sort @sampleCount desc
| limit 20

Logs Insights Cost Optimization

Logs Insights charges $0.005 per GB of data scanned. To minimize costs, always scope your queries to specific log groups rather than querying all groups. Use time range filters aggressively, since scanning 1 hour of logs costs far less than scanning 30 days. Structure your logs as JSON so Logs Insights can discover fields automatically without expensive parse commands.

Log Retention and Storage Classes

Retention Setting | Storage Cost (per GB) | Use Case
1 day | $0.50 ingestion only | Development/debugging logs
30 days | $0.50 ingestion + $0.03/GB storage | Short-term operational logs
90 days | $0.50 ingestion + $0.03/GB storage | Standard operational retention
1 year | $0.50 ingestion + $0.03/GB storage | Compliance requirements
Never expire | $0.50 ingestion + $0.03/GB storage | Audit logs; consider S3 export instead
Infrequent Access class | $0.25 ingestion + $0.01/GB storage | Logs you rarely query but must retain

AWS X-Ray Distributed Tracing

AWS X-Ray provides distributed tracing across your AWS architecture. When a request enters your system, X-Ray assigns it a unique trace ID that propagates through every service the request touches. Each service records a segment (a unit of work), and services can record subsegments for individual operations like database queries or HTTP calls. The result is a visual map showing exactly how long each step took and where failures or bottlenecks occurred.

X-Ray integrates natively with API Gateway, Lambda, ECS, EC2, Elastic Beanstalk, SNS, SQS, and DynamoDB. For Lambda, enabling X-Ray is a single toggle; AWS automatically instruments the Lambda service segment. For deeper application-level tracing, you add the X-Ray SDK to instrument outbound HTTP calls, database queries, and custom operations.

xray-instrumentation.ts
import AWSXRay from "aws-xray-sdk-core";
import AWS from "aws-sdk";
import https from "https";

// Instrument the AWS SDK - all AWS service calls are automatically traced
AWSXRay.captureAWS(AWS);

// Instrument outbound HTTP calls
AWSXRay.captureHTTPsGlobal(https);

const dynamodb = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

export const handler = async (event: any) => {
  // Create a custom subsegment for business logic
  const segment = AWSXRay.getSegment();
  const subsegment = segment?.addNewSubsegment("ProcessOrder");

  try {
    // Add metadata and annotations for searchability
    subsegment?.addAnnotation("orderId", event.orderId);
    subsegment?.addAnnotation("customerId", event.customerId);
    subsegment?.addMetadata("orderDetails", event);

    // DynamoDB call is automatically traced
    const order = await dynamodb.get({
      TableName: "Orders",
      Key: { orderId: event.orderId },
    }).promise();

    // S3 call is automatically traced
    await s3.putObject({
      Bucket: "order-receipts",
      Key: `${event.orderId}/receipt.json`,
      Body: JSON.stringify(order.Item),
    }).promise();

    subsegment?.addAnnotation("status", "success");
    return { statusCode: 200, body: "Order processed" };
  } catch (error) {
    subsegment?.addError(error as Error);
    throw error;
  } finally {
    subsegment?.close();
  }
};

X-Ray Sampling Rules

X-Ray uses sampling to reduce the volume (and cost) of traces recorded. The default sampling rule traces the first request each second and 5% of additional requests. For production systems with high traffic, you should customize sampling rules to ensure adequate coverage for important paths while controlling costs.

sampling-rules.json
{
  "SamplingRule": {
    "RuleName": "critical-api-paths",
    "Priority": 100,
    "FixedRate": 0.1,
    "ReservoirSize": 10,
    "ServiceName": "order-service",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "POST",
    "URLPath": "/api/orders/*",
    "ResourceARN": "*",
    "Version": 1
  }
}

The ReservoirSize defines how many requests per second are guaranteed to be traced (the reservoir). The FixedRate defines the percentage of additional requests beyond the reservoir that are sampled. A reservoir of 10 with a fixed rate of 0.1 means X-Ray traces 10 requests per second plus 10% of anything beyond that.
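
The resulting trace volume is straightforward to estimate. A quick sketch of the arithmetic (the function is my own, not part of any SDK):

```python
def traced_per_second(request_rate: float, reservoir: int, fixed_rate: float) -> float:
    """Estimate traces recorded per second under an X-Ray sampling rule.

    The reservoir is sampled first; the fixed rate applies only to
    traffic beyond the reservoir.
    """
    beyond_reservoir = max(request_rate - reservoir, 0.0)
    return min(request_rate, reservoir) + beyond_reservoir * fixed_rate

# At 100 req/s with the rule above (reservoir 10, fixed rate 0.1):
# 10 + 0.1 * 90 = 19 traces/s. Below 10 req/s, everything is traced.
```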

CloudWatch Dashboards & Visualization

CloudWatch Dashboards are customizable home pages in the CloudWatch console that you can use to monitor resources in a single view. Dashboards are global: they can include metrics from any AWS region and any account (with cross-account sharing). Each dashboard consists of widgets: metric graphs (line, stacked area, number, gauge), log table widgets, alarm status widgets, and text widgets for documentation.

Dashboards cost $3.00 per dashboard per month (first three are free). For most teams, creating a small number of well-designed dashboards is more effective than creating dozens of narrowly scoped ones. A good pattern is to have a top-level "service health" dashboard that provides an at-a-glance view of all services, with drill-down dashboards for each critical service.

dashboard.yaml
# CloudFormation dashboard definition
Resources:
  ServiceDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: production-overview
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0, "width": 12, "height": 6,
              "properties": {
                "title": "API Request Rate",
                "metrics": [
                  ["AWS/ApiGateway", "Count", "ApiName", "ProductionAPI",
                   {"stat": "Sum", "period": 60}]
                ],
                "view": "timeSeries",
                "region": "${AWS::Region}"
              }
            },
            {
              "type": "metric",
              "x": 12, "y": 0, "width": 12, "height": 6,
              "properties": {
                "title": "Error Rate (%)",
                "metrics": [
                  [{"expression": "errors/total*100", "label": "Error Rate"}],
                  ["AWS/ApiGateway", "5XXError", "ApiName", "ProductionAPI",
                   {"id": "errors", "stat": "Sum", "visible": false}],
                  ["AWS/ApiGateway", "Count", "ApiName", "ProductionAPI",
                   {"id": "total", "stat": "Sum", "visible": false}]
                ],
                "view": "timeSeries",
                "yAxis": {"left": {"min": 0, "max": 100}}
              }
            },
            {
              "type": "alarm",
              "x": 0, "y": 6, "width": 24, "height": 3,
              "properties": {
                "title": "Alarm Status",
                "alarms": [
                  "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-error-rate",
                  "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-latency",
                  "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-cpu"
                ]
              }
            },
            {
              "type": "log",
              "x": 0, "y": 9, "width": 24, "height": 6,
              "properties": {
                "title": "Recent Errors",
                "query": "SOURCE '/aws/lambda/order-processor' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
                "region": "${AWS::Region}",
                "view": "table"
              }
            }
          ]
        }

Metric Math and Derived Metrics

CloudWatch Metric Math lets you create calculated metrics from existing metrics without publishing new custom metrics. This is powerful for computing ratios, percentages, and per-unit rates directly in dashboards and alarms. Common patterns include error rate (errors divided by total requests), cache hit ratio, and per-instance metrics in an Auto Scaling group.
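
As a concrete illustration, the same error-rate expression used in the dashboard can drive an alarm directly, so you never publish a separate "error rate" custom metric. A sketch of the --metrics array you would pass to aws cloudwatch put-metric-alarm (API name and IDs are illustrative):

```json
[
  {
    "Id": "error_rate",
    "Expression": "(errors / total) * 100",
    "Label": "ErrorRate",
    "ReturnData": true
  },
  {
    "Id": "errors",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": "ProductionAPI"}]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "total",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Count",
        "Dimensions": [{"Name": "ApiName", "Value": "ProductionAPI"}]
      },
      "Period": 300,
      "Stat": "Sum"
    },
    "ReturnData": false
  }
]
```

Only the expression returns data; the two input metrics are fetched with ReturnData set to false, so the alarm evaluates the computed percentage.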

Anomaly Detection & AI-Powered Insights

CloudWatch Anomaly Detection uses machine learning to continuously analyze metrics and create a model of expected behavior. The model accounts for hourly, daily, and weekly patterns. For example, it learns that CPU is higher during business hours and lower at night. When a metric deviates significantly from the expected band, an anomaly detection alarm fires.

Unlike static threshold alarms, anomaly detection alarms adapt automatically to changing patterns. If your traffic gradually increases over months, the expected band shifts upward. This eliminates the constant tuning of static thresholds that plagues traditional monitoring setups.

bash
# Create an anomaly detection alarm via CLI
aws cloudwatch put-anomaly-detector \
  --namespace "AWS/ApplicationELB" \
  --metric-name "TargetResponseTime" \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
  --stat "Average"

# Create an alarm using the anomaly detection band
aws cloudwatch put-metric-alarm \
  --alarm-name "latency-anomaly-detection" \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --comparison-operator GreaterThanUpperThreshold \
  --threshold-metric-id "ad1" \
  --metrics '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "TargetResponseTime",
          "Dimensions": [
            {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890"}
          ]
        },
        "Period": 300,
        "Stat": "Average"
      }
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
    }
  ]' \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:alerts"

CloudWatch Application Insights

CloudWatch Application Insights provides automated dashboards and anomaly detection for .NET, SQL Server, IIS, and other common application stacks. It automatically discovers application components (EC2 instances, RDS databases, ELBs, Lambda functions) and creates pre-configured monitors for common problems like memory leaks, SQL connection pool exhaustion, and HTTP 500 errors.

CloudWatch Contributor Insights

Contributor Insights helps you identify the top contributors to a metric or log pattern. For example, you can find the top 10 IP addresses making the most API calls, the top 10 Lambda functions with the highest error rate, or the top 10 DynamoDB partition keys receiving the most throttled requests. This is invaluable for identifying hot partitions, abusive clients, or misconfigured services.
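
A Contributor Insights rule is a small JSON document. A sketch of one that ranks client IPs by count of failed requests in a JSON-formatted access-log group (the log group name and field paths are assumptions about your log schema):

```json
{
  "Schema": {
    "Name": "CloudWatchLogRule",
    "Version": 1
  },
  "LogGroupNames": ["/aws/apigateway/production-access-logs"],
  "LogFormat": "JSON",
  "Contribution": {
    "Keys": ["$.ip"],
    "Filters": [
      {"Match": "$.status", "GreaterThan": 399}
    ]
  },
  "AggregateOn": "Count"
}
```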

CloudWatch Logs Anomaly Detection

In addition to metric anomaly detection, CloudWatch offers Log Anomaly Detection that automatically identifies unusual patterns in log data without you having to define rules. It uses machine learning to baseline your normal log patterns and alerts when it detects deviations, such as a sudden increase in a specific error message, a new error type appearing for the first time, or a significant change in log volume. This is enabled per log group and requires no configuration beyond turning it on.

Cross-Account & Cross-Region Observability

Most production AWS environments span multiple accounts (following the AWS Organizations multi-account strategy) and multiple regions (for disaster recovery or latency optimization). CloudWatch cross-account observability lets you designate one account as the monitoring account and connect source accounts to it. The monitoring account can then view metrics, logs, and traces from all connected source accounts in a single pane of glass.

This is configured through CloudWatch Observability Access Manager (OAM). You create an OAM sink in the monitoring account and OAM links in each source account. Once linked, the monitoring account can query cross-account metrics in dashboards, run Logs Insights queries across accounts, and view X-Ray traces that span multiple accounts.

cross-account-oam.yaml
# Monitoring account - create the OAM sink
Resources:
  ObservabilitySink:
    Type: AWS::Oam::Sink
    Properties:
      Name: central-monitoring-sink
      Policy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - "111111111111"  # Source account 1
                - "222222222222"  # Source account 2
                - "333333333333"  # Source account 3
            Action:
              - "oam:CreateLink"
              - "oam:UpdateLink"
            Resource: "*"
            Condition:
              ForAllValues:StringEquals:
                oam:ResourceTypes:
                  - "AWS::CloudWatch::Metric"
                  - "AWS::Logs::LogGroup"
                  - "AWS::XRay::Trace"

---
# Source account - create the OAM link
Resources:
  ObservabilityLink:
    Type: AWS::Oam::Link
    Properties:
      LabelTemplate: "$AccountName"
      ResourceTypes:
        - "AWS::CloudWatch::Metric"
        - "AWS::Logs::LogGroup"
        - "AWS::XRay::Trace"
      SinkIdentifier: "arn:aws:oam:us-east-1:000000000000:sink/sink-id"

Cross-Region Dashboard Strategy

CloudWatch dashboards are global by nature, so a single dashboard can display widgets from any region. This lets you build a multi-region overview dashboard that shows the health of your application in all deployed regions side by side. Combine this with cross-account observability to create a single dashboard that monitors a multi-account, multi-region deployment from one location.

Cross-Region Data Transfer Costs

When you query metrics or logs from a different region in a dashboard, the data is fetched from the source region. This incurs standard AWS cross-region data transfer costs. For dashboards that are viewed frequently with auto-refresh enabled, these costs can accumulate. Consider creating regional dashboards for day-to-day use and reserving the global dashboard for incident response when you need the full picture.

Cost Optimization for Observability

CloudWatch costs can grow significantly in large environments, especially log ingestion and storage. Understanding the cost model and applying optimization strategies is essential to getting maximum observability value without budget surprises. The three biggest cost drivers are: log ingestion ($0.50/GB), custom metrics ($0.30/metric/month for the first 10,000), and Logs Insights queries ($0.005/GB scanned).

Cost Breakdown by Component

Component | Pricing | Optimization Strategy
Log Ingestion | $0.50/GB (standard), $0.25/GB (infrequent access) | Filter at source, use IA class for low-query logs
Log Storage | $0.03/GB/month (standard), $0.01/GB/month (IA) | Set retention policies, export old logs to S3
Custom Metrics | $0.30/metric/month (first 10K) | Reduce dimensions, use EMF properties instead
Dashboards | $3.00/dashboard/month (first 3 free) | Consolidate dashboards, use drill-down pattern
Alarms | $0.10/alarm/month (standard) | Use composite alarms, metric math to reduce count
Logs Insights | $0.005/GB scanned | Scope queries tightly, use time range filters
X-Ray Traces | $5.00/million traces recorded | Tune sampling rules, focus on critical paths
Contributor Insights | $0.50/rule/month + matched-event charges | Limit to high-value use cases

Log Ingestion Reduction Strategies

bash
# Create a subscription filter to drop verbose debug logs before ingestion
# (Note: this filters log events for a destination, not for storage)
# Instead, adjust log levels at the application level:

# For Lambda - set LOG_LEVEL environment variable
aws lambda update-function-configuration \
  --function-name my-function \
  --environment "Variables={LOG_LEVEL=WARN}"

# Log group class is set when the group is created and cannot be changed later;
# create new low-query log groups in the Infrequent Access class
aws logs create-log-group \
  --log-group-name "/aws/lambda/batch-processor" \
  --log-group-class INFREQUENT_ACCESS

# Set retention to avoid indefinite storage growth
aws logs put-retention-policy \
  --log-group-name "/aws/lambda/my-function" \
  --retention-in-days 30

# Export old logs to S3 for cheaper long-term storage
aws logs create-export-task \
  --log-group-name "/aws/lambda/my-function" \
  --from 1609459200000 \
  --to 1612137600000 \
  --destination "my-log-archive-bucket" \
  --destination-prefix "lambda/my-function"

The 80/20 Rule of Log Costs

In most environments, 80% of log volume comes from 20% of log groups. Use the aws logs describe-log-groups command with --query to identify your largest log groups by stored bytes. Often, a single verbose service (like a busy API or a chatty batch processor) dominates your log costs. Addressing those few high-volume sources has far more impact than trying to optimize every log group.

Custom Metric Cost Control

Custom metrics cost $0.30/metric/month for the first 10,000, dropping to $0.10 at higher volumes. Each unique combination of namespace + metric name + dimensions counts as a separate metric. A common mistake is using high-cardinality dimensions (like user IDs or request IDs) that create thousands of unique metrics. Instead, use EMF properties for high-cardinality data (they are searchable in logs but do not create metrics) and reserve dimensions for low-cardinality attributes like environment, service name, and region.
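
The multiplicative effect of dimensions is worth internalizing; a quick sketch of the arithmetic (the function and the cardinality figures are illustrative):

```python
from math import prod

def unique_metric_count(dimension_cardinalities: list[int]) -> int:
    """Each unique namespace + metric name + dimension-value combination
    is billed as a separate metric, so counts multiply across dimensions."""
    return prod(dimension_cardinalities)

# Safe: Environment (3 values) x Service (10) x Region (4) = 120 metrics
low = unique_metric_count([3, 10, 4])

# Dangerous: adding a CustomerId dimension with 50,000 values
# explodes the same metric into 6,000,000 billable metrics
high = unique_metric_count([3, 10, 4, 50_000])
```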

Best Practices & Common Patterns

Building effective observability on AWS requires combining the right tools with disciplined practices. The following patterns represent lessons learned from production environments of various scales.

The Golden Signals

Start with Google's four golden signals for every service: Latency (how long requests take), Traffic (how many requests per second), Errors (the rate of failed requests), and Saturation (how full your system is). On AWS, these map to specific metrics per service:

Signal | Lambda Metric | ALB Metric | DynamoDB Metric
Latency | Duration (p50, p99) | TargetResponseTime | SuccessfulRequestLatency
Traffic | Invocations | RequestCount | ConsumedReadCapacityUnits
Errors | Errors, Throttles | HTTPCode_Target_5XX | SystemErrors, UserErrors
Saturation | ConcurrentExecutions | ActiveConnectionCount | ConsumedWriteCapacityUnits vs Provisioned

Observability Maturity Model

Level 1, Reactive: Basic CloudWatch alarms on individual resources (CPU, memory). You find out about problems when alarms fire or users complain. Logs exist but are unstructured and rarely queried.

Level 2, Proactive: Structured JSON logging, custom metrics for business KPIs, dashboards for golden signals. You can diagnose most issues within minutes using Logs Insights queries.

Level 3, Predictive: X-Ray distributed tracing across all services, anomaly detection alarms, Contributor Insights for hot-spot identification. You often detect and fix problems before users notice them.

Level 4, Autonomous: Cross-account observability with centralized monitoring, automated remediation via EventBridge + Lambda/SSM, runbooks for common failure modes, chaos engineering to validate observability coverage.
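The day-to-day tool at Level 2 is a handful of saved Logs Insights queries. A query like the following (field names such as `latencyMs` and `route` assume your structured JSON logging convention) surfaces the slowest or failing recent requests in seconds:

```
fields @timestamp, requestId, route, latencyMs
| filter level = "ERROR" or latencyMs > 3000
| sort latencyMs desc
| limit 20
```

Because Logs Insights auto-discovers fields from JSON log lines, this only works well once your logs are structured, which is exactly why structured logging is the gateway to Level 2.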

Start Simple, Iterate

Do not try to implement Level 4 observability on day one. Start with structured logging and golden signal dashboards (Level 2), then add tracing and anomaly detection as your team matures. The most common mistake is building sophisticated observability infrastructure that nobody actually uses. Focus on the tools and dashboards your on-call engineers actually look at during incidents, and expand from there.

Alerting Best Practices

Every alarm should have a clear owner, a documented runbook, and a defined severity level. Alarms that fire frequently without actionable response (alert fatigue) are worse than having no alarms, because they train your team to ignore alerts. Use composite alarms to reduce noise, set appropriate evaluation periods to avoid false positives from transient spikes, and regularly review alarm history to tune thresholds.
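Composite alarms combine child alarms with a boolean rule, so you page only when multiple signals agree. As a sketch (alarm names and the SNS topic ARN are placeholders), these are the parameters for a composite alarm that fires only when both an error-rate alarm and a latency alarm are in ALARM state:

```python
# Hypothetical composite alarm over two existing child alarms.
composite_params = {
    "AlarmName": "checkout-service-critical",
    # Page only when errors AND latency are both breaching:
    "AlarmRule": 'ALARM("checkout-5xx-rate") AND ALARM("checkout-p99-latency")',
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:pagerduty-topic"],
}

# Applying it requires boto3 and AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_composite_alarm(**composite_params)
```

A brief latency blip without elevated errors (or vice versa) then stays a dashboard event rather than a page, which is one practical way to cut alert fatigue.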

Route alarms to appropriate channels based on severity: critical alarms (service down) go to PagerDuty/OpsGenie for immediate human response, warning alarms (degraded performance) go to a Slack channel for team awareness, and informational alarms (approaching thresholds) go to email or a low-priority queue. Never page for something that can wait until morning.


Key Takeaways

  1. CloudWatch Metrics, Logs, and Alarms form the foundation of AWS observability.
  2. Custom metrics and Embedded Metric Format (EMF) enable application-specific monitoring.
  3. CloudWatch Logs Insights provides a powerful purpose-built query language for log analysis.
  4. AWS X-Ray enables distributed tracing across microservices and serverless applications.
  5. Cross-account observability with CloudWatch OAM centralizes monitoring in multi-account setups.
  6. Anomaly detection uses machine learning to automatically identify unexpected metric behavior.

Frequently Asked Questions

What is the difference between CloudWatch Metrics and CloudWatch Logs?
CloudWatch Metrics are time-series numerical data points (CPU utilization, request count) used for dashboards and alarms. CloudWatch Logs store text-based log data from applications and services. You can create metric filters to extract metrics from log data.
How does AWS X-Ray differ from CloudWatch?
X-Ray is specifically for distributed tracing and tracks requests as they flow through multiple services. CloudWatch provides metrics, logs, and alarms. They complement each other: X-Ray shows request paths, while CloudWatch monitors resource health.
How much does CloudWatch cost?
CloudWatch offers a free tier with 10 custom metrics, 5 GB log ingestion, 3 dashboards, and 10 alarms. Beyond that, costs are per metric ($0.30/metric/month), per GB of log data ingested ($0.50/GB), and per alarm ($0.10/alarm/month).
Can I monitor multiple AWS accounts from one dashboard?
Yes. CloudWatch Observability Access Manager (OAM) lets you create cross-account links, allowing a central monitoring account to view metrics, logs, and traces from multiple source accounts in a single dashboard.
What is CloudWatch Anomaly Detection?
Anomaly Detection applies machine learning to your metrics to create a model of expected behavior. It automatically identifies anomalies without requiring you to set static thresholds. It works with both standard and custom metrics.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.