CloudWatch & Observability Guide
Master AWS CloudWatch metrics, alarms, Logs Insights, X-Ray tracing, dashboards, and cross-account observability.
Prerequisites
- Basic understanding of AWS services (EC2, Lambda, S3)
- Familiarity with JSON and AWS CLI
- AWS account with CloudWatch access
Why Observability Matters on AWS
Observability is more than monitoring; it is the ability to understand the internal state of your systems by examining their external outputs. While monitoring tells you when something is broken, observability tells you why it broke and helps you discover problems you did not know to look for. On AWS, observability is built around three pillars: metrics (CloudWatch Metrics), logs (CloudWatch Logs), and traces (AWS X-Ray). Together, they give you the full picture of how your applications behave in production.
Modern cloud architectures, including microservices, serverless functions, and event-driven pipelines, are inherently distributed. A single user request might traverse an API Gateway, invoke a Lambda function, query DynamoDB, publish to SNS, and trigger another Lambda downstream. Without proper observability, debugging a slow response or a failed request becomes a guessing game across dozens of services.
AWS provides a comprehensive observability stack natively through CloudWatch, X-Ray, and related services. While third-party tools like Datadog, New Relic, and Grafana Cloud offer additional capabilities, CloudWatch is deeply integrated with every AWS service and often the most cost-effective starting point. This guide covers how to leverage the full CloudWatch ecosystem, from basic metrics to advanced anomaly detection and cross-account observability.
The Three Pillars of Observability
Metrics are numeric measurements collected over time (e.g., CPU utilization, request count, error rate). Logs are immutable, timestamped records of discrete events (e.g., application errors, access logs, audit trails). Traces follow a single request as it traverses multiple services, showing latency at each hop. Effective observability requires all three pillars working together: metrics tell you something is wrong, logs tell you what happened, and traces tell you where the bottleneck is.
CloudWatch Metrics & Custom Metrics
CloudWatch Metrics are the backbone of AWS monitoring. Every AWS service automatically publishes metrics to CloudWatch. EC2 instances report CPU utilization, Lambda functions report invocation count and duration, RDS instances report read/write IOPS, and so on. These are called vended metrics or default metrics, and they are available at no extra charge (with standard resolution of 1-minute or 5-minute intervals depending on the service).
Each metric is identified by a namespace (e.g., AWS/EC2), a metric name (e.g., CPUUtilization), and zero or more dimensions (e.g., InstanceId=i-1234567890abcdef0). Dimensions act as filters: they let you slice a metric by specific resource, region, or custom attribute. Understanding this namespace/metric/dimension model is essential for building effective dashboards and alarms.
Publishing Custom Metrics
While AWS services publish hundreds of metrics automatically, your applications often need to track business-specific metrics: orders processed per minute, payment failures, cache hit ratios, or queue depth trends. CloudWatch custom metrics let you publish any numeric data point using the PutMetricData API.
# Publish a custom metric via the AWS CLI
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name "OrdersProcessed" \
--value 42 \
--unit Count \
--dimensions Environment=Production,Service=OrderProcessor
# Publish a custom metric with high-resolution (1-second granularity)
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name "PaymentLatencyMs" \
--value 156.3 \
--unit Milliseconds \
--storage-resolution 1 \
--dimensions Environment=Production,Service=PaymentGateway
Embedded Metrics Format (EMF)
The most efficient way to publish custom metrics from Lambda, ECS, or EC2 is through the CloudWatch Embedded Metrics Format. Instead of making separate API calls to PutMetricData, you embed metric definitions inside structured log events. CloudWatch automatically extracts the metrics without additional API calls or cost for the extraction itself.
const { createMetricsLogger, Unit } = require("aws-embedded-metrics");
exports.handler = async (event) => {
const metrics = createMetricsLogger();
// Set dimensions for the metric
metrics.setDimensions({ Service: "OrderProcessor", Environment: "Production" });
// Record custom metrics
metrics.putMetric("OrdersProcessed", 1, Unit.Count);
metrics.putMetric("ProcessingLatency", 156.3, Unit.Milliseconds);
metrics.putMetric("OrderValue", 49.99, Unit.None);
// Add searchable properties (not dimensions)
metrics.setProperty("OrderId", "ord-12345");
metrics.setProperty("CustomerId", "cust-67890");
// Flush metrics - they are emitted as structured log events
await metrics.flush();
return { statusCode: 200, body: "Order processed" };
};
EMF vs PutMetricData
Embedded Metrics Format is strongly preferred over direct PutMetricData API calls in Lambda functions. EMF has zero additional latency (metrics are extracted asynchronously from logs), while PutMetricData adds API call latency to your function execution. EMF also lets you include high-cardinality properties (like OrderId) as searchable log fields without creating expensive high-cardinality dimensions.
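Under the hood, the EMF library emits a single structured log line with an `_aws` envelope that tells CloudWatch which fields are metrics; everything else stays a searchable log field. A hand-rolled sketch of that envelope (namespace, names, and values are illustrative, following the published EMF specification):

```python
import json
import time

def emf_event(namespace, dimensions, metrics, properties):
    """Build a CloudWatch Embedded Metrics Format log event by hand.

    CloudWatch extracts every metric listed under _aws.CloudWatchMetrics;
    top-level keys that are NOT listed there (the properties) become
    searchable log fields only and never create billable metrics.
    """
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": n, "Unit": u} for n, u, _ in metrics],
            }],
        },
        **dimensions,                       # dimension values appear at top level
        **{n: v for n, _, v in metrics},    # metric values appear at top level
        **properties,                       # high-cardinality: searchable, not metrics
    }

event = emf_event(
    namespace="MyApp/Production",            # illustrative namespace
    dimensions={"Service": "OrderProcessor"},
    metrics=[("OrdersProcessed", "Count", 1)],
    properties={"OrderId": "ord-12345"},     # property, not a dimension
)
print(json.dumps(event))  # in Lambda, writing this line to stdout is enough
```

Printing one line like this per invocation is all the aws-embedded-metrics library does on your behalf; no PutMetricData call ever leaves the function.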
Metric Resolution and Retention
| Resolution | Granularity | Retention | Cost |
|---|---|---|---|
| Standard | 60 seconds | 15 months (aggregated) | $0.30 per metric per month |
| High-resolution | 1 second | 3 hours at 1s, then aggregated | $0.30 per metric per month + higher alarm cost |
| Vended (AWS default) | 1 or 5 minutes | 15 months (aggregated) | Free (included with service) |
CloudWatch retains metric data according to a rolling aggregation schedule: 1-second data points are available for 3 hours, 60-second data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (15 months). Data is not deleted; it is aggregated into coarser granularity over time.
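The rolling schedule above can be read as a lookup: given a data point's age, what is the finest granularity still available? A small sketch mirroring the published retention windows:

```python
def finest_resolution_seconds(age_hours: float) -> int:
    """Finest CloudWatch granularity still queryable for data of a given age.

    Mirrors the published schedule: 1-second points for 3 hours, 60-second
    points for 15 days, 5-minute points for 63 days, then 1-hour points
    out to 455 days (15 months).
    """
    if age_hours <= 3:
        return 1
    if age_hours <= 15 * 24:
        return 60
    if age_hours <= 63 * 24:
        return 300
    if age_hours <= 455 * 24:
        return 3600
    raise ValueError("data older than 15 months is no longer available")

print(finest_resolution_seconds(2))        # 1   (still inside the 3-hour window)
print(finest_resolution_seconds(30 * 24))  # 300 (past 15 days, inside 63 days)
```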
CloudWatch Alarms & Composite Alarms
CloudWatch Alarms watch a single metric over a specified time period and trigger actions when the metric crosses a threshold. Alarms are the bridge between observability and automated response: they can send notifications via SNS, trigger Auto Scaling policies, execute SSM Automation runbooks, or invoke Lambda functions.
Each alarm has three states: OK (the metric is within the threshold), ALARM (the metric has breached the threshold), and INSUFFICIENT_DATA (not enough data points to evaluate the alarm). You configure the evaluation period (how many consecutive data points to evaluate), the datapoints to alarm (how many of those periods must breach the threshold), and the comparison operator.
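This M-of-N evaluation is easy to misread, so here is a simplified sketch of the state logic (real CloudWatch also applies TreatMissingData rules, which are omitted here):

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified sketch of CloudWatch's M-of-N alarm evaluation.

    Looks at the most recent `evaluation_periods` data points and returns
    ALARM when at least `datapoints_to_alarm` of them breach the threshold
    (GreaterThanOrEqualToThreshold semantics assumed).
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value >= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# 3 of the last 5 one-minute averages must be >= 80 to alarm
print(alarm_state([70, 85, 90, 75, 88], threshold=80,
                  evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
```

Requiring 3 of 5 rather than 5 of 5 lets the alarm tolerate a single recovered data point without flapping back to OK.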
# CloudFormation - High CPU alarm with SNS notification
Resources:
HighCPUAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: high-cpu-production-web
AlarmDescription: "CPU utilization exceeds 80% for 5 minutes"
Namespace: AWS/EC2
MetricName: CPUUtilization
Dimensions:
- Name: AutoScalingGroupName
Value: !Ref WebServerASG
Statistic: Average
Period: 60
EvaluationPeriods: 5
DatapointsToAlarm: 3
Threshold: 80
ComparisonOperator: GreaterThanOrEqualToThreshold
TreatMissingData: missing
AlarmActions:
- !Ref AlertSNSTopic
- !Ref ScaleUpPolicy
OKActions:
- !Ref AlertSNSTopic
# Composite alarm - only fires if BOTH CPU and memory are high
HighResourceAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmName: high-resource-utilization
AlarmRule: >-
ALARM("high-cpu-production-web")
AND
ALARM("high-memory-production-web")
AlarmActions:
- !Ref CriticalAlertSNSTopic
AlarmDescription: "Both CPU and memory are elevated - likely resource exhaustion"
Composite Alarms
Composite alarms combine multiple alarms using boolean logic (AND, OR, NOT) to reduce alert noise. Instead of getting paged for every individual metric alarm, you can create a composite alarm that only fires when a meaningful combination of conditions is true. For example, a single high CPU alarm might be a normal traffic spike, but high CPU combined with high memory and elevated error rates likely indicates a real problem that needs human attention.
Composite alarms support suppression: you can configure a "wait period" before the composite alarm transitions to ALARM state, giving transient spikes time to resolve. You can also use composite alarms hierarchically: a top-level "service health" alarm can combine lower-level composite alarms for different subsystems.
Alarm Cost Awareness
Standard-resolution alarms cost $0.10/alarm/month, while high-resolution alarms (evaluating at 10-second periods) cost $0.30/alarm/month. Composite alarms are free. In large environments with thousands of resources, alarm costs can add up. Use composite alarms to reduce the total number of metric alarms needed, and use metric math to combine related metrics into a single alarm where possible.
CloudWatch Logs & Logs Insights
CloudWatch Logs is the centralized log management service on AWS. Logs from Lambda functions, ECS containers, EC2 instances (via CloudWatch Agent), API Gateway, VPC Flow Logs, and dozens of other services flow into CloudWatch Logs automatically or with minimal configuration. Logs are organized into log groups (typically one per application or service) and log streams (typically one per instance, container, or function invocation).
Each log event has a timestamp and a message. The message can be unstructured text, but structured JSON logging is strongly recommended because it enables powerful querying with Logs Insights and allows CloudWatch to extract Embedded Metrics Format data automatically.
Structured Logging Best Practices
import json
import logging
import os
# Configure structured JSON logging for Lambda
logger = logging.getLogger()
logger.setLevel(logging.INFO)
class JsonFormatter(logging.Formatter):
def format(self, record):
log_entry = {
"timestamp": self.formatTime(record),
"level": record.levelname,
"message": record.getMessage(),
"service": os.environ.get("SERVICE_NAME", "unknown"),
"function_name": os.environ.get("AWS_LAMBDA_FUNCTION_NAME", "local"),
"request_id": getattr(record, "request_id", "N/A"),
"trace_id": os.environ.get("_X_AMZN_TRACE_ID", "N/A"),
}
if record.exc_info:
log_entry["exception"] = self.formatException(record.exc_info)
return json.dumps(log_entry)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.handlers = [handler]
def lambda_handler(event, context):
logger.info("Processing order",
extra={"request_id": context.aws_request_id})
try:
order_id = event.get("order_id")
logger.info(f"Order {order_id} processing started",
extra={"request_id": context.aws_request_id})
# ... process order ...
logger.info(f"Order {order_id} completed successfully",
extra={"request_id": context.aws_request_id})
except Exception as e:
logger.error(f"Order processing failed: {str(e)}",
extra={"request_id": context.aws_request_id},
exc_info=True)
raise
CloudWatch Logs Insights Queries
Logs Insights is a powerful, purpose-built query language for searching and analyzing log data at scale. It can scan gigabytes of log data in seconds, supports aggregations, filters, and pattern matching, and can query across multiple log groups simultaneously. Logs Insights is billed per GB of data scanned, making it cost-effective for ad-hoc queries compared to running persistent analytics infrastructure.
# Find the 25 most recent errors with stack traces
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 25
# Calculate error rate by service over time
filter level = "ERROR"
| stats count(*) as error_count by bin(5m) as time_bucket, service
| sort time_bucket desc
# Find slowest Lambda invocations
filter @type = "REPORT"
| stats max(@duration) as max_duration,
avg(@duration) as avg_duration,
percentile(@duration, 99) as p99_duration
by bin(1h)
| sort max_duration desc
# Identify cold starts and their impact
filter @type = "REPORT"
| fields @duration, @initDuration, @billedDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count(*) as cold_starts,
avg(@initDuration) as avg_init_ms,
max(@initDuration) as max_init_ms,
avg(@duration) as avg_duration_ms
by bin(1h)
# Parse and analyze JSON-structured application logs
parse @message '{"timestamp":*,"level":"*","message":"*","service":"*","order_id":"*"}'
as timestamp, level, msg, service, order_id
| filter level = "ERROR"
| stats count(*) as errors by service
| sort errors desc
# Find the most common error messages (pattern analysis)
filter @message like /ERROR/
| pattern @message
| sort @sampleCount desc
| limit 20
Logs Insights Cost Optimization
Logs Insights charges $0.005 per GB of data scanned. To minimize costs, always scope your queries to specific log groups rather than querying all groups. Use time range filters aggressively, since scanning 1 hour of logs costs far less than scanning 30 days. Structure your logs as JSON so Logs Insights can discover fields automatically without expensive parse commands.
Log Retention and Storage Classes
| Retention Setting | Storage Cost (per GB) | Use Case |
|---|---|---|
| 1 day | $0.50 ingestion only | Development/debugging logs |
| 30 days | $0.50 ingestion + $0.03/GB storage | Short-term operational logs |
| 90 days | $0.50 ingestion + $0.03/GB storage | Standard operational retention |
| 1 year | $0.50 ingestion + $0.03/GB storage | Compliance requirements |
| Never expire | $0.50 ingestion + $0.03/GB storage | Audit logs; consider S3 export instead |
| Infrequent Access class | $0.25 ingestion + $0.01/GB storage | Logs you rarely query but must retain |
AWS X-Ray Distributed Tracing
AWS X-Ray provides distributed tracing across your AWS architecture. When a request enters your system, X-Ray assigns it a unique trace ID that propagates through every service the request touches. Each service records a segment (a unit of work), and services can record subsegments for individual operations like database queries or HTTP calls. The result is a visual map showing exactly how long each step took and where failures or bottlenecks occurred.
X-Ray integrates natively with API Gateway, Lambda, ECS, EC2, Elastic Beanstalk, SNS, SQS, and DynamoDB. For Lambda, enabling X-Ray is a single toggle; AWS automatically instruments the Lambda service segment. For deeper application-level tracing, you add the X-Ray SDK to instrument outbound HTTP calls, database queries, and custom operations.
import AWSXRay from "aws-xray-sdk-core";
import AWS from "aws-sdk";
import https from "https";
// Instrument the AWS SDK - all AWS service calls are automatically traced
AWSXRay.captureAWS(AWS);
// Instrument outbound HTTP calls
AWSXRay.captureHTTPsGlobal(https);
const dynamodb = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();
export const handler = async (event: any) => {
// Create a custom subsegment for business logic
const segment = AWSXRay.getSegment();
const subsegment = segment?.addNewSubsegment("ProcessOrder");
try {
// Add metadata and annotations for searchability
subsegment?.addAnnotation("orderId", event.orderId);
subsegment?.addAnnotation("customerId", event.customerId);
subsegment?.addMetadata("orderDetails", event);
// DynamoDB call is automatically traced
const order = await dynamodb.get({
TableName: "Orders",
Key: { orderId: event.orderId },
}).promise();
// S3 call is automatically traced
await s3.putObject({
Bucket: "order-receipts",
Key: `${event.orderId}/receipt.json`,
Body: JSON.stringify(order.Item),
}).promise();
subsegment?.addAnnotation("status", "success");
return { statusCode: 200, body: "Order processed" };
} catch (error) {
subsegment?.addError(error as Error);
throw error;
} finally {
subsegment?.close();
}
};
X-Ray Sampling Rules
X-Ray uses sampling to reduce the volume (and cost) of traces recorded. The default sampling rule traces the first request each second and 5% of additional requests. For production systems with high traffic, you should customize sampling rules to ensure adequate coverage for important paths while controlling costs.
{
"SamplingRule": {
"RuleName": "critical-api-paths",
"Priority": 100,
"FixedRate": 0.1,
"ReservoirSize": 10,
"ServiceName": "order-service",
"ServiceType": "*",
"Host": "*",
"HTTPMethod": "POST",
"URLPath": "/api/orders/*",
"ResourceARN": "*",
"Version": 1
}
}
The ReservoirSize defines how many requests per second are guaranteed to be traced (the reservoir). The FixedRate defines the percentage of additional requests beyond the reservoir that are sampled. A reservoir of 10 with a fixed rate of 0.1 means X-Ray traces 10 requests per second plus 10% of anything beyond that.
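That reservoir-plus-rate arithmetic is worth verifying when tuning rules: expected traces per second = min(rps, reservoir) + max(0, rps − reservoir) × fixed rate. A quick sketch:

```python
def traced_per_second(requests_per_second, reservoir_size, fixed_rate):
    """Expected traces/second under an X-Ray sampling rule.

    The reservoir guarantees a fixed number of traced requests each second;
    the fixed rate then samples a percentage of everything beyond it.
    """
    guaranteed = min(requests_per_second, reservoir_size)
    overflow = max(0.0, requests_per_second - reservoir_size)
    return guaranteed + overflow * fixed_rate

# The rule above (reservoir 10, fixed rate 0.1) at 500 requests/second:
print(traced_per_second(500, 10, 0.1))  # 59.0 -> 10 guaranteed + 10% of 490
```

At X-Ray's $5.00 per million recorded traces, multiplying this figure out over a month gives a quick cost estimate before you deploy a rule change.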
CloudWatch Dashboards & Visualization
CloudWatch Dashboards are customizable home pages in the CloudWatch console that you can use to monitor resources in a single view. Dashboards are global: they can include metrics from any AWS region and any account (with cross-account sharing). Each dashboard consists of widgets: metric graphs (line, stacked area, number, gauge), log table widgets, alarm status widgets, and text widgets for documentation.
Dashboards cost $3.00 per dashboard per month (first three are free). For most teams, creating a small number of well-designed dashboards is more effective than creating dozens of narrowly scoped ones. A good pattern is to have a top-level "service health" dashboard that provides an at-a-glance view of all services, with drill-down dashboards for each critical service.
# CloudFormation dashboard definition
Resources:
ServiceDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: production-overview
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "API Request Rate",
"metrics": [
["AWS/ApiGateway", "Count", "ApiName", "ProductionAPI",
{"stat": "Sum", "period": 60}]
],
"view": "timeSeries",
"region": "${AWS::Region}"
}
},
{
"type": "metric",
"x": 12, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "Error Rate (%)",
"metrics": [
[{"expression": "errors/total*100", "label": "Error Rate"}],
["AWS/ApiGateway", "5XXError", "ApiName", "ProductionAPI",
{"id": "errors", "stat": "Sum", "visible": false}],
["AWS/ApiGateway", "Count", "ApiName", "ProductionAPI",
{"id": "total", "stat": "Sum", "visible": false}]
],
"view": "timeSeries",
"yAxis": {"left": {"min": 0, "max": 100}}
}
},
{
"type": "alarm",
"x": 0, "y": 6, "width": 24, "height": 3,
"properties": {
"title": "Alarm Status",
"alarms": [
"arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-error-rate",
"arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-latency",
"arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:alarm:high-cpu"
]
}
},
{
"type": "log",
"x": 0, "y": 9, "width": 24, "height": 6,
"properties": {
"title": "Recent Errors",
"query": "SOURCE '/aws/lambda/order-processor' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
"region": "${AWS::Region}",
"view": "table"
}
}
]
}
Metric Math and Derived Metrics
CloudWatch Metric Math lets you create calculated metrics from existing metrics without publishing new custom metrics. This is powerful for computing ratios, percentages, and per-unit rates directly in dashboards and alarms. Common patterns include error rate (errors divided by total requests), cache hit ratio, and per-instance metrics in an Auto Scaling group.
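The same error-rate expression can also drive an alarm directly, not just a dashboard widget. A hedged sketch of the metrics array you would pass to `put-metric-alarm` via `--metrics` (API name, period, and IDs are illustrative; only the expression entry returns data, so the alarm evaluates the computed percentage):

```json
[
  {
    "Id": "errors",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": "ProductionAPI"}]
      },
      "Period": 60,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "total",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Count",
        "Dimensions": [{"Name": "ApiName", "Value": "ProductionAPI"}]
      },
      "Period": 60,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "errorRate",
    "Expression": "errors / total * 100",
    "Label": "Error Rate (%)",
    "ReturnData": true
  }
]
```

Because the ratio is computed server-side, this alarms on error rate without publishing (or paying for) a new custom metric.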
Anomaly Detection & AI-Powered Insights
CloudWatch Anomaly Detection uses machine learning to continuously analyze metrics and create a model of expected behavior. The model accounts for hourly, daily, and weekly patterns. For example, it learns that CPU is higher during business hours and lower at night. When a metric deviates significantly from the expected band, an anomaly detection alarm fires.
Unlike static threshold alarms, anomaly detection alarms adapt automatically to changing patterns. If your traffic gradually increases over months, the expected band shifts upward. This eliminates the constant tuning of static thresholds that plagues traditional monitoring setups.
# Create an anomaly detection alarm via CLI
aws cloudwatch put-anomaly-detector \
--namespace "AWS/ApplicationELB" \
--metric-name "TargetResponseTime" \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
--stat "Average"
# Create an alarm using the anomaly detection band
aws cloudwatch put-metric-alarm \
--alarm-name "latency-anomaly-detection" \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--comparison-operator GreaterThanUpperThreshold \
--threshold-metric-id "ad1" \
--metrics '[
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "TargetResponseTime",
"Dimensions": [
{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890"}
]
},
"Period": 300,
"Stat": "Average"
}
},
{
"Id": "ad1",
"Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
}
]' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:alerts"
CloudWatch Application Insights
CloudWatch Application Insights provides automated dashboards and anomaly detection for .NET, SQL Server, IIS, and other common application stacks. It automatically discovers application components (EC2 instances, RDS databases, ELBs, Lambda functions) and creates pre-configured monitors for common problems like memory leaks, SQL connection pool exhaustion, and HTTP 500 errors.
CloudWatch Contributor Insights
Contributor Insights helps you identify the top contributors to a metric or log pattern. For example, you can find the top 10 IP addresses making the most API calls, the top 10 Lambda functions with the highest error rate, or the top 10 DynamoDB partition keys receiving the most throttled requests. This is invaluable for identifying hot partitions, abusive clients, or misconfigured services.
CloudWatch Logs Anomaly Detection
In addition to metric anomaly detection, CloudWatch offers Log Anomaly Detection that automatically identifies unusual patterns in log data without you having to define rules. It uses machine learning to baseline your normal log patterns and alerts when it detects deviations, such as a sudden increase in a specific error message, a new error type appearing for the first time, or a significant change in log volume. This is enabled per log group and requires no configuration beyond turning it on.
Cross-Account & Cross-Region Observability
Most production AWS environments span multiple accounts (following the AWS Organizations multi-account strategy) and multiple regions (for disaster recovery or latency optimization). CloudWatch cross-account observability lets you designate one account as the monitoring account and connect source accounts to it. The monitoring account can then view metrics, logs, and traces from all connected source accounts in a single pane of glass.
This is configured through CloudWatch Observability Access Manager (OAM). You create an OAM sink in the monitoring account and OAM links in each source account. Once linked, the monitoring account can query cross-account metrics in dashboards, run Logs Insights queries across accounts, and view X-Ray traces that span multiple accounts.
# Monitoring account - create the OAM sink
Resources:
ObservabilitySink:
Type: AWS::Oam::Sink
Properties:
Name: central-monitoring-sink
Policy:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
AWS:
- "111111111111" # Source account 1
- "222222222222" # Source account 2
- "333333333333" # Source account 3
Action:
- "oam:CreateLink"
- "oam:UpdateLink"
Resource: "*"
Condition:
ForAllValues:StringEquals:
oam:ResourceTypes:
- "AWS::CloudWatch::Metric"
- "AWS::Logs::LogGroup"
- "AWS::XRay::Trace"
---
# Source account - create the OAM link
Resources:
ObservabilityLink:
Type: AWS::Oam::Link
Properties:
LabelTemplate: "$AccountName"
ResourceTypes:
- "AWS::CloudWatch::Metric"
- "AWS::Logs::LogGroup"
- "AWS::XRay::Trace"
SinkIdentifier: "arn:aws:oam:us-east-1:000000000000:sink/sink-id"
Cross-Region Dashboard Strategy
CloudWatch dashboards are global by nature, so a single dashboard can display widgets from any region. This lets you build a multi-region overview dashboard that shows the health of your application in all deployed regions side by side. Combine this with cross-account observability to create a single dashboard that monitors a multi-account, multi-region deployment from one location.
Cross-Region Data Transfer Costs
When you query metrics or logs from a different region in a dashboard, the data is fetched from the source region. This incurs standard AWS cross-region data transfer costs. For dashboards that are viewed frequently with auto-refresh enabled, these costs can accumulate. Consider creating regional dashboards for day-to-day use and reserving the global dashboard for incident response when you need the full picture.
Cost Optimization for Observability
CloudWatch costs can grow significantly in large environments, especially log ingestion and storage. Understanding the cost model and applying optimization strategies is essential to getting maximum observability value without budget surprises. The three biggest cost drivers are: log ingestion ($0.50/GB), custom metrics ($0.30/metric/month for the first 10,000), and Logs Insights queries ($0.005/GB scanned).
Cost Breakdown by Component
| Component | Pricing | Optimization Strategy |
|---|---|---|
| Log Ingestion | $0.50/GB (standard), $0.25/GB (infrequent access) | Filter at source, use IA class for low-query logs |
| Log Storage | $0.03/GB/month (standard), $0.01/GB/month (IA) | Set retention policies, export old logs to S3 |
| Custom Metrics | $0.30/metric/month (first 10K) | Reduce dimensions, use EMF properties instead |
| Dashboards | $3.00/dashboard/month (first 3 free) | Consolidate dashboards, use drill-down pattern |
| Alarms | $0.10/alarm/month (standard) | Use composite alarms, metric math to reduce count |
| Logs Insights | $0.005/GB scanned | Scope queries tightly, use time range filters |
| X-Ray Traces | $5.00/million traces recorded | Tune sampling rules, focus on critical paths |
| Contributor Insights | $0.02/rule/event matched | Limit to high-value use cases |
Log Ingestion Reduction Strategies
# Create a subscription filter to drop verbose debug logs before ingestion
# (Note: this filters log events for a destination, not for storage)
# Instead, adjust log levels at the application level:
# For Lambda - set LOG_LEVEL environment variable
aws lambda update-function-configuration \
--function-name my-function \
--environment "Variables={LOG_LEVEL=WARN}"
# The storage class can only be set at creation time: create new
# low-query log groups in the Infrequent Access class
aws logs create-log-group \
--log-group-name "/aws/lambda/batch-processor" \
--log-group-class INFREQUENT_ACCESS
# Set retention to avoid indefinite storage growth
aws logs put-retention-policy \
--log-group-name "/aws/lambda/my-function" \
--retention-in-days 30
# Export old logs to S3 for cheaper long-term storage
aws logs create-export-task \
--log-group-name "/aws/lambda/my-function" \
--from 1609459200000 \
--to 1612137600000 \
--destination "my-log-archive-bucket" \
--destination-prefix "lambda/my-function"
The 80/20 Rule of Log Costs
In most environments, 80% of log volume comes from 20% of log groups. Use the aws logs describe-log-groups command with --query to identify your largest log groups by stored bytes. Often, a single verbose service (like a busy API or a chatty batch processor) dominates your log costs. Addressing those few high-volume sources has far more impact than trying to optimize every log group.
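The ranking step can also be done locally over the DescribeLogGroups response; a sketch using an invented sample payload (names and byte counts are made up) that reports each group's share of total stored bytes:

```python
def top_log_groups(describe_response, n=3):
    """Rank log groups by storedBytes from a DescribeLogGroups response.

    Returns (name, stored_bytes, share_of_total) for the n largest groups.
    """
    groups = describe_response["logGroups"]
    ranked = sorted(groups, key=lambda g: g["storedBytes"], reverse=True)
    total = sum(g["storedBytes"] for g in groups)
    return [(g["logGroupName"], g["storedBytes"], g["storedBytes"] / total)
            for g in ranked[:n]]

# Sample shaped like a DescribeLogGroups response (values invented)
sample = {"logGroups": [
    {"logGroupName": "/aws/lambda/chatty-api", "storedBytes": 800_000_000_000},
    {"logGroupName": "/aws/lambda/order-processor", "storedBytes": 120_000_000_000},
    {"logGroupName": "/aws/lambda/nightly-batch", "storedBytes": 80_000_000_000},
]}
for name, size, share in top_log_groups(sample, n=2):
    print(f"{name}: {size / 1e9:.0f} GB ({share:.0%})")
```

In this invented sample one chatty API holds 80% of stored bytes, which is exactly the shape the 80/20 rule predicts.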
Custom Metric Cost Control
Custom metrics cost $0.30/metric/month for the first 10,000, dropping to $0.10 at higher volumes. Each unique combination of namespace + metric name + dimensions counts as a separate metric. A common mistake is using high-cardinality dimensions (like user IDs or request IDs) that create thousands of unique metrics. Instead, use EMF properties for high-cardinality data (they are searchable in logs but do not create metrics) and reserve dimensions for low-cardinality attributes like environment, service name, and region.
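Because the billing unit is each unique namespace + metric name + dimension-set combination, the metric count is the product of every dimension's cardinality; a quick sanity check shows why one high-cardinality dimension is ruinous:

```python
from math import prod

def metric_count(metric_names, dimension_cardinalities):
    """Unique billable metrics = metric names x product of dimension values."""
    return metric_names * prod(dimension_cardinalities)

# Low-cardinality dimensions: 5 metric names x 3 environments x 4 services
print(metric_count(5, [3, 4]))           # 60 metrics -> $18/month at $0.30 each
# The same metrics with a 10,000-value CustomerId dimension added
print(metric_count(5, [3, 4, 10_000]))   # 600,000 metrics -> runaway cost
```

Moving that CustomerId into an EMF property instead of a dimension keeps the count at 60 while the IDs remain searchable in the logs.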
Best Practices & Common Patterns
Building effective observability on AWS requires combining the right tools with disciplined practices. The following patterns represent lessons learned from production environments of various scales.
The Golden Signals
Start with Google's four golden signals for every service: Latency (how long requests take), Traffic (how many requests per second), Errors (the rate of failed requests), and Saturation (how full your system is). On AWS, these map to specific metrics per service:
| Signal | Lambda Metric | ALB Metric | DynamoDB Metric |
|---|---|---|---|
| Latency | Duration (p50, p99) | TargetResponseTime | SuccessfulRequestLatency |
| Traffic | Invocations | RequestCount | ConsumedReadCapacityUnits |
| Errors | Errors, Throttles | HTTPCode_Target_5XX | SystemErrors, UserErrors |
| Saturation | ConcurrentExecutions | ActiveConnectionCount | ConsumedWriteCapacityUnits vs Provisioned |
Observability Maturity Model
Level 1, Reactive: Basic CloudWatch alarms on individual resources (CPU, memory). You find out about problems when alarms fire or users complain. Logs exist but are unstructured and rarely queried.
Level 2, Proactive: Structured JSON logging, custom metrics for business KPIs, dashboards for golden signals. You can diagnose most issues within minutes using Logs Insights queries.
Level 3, Predictive: X-Ray distributed tracing across all services, anomaly detection alarms, Contributor Insights for hot-spot identification. You often detect and fix problems before users notice them.
Level 4, Autonomous: Cross-account observability with centralized monitoring, automated remediation via EventBridge + Lambda/SSM, runbooks for common failure modes, chaos engineering to validate observability coverage.
Start Simple, Iterate
Do not try to implement Level 4 observability on day one. Start with structured logging and golden signal dashboards (Level 2), then add tracing and anomaly detection as your team matures. The most common mistake is building sophisticated observability infrastructure that nobody actually uses. Focus on the tools and dashboards your on-call engineers actually look at during incidents, and expand from there.
Alerting Best Practices
Every alarm should have a clear owner, a documented runbook, and a defined severity level. Alarms that fire frequently without actionable response (alert fatigue) are worse than having no alarms, because they train your team to ignore alerts. Use composite alarms to reduce noise, set appropriate evaluation periods to avoid false positives from transient spikes, and regularly review alarm history to tune thresholds.
Route alarms to appropriate channels based on severity: critical alarms (service down) go to PagerDuty/OpsGenie for immediate human response, warning alarms (degraded performance) go to a Slack channel for team awareness, and informational alarms (approaching thresholds) go to email or a low-priority queue. Never page for something that can wait until morning.
Key Takeaways
1. CloudWatch Metrics, Logs, and Alarms form the foundation of AWS observability.
2. Custom metrics and the Embedded Metrics Format (EMF) enable application-specific monitoring.
3. CloudWatch Logs Insights provides powerful SQL-like queries for log analysis.
4. AWS X-Ray enables distributed tracing across microservices and serverless applications.
5. Cross-account observability with CloudWatch OAM centralizes monitoring in multi-account setups.
6. Anomaly detection uses machine learning to automatically identify unexpected metric behavior.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.