
Cloud Logging & Monitoring Guide

Master GCP Cloud Logging, Cloud Monitoring, Cloud Trace, log-based metrics, uptime checks, alerting policies, and dashboard design.

CloudToolStack Team · 26 min read · Published Feb 22, 2026

Prerequisites

  • Basic understanding of GCP services
  • Familiarity with gcloud CLI
  • Experience with application development

GCP Observability Overview

Google Cloud's observability stack, formerly known as Stackdriver, provides a unified platform for collecting, analyzing, and acting on telemetry data across your entire cloud infrastructure. The suite includes Cloud Logging for log management, Cloud Monitoring for metrics and alerting, Cloud Trace for distributed tracing, Cloud Profiler for continuous production profiling, and Error Reporting for automatic exception grouping. Together, these services give you complete visibility into application health, performance, and behavior without needing to deploy third-party agents or manage dedicated observability infrastructure.

What makes GCP's observability stack particularly powerful is its deep integration with every GCP service. Compute Engine instances automatically emit system metrics, Cloud Run services generate request-level traces, and GKE clusters produce both node and pod-level telemetry out of the box. This native integration eliminates the bootstrapping problem that plagues third-party monitoring solutions, where getting the first metric or log line often requires hours of configuration.

The Google Cloud Operations suite builds on the three classic pillars of observability (logs, metrics, and traces), extended with continuous profiling and error intelligence:

| Pillar | Service | Purpose | Data Type |
| --- | --- | --- | --- |
| Logs | Cloud Logging | Centralized log ingestion, storage, and analysis | Structured & unstructured log entries |
| Metrics | Cloud Monitoring | Time-series metrics collection and visualization | Numeric measurements with timestamps |
| Traces | Cloud Trace | Distributed request tracing across services | Spans with timing and context |
| Profiling | Cloud Profiler | Continuous CPU and memory profiling | Statistical sampling profiles |
| Errors | Error Reporting | Automatic error grouping and notification | Stack traces and exception metadata |

Multi-Project Observability

Cloud Monitoring supports metrics scopes (formerly Stackdriver workspaces) that can aggregate data from up to 375 monitored projects into a single pane of glass. You designate one project as the scoping project, then add other projects as monitored projects. This is essential for organizations with dozens or hundreds of projects that need centralized dashboards and alerting policies.

Cloud Logging Architecture

Cloud Logging ingests log entries from every GCP service, on-premises systems, and other clouds through a scalable, fully managed pipeline. Logs arrive as structured JSON entries with a well-defined LogEntry schema that includes the log payload, resource metadata, severity level, timestamp, and labels. Every log entry is associated with a monitored resource type such as gce_instance, cloud_run_revision, or k8s_container, which determines how the entry is categorized and queried.

Logs are organized into log buckets, which are the storage containers for log data within a project. Every project has two default buckets: _Required (which stores Admin Activity and System Event audit logs for 400 days with no charge) and _Default (which stores all other ingested logs for 30 days and can be configured for retention up to 3,650 days). You can also create user-defined buckets with custom retention periods and regional storage locations.

The Logging API accepts entries through multiple ingestion paths:

  • Automatic ingestion: GCP services like Compute Engine, Cloud Run, GKE, and App Engine emit logs automatically. No configuration required.
  • Ops Agent: The Cloud Ops Agent (replacing the legacy Logging agent) collects system logs and application logs from Compute Engine VMs. It is built on Fluent Bit for logs and the OpenTelemetry Collector for metrics.
  • Client libraries: Application code writes structured logs using Cloud Logging client libraries available for Go, Java, Node.js, Python, Ruby, C#, and PHP.
  • REST/gRPC API: Direct API calls for custom integrations and third-party systems.
Install and configure the Ops Agent on a Compute Engine VM
# Install the Ops Agent on a single VM
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Verify the agent is running
sudo systemctl status google-cloud-ops-agent

# Configure custom log collection (e.g., Nginx access logs)
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << 'EOF'
logging:
  receivers:
    nginx_access:
      type: nginx_access
      include_paths:
        - /var/log/nginx/access.log
    nginx_error:
      type: nginx_error
      include_paths:
        - /var/log/nginx/error.log
    app_json:
      type: files
      include_paths:
        - /var/log/myapp/*.json
      record_log_file_path: true
  processors:
    parse_json:
      type: parse_json
      time_key: timestamp
      time_format: "%Y-%m-%dT%H:%M:%S.%LZ"
  service:
    pipelines:
      nginx_pipeline:
        receivers:
          - nginx_access
          - nginx_error
      app_pipeline:
        receivers:
          - app_json
        processors:
          - parse_json
metrics:
  receivers:
    nginx:
      type: nginx
      stub_status_url: http://localhost:80/status
  service:
    pipelines:
      nginx_metrics:
        receivers:
          - nginx
EOF

# Restart the agent to apply changes
sudo systemctl restart google-cloud-ops-agent

Writing Structured Logs from Application Code

While plain text logs work, structured logging unlocks the full power of Cloud Logging's query capabilities. Structured log entries include a JSON payload with typed fields that can be indexed, filtered, and aggregated. On GKE and Cloud Run, writing JSON to stdout/stderr automatically creates structured log entries, with special fields like severity, message, and httpRequest being parsed into their corresponding LogEntry fields.

Structured logging in Python with google-cloud-logging
import google.cloud.logging
import logging
import json

# Set up Cloud Logging client
client = google.cloud.logging.Client()
client.setup_logging()

# Standard logging calls are automatically sent to Cloud Logging
logger = logging.getLogger(__name__)

# Simple structured log
logger.info("User logged in", extra={
    "json_fields": {
        "user_id": "user-12345",
        "login_method": "oauth2",
        "ip_address": "203.0.113.42",
        "session_duration_ms": 0
    }
})

# For Cloud Run / GKE, write JSON to stdout for automatic parsing
log_entry = {
    "severity": "WARNING",
    "message": "High latency detected on payment service",
    "httpRequest": {
        "requestMethod": "POST",
        "requestUrl": "/api/v1/payments",
        "status": 200,
        "latency": "2.345s"
    },
    "logging.googleapis.com/labels": {
        "service": "payment-api",
        "environment": "production"
    },
    "logging.googleapis.com/trace": "projects/my-project/traces/abc123def456",
    "logging.googleapis.com/spanId": "span-789"
}
print(json.dumps(log_entry))

Log Router & Sinks

The Log Router is the heart of Cloud Logging's data pipeline. Every log entry that arrives at Cloud Logging passes through the Log Router, which evaluates the entry against a set of configured sinks. Each sink has an inclusion filter (and optional exclusion filters) that determines which log entries it captures, along with a destination where matching entries are exported. The router evaluates all sinks for every log entry, meaning a single entry can be routed to multiple destinations simultaneously.

There are two default sinks that exist in every project:

  • _Required: Captures Admin Activity audit logs, System Event audit logs, and Access Transparency logs. These cannot be modified or disabled. They are stored in the _Required bucket for 400 days at no charge.
  • _Default: Captures all log entries not handled by the _Required sink. By default, these go to the _Default bucket with 30-day retention. You can add exclusion filters to reduce ingestion costs.

Sinks can route logs to four types of destinations:

| Destination | Use Case | Latency | Cost Considerations |
| --- | --- | --- | --- |
| Cloud Logging bucket | Standard log storage with Logs Explorer access | Near real-time | $0.50/GiB ingestion beyond free tier |
| Cloud Storage | Long-term archival, compliance | Batched (hourly) | Low storage cost, especially Coldline/Archive |
| BigQuery | SQL-based log analytics, dashboards | Streaming (seconds) | BigQuery streaming insert + storage costs |
| Pub/Sub | Real-time processing, SIEM integration | Real-time | Pub/Sub message delivery costs |
Create log sinks with gcloud
# Create a sink that exports all error logs to BigQuery
gcloud logging sinks create error-logs-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/error_logs \
  --log-filter='severity >= ERROR' \
  --description="Export all error-level logs to BigQuery for analysis"

# Create a sink that exports audit logs to Cloud Storage for compliance
gcloud logging sinks create audit-to-gcs \
  storage.googleapis.com/my-audit-logs-bucket \
  --log-filter='logName:"cloudaudit.googleapis.com"' \
  --description="Archive audit logs to GCS for 7-year retention"

# Create a sink to Pub/Sub for real-time SIEM integration
gcloud logging sinks create siem-export \
  pubsub.googleapis.com/projects/my-project/topics/security-logs \
  --log-filter='logName:"cloudaudit.googleapis.com/activity" OR
    resource.type="gce_firewall_rule"' \
  --description="Stream security-relevant logs to Splunk via Pub/Sub"

# Add an exclusion filter to the _Default sink to reduce costs
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-debug,filter=severity = DEBUG'

# Create an aggregated sink at the organization level
gcloud logging sinks create org-wide-audit-sink \
  bigquery.googleapis.com/projects/central-logging/datasets/org_audit \
  --organization=123456789 \
  --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'

# After creating a sink, grant the sink's service account access
# to the destination. Get the writer identity:
gcloud logging sinks describe error-logs-to-bq --format='value(writerIdentity)'
# Then grant the appropriate role to the writer identity on the destination

Sink Service Account Permissions

When you create a sink, Cloud Logging generates a unique service account (the writer identity) that needs permission to write to the destination. For BigQuery destinations, grant roles/bigquery.dataEditor. For Cloud Storage, grant roles/storage.objectCreator. For Pub/Sub, grant roles/pubsub.publisher. Forgetting this step is the single most common reason sinks fail silently. Logs appear to vanish because the sink cannot write to its destination.
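The destination-to-role mapping above is easy to get wrong under pressure, so it can help to encode it. The sketch below is an illustrative helper (the function names and the project-level gcloud binding are our own; in practice you would usually grant the role directly on the target dataset, bucket, or topic rather than project-wide):

```python
def writer_role_for(destination: str) -> str:
    """Map a log sink destination URI to the IAM role its writer identity needs."""
    roles = {
        "bigquery.googleapis.com/": "roles/bigquery.dataEditor",
        "storage.googleapis.com/": "roles/storage.objectCreator",
        "pubsub.googleapis.com/": "roles/pubsub.publisher",
        "logging.googleapis.com/": "roles/logging.bucketWriter",
    }
    for prefix, role in roles.items():
        if destination.startswith(prefix):
            return role
    raise ValueError(f"Unrecognized sink destination: {destination}")

def grant_command(project: str, destination: str, writer_identity: str) -> str:
    """Compose a gcloud command that grants the writer identity the needed role."""
    role = writer_role_for(destination)
    return (
        f"gcloud projects add-iam-policy-binding {project} "
        f"--member='{writer_identity}' --role='{role}'"
    )

print(grant_command(
    "my-project",
    "bigquery.googleapis.com/projects/my-project/datasets/error_logs",
    "serviceAccount:service-123@gcp-sa-logging.iam.gserviceaccount.com",
))
```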

Log Analytics with BigQuery

Starting in 2023, Cloud Logging introduced Log Analytics, which lets you run SQL queries directly against log buckets that have been upgraded to use Log Analytics. This gives you BigQuery-compatible SQL access to your logs without needing to export them via a sink. Upgraded buckets create linked BigQuery datasets that can be queried alongside your other BigQuery data. This is particularly powerful for joining log data with business data for root cause analysis.

Upgrade a log bucket for Log Analytics and query it
# Upgrade the _Default bucket to support Log Analytics
gcloud logging buckets update _Default \
  --location=global \
  --enable-analytics

# Create a linked BigQuery dataset
gcloud logging links create my-log-link \
  --bucket=_Default \
  --location=global

# Now query logs using BigQuery SQL
bq query --use_legacy_sql=false '
SELECT
  timestamp,
  severity,
  json_value(json_payload, "$.message") AS message,
  json_value(json_payload, "$.user_id") AS user_id,
  resource.type AS resource_type
FROM
  `my-project.global._Default._AllLogs`
WHERE
  timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND severity = "ERROR"
ORDER BY
  timestamp DESC
LIMIT 100'

Cloud Monitoring Metrics

Cloud Monitoring collects time-series data (metrics) from GCP services, AWS connectors, custom applications, and the Ops Agent. Every metric is identified by a metric type string such as compute.googleapis.com/instance/cpu/utilization, and each time-series data point is associated with a monitored resource (the entity producing the metric) and a set of labels (key-value pairs that further describe the data point).

Metrics fall into several categories:

  • System metrics: Automatically collected by GCP services (CPU utilization, network bytes, disk I/O). These come at no additional cost.
  • Agent metrics: Collected by the Ops Agent from system-level counters and third-party applications (Nginx, MySQL, Redis, etc.). Available at no additional cost when using the agent.
  • Custom metrics: Written by your application code using the Cloud Monitoring API or OpenTelemetry. Charged at $0.2580 per MiB ingested beyond the free monthly allotment of 150 MiB.
  • Log-based metrics: Derived from log entries using filters. They turn log data into time-series metrics that can be used in dashboards and alerting policies.
  • External metrics: Collected from AWS CloudWatch through cross-cloud monitoring connectors.

Custom Metrics with OpenTelemetry

The recommended approach for instrumenting custom metrics is using OpenTelemetry with the Google Cloud exporter. OpenTelemetry provides vendor-neutral APIs that let you switch backends without changing your application code. Cloud Monitoring natively supports OpenTelemetry Protocol (OTLP) ingestion, making integration straightforward.

Custom metrics with OpenTelemetry in Python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter

# Configure the OpenTelemetry metrics pipeline
exporter = CloudMonitoringMetricsExporter(project_id="my-project")
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

# Create a meter and instruments
meter = metrics.get_meter("my-application", version="1.0.0")

# Counter: tracks cumulative values (e.g., total requests)
request_counter = meter.create_counter(
    name="app.requests.total",
    description="Total number of requests processed",
    unit="1"
)

# Histogram: tracks distributions (e.g., latency)
latency_histogram = meter.create_histogram(
    name="app.request.duration",
    description="Request processing duration",
    unit="ms"
)

# UpDownCounter: tracks values that go up and down (e.g., active connections)
active_connections = meter.create_up_down_counter(
    name="app.connections.active",
    description="Number of active connections",
    unit="1"
)

# Record metrics in your application code
def handle_request(request):
    import time
    start = time.time()
    active_connections.add(1, {"endpoint": request.path})

    try:
        response = process_request(request)
        request_counter.add(1, {
            "method": request.method,
            "endpoint": request.path,
            "status": response.status_code
        })
        return response
    finally:
        duration_ms = (time.time() - start) * 1000
        latency_histogram.record(duration_ms, {
            "method": request.method,
            "endpoint": request.path
        })
        active_connections.add(-1, {"endpoint": request.path})

Log-Based Metrics

Log-based metrics bridge the gap between logging and monitoring by converting log entries into time-series metrics. There are two types: counter metrics (which count the number of log entries matching a filter) and distribution metrics (which extract numeric values from log entries and record their statistical distribution). These are particularly valuable for tracking application-specific events without adding custom instrumentation code.

Create log-based metrics
# Create a counter metric that counts 5xx errors from a load balancer
gcloud logging metrics create http-5xx-errors \
  --description="Count of HTTP 5xx responses from load balancer" \
  --log-filter='resource.type="http_load_balancer"
    AND httpRequest.status >= 500'

# Create a distribution metric for response latency
gcloud logging metrics create response-latency \
  --description="Distribution of HTTP response latencies" \
  --log-filter='resource.type="http_load_balancer"' \
  --field-name='httpRequest.latency' \
  --field-type='DISTRIBUTION' \
  --bucket-boundaries='0.1,0.5,1.0,2.0,5.0,10.0'

# Create a counter metric with labels extracted from log fields
gcloud logging metrics create api-errors-by-endpoint \
  --description="API errors grouped by endpoint and status code" \
  --log-filter='resource.type="cloud_run_revision"
    AND severity >= ERROR' \
  --label-extractors='endpoint=EXTRACT(httpRequest.requestUrl),
    status_code=EXTRACT(httpRequest.status)'

Uptime Checks & SLOs

Uptime checks are probes sent from globally distributed Google data centers to verify that your services are reachable and responsive. Cloud Monitoring supports HTTP, HTTPS, and TCP uptime checks, with probes originating from multiple geographic regions simultaneously. If a check fails from a configurable number of regions, it triggers an alerting policy. Uptime checks are the foundation of availability monitoring and SLO tracking.

Each uptime check can validate response status codes, response body content, SSL certificate validity, and latency thresholds. You can target checks at public IP addresses, fully qualified domain names, Cloud Run services, App Engine services, or Compute Engine instance groups. Private uptime checks can probe internal resources that are reachable only within your VPC, using Service Directory to resolve the private endpoint.

Create uptime checks with gcloud and Terraform
# Create an HTTPS uptime check using gcloud
gcloud monitoring uptime create my-api-check \
  --display-name="Production API Health Check" \
  --resource-type=uptime-url \
  --hostname=api.example.com \
  --path=/health \
  --port=443 \
  --use-ssl \
  --validate-ssl \
  --check-interval=60 \
  --timeout=10 \
  --content-matchers='content={"status":"healthy"},matcher=CONTAINS_STRING' \
  --regions=USA,EUROPE,ASIA_PACIFIC,SOUTH_AMERICA

# List all uptime checks
gcloud monitoring uptime list-configs

# Describe a specific uptime check
gcloud monitoring uptime describe my-api-check
main.tf - Uptime check and SLO with Terraform
resource "google_monitoring_uptime_check_config" "api_health" {
  display_name = "Production API Health"
  timeout      = "10s"
  period       = "60s"

  http_check {
    path           = "/health"
    port           = 443
    use_ssl        = true
    validate_ssl   = true
    request_method = "GET"

    accepted_response_status_codes {
      status_class = "STATUS_CLASS_2XX"
    }

    content_matchers {
      content = "healthy"
      matcher = "CONTAINS_STRING"
    }
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "api.example.com"
    }
  }

  selected_regions = ["USA", "EUROPE", "ASIA_PACIFIC"]
}

# Define an SLO based on the uptime check
resource "google_monitoring_slo" "api_availability_slo" {
  service      = google_monitoring_custom_service.api_service.service_id
  display_name = "99.9% Availability SLO"
  goal         = 0.999

  rolling_period_days = 30

  # uptime_check/check_passed is a boolean metric, so a windows-based SLI
  # with a good/bad filter fits it; a good_total_ratio with identical good
  # and total filters would always evaluate to 100%.
  windows_based_sli {
    window_period = "300s"

    good_bad_metric_filter = join(" AND ", [
      "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\"",
      "resource.type=\"uptime_url\"",
      "metric.labels.check_id=\"${google_monitoring_uptime_check_config.api_health.uptime_check_id}\""
    ])
  }

}

resource "google_monitoring_custom_service" "api_service" {
  service_id   = "api-service"
  display_name = "Production API Service"
}

SLO Error Budget Alerts

Once you define SLOs, create burn rate alerts that trigger when your error budget is being consumed too quickly. A typical multi-window approach uses a fast burn rate (14.4x over 1 hour, confirmed by 5 minutes) for paging and a slow burn rate (6x over 6 hours, confirmed by 30 minutes) for ticketing. This approach, recommended by Google's SRE workbook, catches both sudden outages and gradual degradation while minimizing false positives.
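The arithmetic behind burn rate thresholds is simple to verify: burning at N times the allowed rate for a window of W hours consumes N × W / P of the budget, where P is the SLO period in hours. A small sketch (the helper is ours, not a Google API):

```python
def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the error budget consumed by burning at `burn_rate`
    times the allowed rate for `window_hours` on a rolling SLO period."""
    return burn_rate * window_hours / (period_days * 24)

# Fast-burn page: 14.4x sustained for 1 hour eats 2% of a 30-day budget.
print(f"{budget_consumed(14.4, 1):.1%}")       # 2.0%
# Burning at exactly 1x for the whole period exhausts the budget.
print(f"{budget_consumed(1.0, 30 * 24):.0%}")  # 100%
```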

Alerting Policies & Notification Channels

Alerting policies define conditions under which Cloud Monitoring sends notifications. Each policy consists of one or more conditions (metric thresholds, absence of data, or MQL queries), a combiner (AND/OR logic for multiple conditions), notification channels (where alerts are sent), and documentation (runbook information included in alert notifications). Policies can be scoped to specific resources or apply broadly across resource types.

Cloud Monitoring supports a wide range of notification channels:

| Channel Type | Use Case | Latency |
| --- | --- | --- |
| Email | Low-priority notifications, record keeping | Minutes |
| SMS | On-call paging for critical alerts | Seconds |
| PagerDuty | Incident management integration | Seconds |
| Slack | Team channel notifications | Seconds |
| Webhooks | Custom integrations, ChatOps | Seconds |
| Pub/Sub | Programmatic response, automation | Seconds |
| Mobile App | Push notifications to Google Cloud app | Seconds |
Create alerting policies with gcloud
# Create a notification channel for PagerDuty
gcloud beta monitoring channels create \
  --display-name="Production PagerDuty" \
  --type=pagerduty \
  --channel-labels=service_key=YOUR_PAGERDUTY_SERVICE_KEY

# Create a notification channel for Slack
gcloud beta monitoring channels create \
  --display-name="Alerts Slack Channel" \
  --type=slack \
  --channel-labels=channel_name="#production-alerts",auth_token=xoxb-...

# List notification channels to get their IDs
gcloud beta monitoring channels list --format="table(name, displayName, type)"

# Create an alerting policy for high CPU utilization
gcloud beta monitoring policies create \
  --display-name="High CPU Utilization" \
  --condition-display-name="CPU > 80% for 5 minutes" \
  --condition-filter='resource.type = "gce_instance"
    AND metric.type = "compute.googleapis.com/instance/cpu/utilization"' \
  --condition-threshold-value=0.8 \
  --condition-threshold-comparison=COMPARISON_GT \
  --condition-threshold-duration=300s \
  --condition-threshold-aggregations-aligner=ALIGN_MEAN \
  --condition-threshold-aggregations-period=60s \
  --notification-channels=projects/my-project/notificationChannels/CHANNEL_ID \
  --documentation-content="Runbook: https://wiki.example.com/high-cpu" \
  --documentation-mime-type="text/markdown"
alerting-policy.tf - Terraform alerting with MQL
resource "google_monitoring_alert_policy" "error_rate" {
  display_name = "Cloud Run Error Rate > 1%"
  combiner     = "OR"

  conditions {
    display_name = "Error rate exceeds 1%"

    condition_monitoring_query_language {
      query = <<-MQL
        fetch cloud_run_revision
        | metric 'run.googleapis.com/request_count'
        | filter resource.service_name == 'my-api'
        | align rate(1m)
        | group_by [resource.service_name],
            [total: sum(val()),
             errors: sum(val()) {response_code_class = '5xx'}]
        | value [error_rate: errors / total * 100]
        | condition error_rate > 1
      MQL
      duration = "300s"

      trigger {
        count = 1
      }
    }
  }

  notification_channels = [
    google_monitoring_notification_channel.pagerduty.name,
    google_monitoring_notification_channel.slack.name,
  ]

  alert_strategy {
    auto_close = "1800s"

    notification_rate_limit {
      period = "300s"
    }
  }

  documentation {
    content   = <<-DOC
      ## Cloud Run Error Rate Alert

      **Service:** my-api
      **Threshold:** Error rate > 1% for 5 minutes

      ### Investigation Steps
      1. Check Cloud Run logs: `gcloud run services logs read my-api --limit=50`
      2. Check recent deployments: `gcloud run revisions list --service=my-api`
      3. Roll back if needed: `gcloud run services update-traffic my-api --to-revisions=PREVIOUS_REVISION=100`
    DOC
    mime_type = "text/markdown"
  }
}

resource "google_monitoring_notification_channel" "pagerduty" {
  display_name = "Production PagerDuty"
  type         = "pagerduty"

  labels = {
    service_key = var.pagerduty_service_key
  }
}

resource "google_monitoring_notification_channel" "slack" {
  display_name = "Production Slack"
  type         = "slack"

  labels = {
    channel_name = "#production-alerts"
  }

  sensitive_labels {
    auth_token = var.slack_auth_token
  }
}

Cloud Trace & Profiler

Cloud Trace is a distributed tracing system that collects latency data from your applications to help you understand how requests flow through your microservices architecture. Each trace represents a single user request and consists of multiple spans: individual operations within the request lifecycle, such as database queries, RPC calls, and queue operations. Trace data reveals which services are bottlenecks, which external calls add the most latency, and how performance varies across different request paths.

Cloud Run, App Engine, and Cloud Functions automatically generate traces. For other services, you instrument your code using OpenTelemetry or the Cloud Trace client libraries. Cloud Trace is compatible with OpenTelemetry, Zipkin, and the legacy Cloud Trace SDK. The recommended approach is OpenTelemetry, which provides vendor-neutral instrumentation.

OpenTelemetry tracing setup for a Node.js service
// tracing.js - Initialize before any other imports
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { TraceExporter } = require('@google-cloud/opentelemetry-cloud-trace-exporter');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '2.1.0',
  }),
  traceExporter: new TraceExporter({
    projectId: 'my-project',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument HTTP, Express, gRPC, pg, redis, etc.
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/ready'],
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enhancedDatabaseReporting: true,
      },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry tracing initialized');

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => console.log('Tracing shut down'));
});

Cloud Profiler

Cloud Profiler continuously gathers CPU, heap, and thread usage data from production applications with minimal overhead (typically less than 0.5% CPU impact). Unlike traditional profiling that requires stopping your application, Cloud Profiler uses statistical sampling to build a profile over time. It supports Go, Java, Node.js, and Python, and integrates with Cloud Trace to correlate slow traces with profiling data, helping you identify the exact code path causing latency issues.
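Enabling the Python agent is a one-time call at startup. The sketch below follows the pattern from the googlecloudprofiler package documentation; the service name and version are placeholders, and the call is wrapped so the application still runs where profiling is unavailable (e.g., local development):

```python
def start_profiler(service: str, version: str) -> bool:
    """Start the Cloud Profiler agent; return False if profiling is unavailable."""
    try:
        import googlecloudprofiler
        googlecloudprofiler.start(
            service=service,
            service_version=version,
            verbose=1,  # 0=error, 1=warning, 2=info, 3=debug
        )
        return True
    except ImportError:
        return False  # agent package not installed; run without profiling
    except (ValueError, NotImplementedError):
        return False  # unsupported environment (e.g., local dev without credentials)

started = start_profiler("payment-api", "2.1.0")
```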

Correlating Traces with Logs

For end-to-end observability, correlate your traces with logs by including the trace context in your log entries. In Cloud Run and GKE, add the logging.googleapis.com/trace field to your structured log output with the format projects/PROJECT_ID/traces/TRACE_ID. This lets you click from a trace in Cloud Trace directly to the corresponding log entries in Cloud Logging, and vice versa. The Logs Explorer will display a "View in Cloud Trace" link when trace context is present.
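When the trace context arrives via the X-Cloud-Trace-Context header (format TRACE_ID/SPAN_ID;o=OPTIONS, with SPAN_ID in decimal), you can derive both special fields yourself. A minimal sketch (the helper name is ours):

```python
import json

def trace_fields(header: str, project_id: str) -> dict:
    """Build the Cloud Logging special fields from an
    X-Cloud-Trace-Context header ("TRACE_ID/SPAN_ID;o=1")."""
    trace_id, _, rest = header.partition("/")
    span_decimal = rest.split(";")[0]
    fields = {
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
    }
    if span_decimal.isdigit():
        # LogEntry expects the span ID as 16-character zero-padded hex,
        # while the header carries it in decimal.
        fields["logging.googleapis.com/spanId"] = format(int(span_decimal), "016x")
    return fields

entry = {"severity": "INFO", "message": "order created"}
entry.update(trace_fields("abc123def456/255;o=1", "my-project"))
print(json.dumps(entry))
```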

Error Reporting

Error Reporting automatically analyzes exceptions and stack traces in your log data, groups them by root cause, and presents a prioritized list of errors with occurrence counts, affected user counts, and first/last seen timestamps. It works out of the box with Cloud Functions, Cloud Run, App Engine, and GKE. Any exception written to Cloud Logging with a recognized stack trace format is automatically detected.

Error Reporting is particularly powerful for the following scenarios:

  • New error detection: Receive immediate notifications when a never-before-seen error appears in production, enabling rapid response to regression bugs.
  • Error trending: Track error frequency over time to see if a fix actually resolved the problem or if error rates are creeping back up.
  • Deployment correlation: Overlay error occurrence timelines with deployment events to quickly identify which release introduced a regression.
  • Resolution tracking: Mark errors as resolved or muted, and get re-notified if they recur.
Report errors explicitly from application code
from google.cloud import error_reporting

# Initialize the Error Reporting client
client = error_reporting.Client(project="my-project", service="payment-api")

def process_payment(payment_data):
    try:
        result = charge_card(payment_data)
        return result
    except PaymentGatewayError as e:
        # Report the error with additional context
        client.report_exception(
            http_context=error_reporting.HTTPContext(
                method="POST",
                url="/api/v1/payments",
                user_agent=payment_data.get("user_agent"),
                response_status_code=500,
            ),
            user=payment_data.get("user_id"),
        )
        raise
    except Exception as e:
        # Report unexpected errors
        client.report(str(e))
        raise

# For Cloud Run / GKE, writing a proper stack trace to stderr
# is automatically captured by Error Reporting:
import traceback
import json
import sys

def handle_error(e):
    error_log = {
        "severity": "ERROR",
        "message": str(e),
        "stack_trace": traceback.format_exc(),
        "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent",
        "serviceContext": {
            "service": "payment-api",
            "version": "2.1.0"
        }
    }
    print(json.dumps(error_log), file=sys.stderr)

Dashboard Design

Effective dashboards tell a story about system health at a glance. Cloud Monitoring dashboards support a rich set of widgets including line charts, stacked area charts, bar charts, gauges, scorecards, heatmaps, log panels, alerting policy summaries, and text annotations. Dashboards can be created interactively through the Cloud Console or defined as JSON/Terraform for version-controlled, reproducible infrastructure.

A common convention, complementing the four golden signals from Google's SRE book, is the USE method (Utilization, Saturation, Errors) for infrastructure and the RED method (Rate, Errors, Duration) for services. A well-designed dashboard hierarchy typically includes:

  • Executive overview: High-level SLO status, error budgets, and key business metrics across all services.
  • Service dashboards: Per-service RED metrics with request rate, error rate, and latency percentiles (p50, p95, p99).
  • Infrastructure dashboards: VM CPU, memory, disk, and network utilization for Compute Engine instances and GKE nodes.
  • Debug dashboards: Detailed breakdowns for troubleshooting, including per-endpoint latency, database query performance, and cache hit rates.
dashboard.tf - Terraform dashboard with MQL widgets
resource "google_monitoring_dashboard" "service_overview" {
  dashboard_json = jsonencode({
    displayName = "API Service Overview"
    mosaicLayout = {
      columns = 12
      tiles = [
        {
          width  = 6
          height = 4
          widget = {
            title = "Request Rate (QPS)"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesQueryLanguage = <<-MQL
                    fetch cloud_run_revision
                    | metric 'run.googleapis.com/request_count'
                    | filter resource.service_name == 'my-api'
                    | align rate(1m)
                    | group_by [], [qps: sum(val())]
                  MQL
                }
                plotType = "LINE"
              }]
              yAxis = { label = "requests/sec" }
            }
          }
        },
        {
          xPos   = 6
          width  = 6
          height = 4
          widget = {
            title = "Error Rate (%)"
            xyChart = {
              dataSets = [{
                timeSeriesQuery = {
                  timeSeriesQueryLanguage = <<-MQL
                    fetch cloud_run_revision
                    | metric 'run.googleapis.com/request_count'
                    | filter resource.service_name == 'my-api'
                    | align rate(1m)
                    | filter_ratio metric.response_code_class == '5xx'
                    | value [error_pct: val() * 100]
                  MQL
                }
                plotType = "LINE"
              }]
              yAxis   = { label = "%" }
              thresholds = [{
                value        = 1
                color        = "RED"
                direction    = "ABOVE"
                targetAxis   = "Y1"
              }]
            }
          }
        },
        {
          yPos   = 4
          width  = 12
          height = 4
          widget = {
            title = "Request Latency (p50, p95, p99)"
            xyChart = {
              dataSets = [
                {
                  timeSeriesQuery = {
                    timeSeriesQueryLanguage = <<-MQL
                      fetch cloud_run_revision
                      | metric 'run.googleapis.com/request_latencies'
                      | filter resource.service_name == 'my-api'
                      | align delta(1m)
                      | group_by [], [p50: percentile(val(), 50)]
                    MQL
                  }
                  plotType = "LINE"
                },
                {
                  timeSeriesQuery = {
                    timeSeriesQueryLanguage = <<-MQL
                      fetch cloud_run_revision
                      | metric 'run.googleapis.com/request_latencies'
                      | filter resource.service_name == 'my-api'
                      | align delta(1m)
                      | group_by [], [p95: percentile(val(), 95)]
                    MQL
                  }
                  plotType = "LINE"
                },
                {
                  timeSeriesQuery = {
                    timeSeriesQueryLanguage = <<-MQL
                      fetch cloud_run_revision
                      | metric 'run.googleapis.com/request_latencies'
                      | filter resource.service_name == 'my-api'
                      | align delta(1m)
                      | group_by [], [p99: percentile(val(), 99)]
                    MQL
                  }
                  plotType = "LINE"
                }
              ]
              yAxis = { label = "ms" }
            }
          }
        }
      ]
    }
  })
}
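If you manage dashboards outside Terraform, the same JSON layout can be round-tripped with gcloud; a sketch (the file names are assumptions):

```shell
# Sketch: create a dashboard from a version-controlled JSON definition.
# dashboard.json is assumed to hold the same mosaicLayout shown above.
gcloud monitoring dashboards create --config-from-file=dashboard.json

# Export existing dashboards to JSON so hand-built ones can be captured
# in version control and later promoted to Terraform.
gcloud monitoring dashboards list --format=json > dashboards-backup.json
```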

Cost Optimization & Best Practices

Cloud Logging and Monitoring costs can escalate quickly if left unmanaged. Logging charges are primarily based on ingestion volume ($0.50/GiB after the first 50 GiB/month free tier), while Monitoring charges are based on the volume of samples ingested for custom and log-based metrics. Understanding these cost drivers is essential for maintaining a cost-effective observability strategy.

The most impactful cost optimization strategies include:

  • Exclusion filters: Add exclusion filters to the _Default sink to prevent high-volume, low-value logs from being ingested. Common candidates include debug logs, health check access logs, and verbose GKE system component logs.
  • Sampling: For extremely high-volume services, sample logs at ingestion time rather than sending every entry. A 10% sample of 1 billion log entries is still 100 million entries, which is more than sufficient for trend analysis and debugging.
  • Retention tuning: Configure shorter retention periods for non-critical log buckets. Not every log needs 30 days of online retention. Use Cloud Storage sinks for long-term archival at a fraction of the cost.
  • Metric cardinality control: Avoid creating custom metrics with high-cardinality labels (such as user IDs or session tokens). Each unique label combination creates a separate time-series, and costs scale linearly with the number of active time-series.
  • Log-based metric cleanup: Regularly audit log-based metrics. Each log-based metric evaluates every matching log entry, adding processing overhead. Remove metrics that are no longer used in dashboards or alerting policies.
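The sampling strategy above can be implemented directly in an exclusion filter using Cloud Logging's built-in sample() function; a sketch (the service name is hypothetical):

```shell
# Sketch: exclude ~90% of INFO-level entries from a chatty Cloud Run service,
# keeping a 10% sample for trend analysis. sample() hashes insertId
# deterministically, so a given entry is always either kept or dropped.
gcloud logging sinks update _Default \
  --add-exclusion='name=sample-chatty-info,
    filter=resource.type="cloud_run_revision"
    AND resource.labels.service_name="chatty-worker"
    AND severity=INFO
    AND sample(insertId, 0.9)'
```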

Monitor Your Monitoring Costs

Create an alerting policy on the logging.googleapis.com/billing/bytes_ingested metric to get notified when log ingestion exceeds your budget threshold. It is surprisingly common for a single misconfigured application to generate terabytes of debug logs in a matter of hours, resulting in thousands of dollars in unexpected charges. Set a budget alert at 80% of your expected monthly ingestion to catch spikes before they become expensive.
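That budget alert can be sketched with gcloud. The 1 TiB monthly budget and the policy file name below are assumptions; the month-to-date companion metric monthly_bytes_ingested maps more directly onto a monthly budget than the per-interval bytes_ingested:

```shell
# Sketch: fire when month-to-date log ingestion passes 80% of a 1 TiB budget
# (0.8 * 1099511627776 bytes = 879609302220.8).
cat > ingestion-policy.json <<'EOF'
{
  "displayName": "Log ingestion budget (80% of 1 TiB/month)",
  "combiner": "OR",
  "conditions": [{
    "displayName": "monthly_bytes_ingested above budget threshold",
    "conditionThreshold": {
      "filter": "metric.type = \"logging.googleapis.com/billing/monthly_bytes_ingested\"",
      "aggregations": [{
        "alignmentPeriod": "3600s",
        "perSeriesAligner": "ALIGN_MAX"
      }],
      "comparison": "COMPARISON_GT",
      "thresholdValue": 879609302220.8,
      "duration": "0s"
    }
  }]
}
EOF

# Add a notificationChannels list to the JSON to route the alert, then:
gcloud alpha monitoring policies create --policy-from-file=ingestion-policy.json
```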

Cost optimization exclusion filters
# Exclude health check logs from the _Default sink (often 50%+ of log volume)
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-health-checks,
    filter=httpRequest.requestUrl:"/health" OR httpRequest.requestUrl:"/ready" OR httpRequest.requestUrl:"/healthz"'

# Exclude GKE system component verbose logs
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-gke-system,
    filter=resource.type="k8s_container"
    AND resource.labels.namespace_name="kube-system"
    AND severity < WARNING'

# Exclude Cloud Load Balancer 2xx access logs (keep errors only)
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-lb-success,
    filter=resource.type="http_load_balancer"
    AND httpRequest.status < 400'

# View current ingestion volume by log name
gcloud logging read "timestamp >= \"$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)\"" \
  --format="value(logName)" | sort | uniq -c | sort -rn | head -20

# Check bytes ingested per log source
bq query --use_legacy_sql=false '
SELECT
  log_id,
  ROUND(SUM(BYTE_LENGTH(TO_JSON_STRING(json_payload))) / 1073741824, 2) AS gib_ingested,
  COUNT(*) AS entry_count
FROM `my-project.global._Default._AllLogs`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY log_id
ORDER BY gib_ingested DESC
LIMIT 20'

Observability Maturity Model

As your GCP deployment grows, your observability strategy should mature from basic reactive monitoring to proactive, SLO-driven operations. A practical maturity model progresses through these stages:

Stage | Capabilities | Key Actions
1. Foundations | Basic logging, uptime checks, default dashboards | Enable Ops Agent, create uptime checks, set up notification channels
2. Structured | Structured logging, custom metrics, log sinks | Implement structured logging, create log-based metrics, export to BigQuery
3. Integrated | Distributed tracing, correlated logs/traces, service dashboards | Deploy OpenTelemetry, correlate traces with logs, build RED dashboards
4. SLO-Driven | SLOs, error budgets, burn rate alerts, capacity planning | Define SLOs, implement error budget policies, automate incident response
5. Predictive | Anomaly detection, AIOps, automated remediation | Enable anomaly detection, build auto-remediation pipelines, use ML-based alerts

The key to successful observability is starting with clear objectives. Before instrumenting a single metric or writing a single log line, ask: "What questions do I need to answer when something goes wrong?" This question-driven approach ensures that every piece of telemetry you collect serves a purpose, keeping costs manageable while providing maximum diagnostic value. Remember that observability is not about collecting everything; it is about collecting the right things and making them actionable.

Key Takeaways

  1. Cloud Logging provides centralized log collection with the Log Router for routing and filtering.
  2. Cloud Monitoring offers metrics, dashboards, uptime checks, and SLO monitoring.
  3. Log-based metrics convert log patterns into custom metrics for alerting and dashboards.
  4. Cloud Trace provides distributed tracing with automatic instrumentation for GCP services.
  5. Alerting policies with notification channels automate incident detection and response.
  6. Log sinks route logs to Cloud Storage, BigQuery, or Pub/Sub for long-term analysis and compliance.

Frequently Asked Questions

How is Cloud Logging priced?
The first 50 GiB of log ingestion per project per month is free. Beyond that, ingestion costs $0.50/GiB. Log storage beyond the default 30-day retention costs $0.01/GiB/month. You can reduce costs by excluding verbose logs via log exclusion filters in the Log Router.
What is the Log Router?
The Log Router is the central component that processes every log entry written to Cloud Logging. It evaluates inclusion/exclusion filters and routes logs to destinations (log buckets, Cloud Storage, BigQuery, Pub/Sub). You can create sinks to export specific logs for long-term retention or analysis.
How do uptime checks work?
Uptime checks send HTTP, HTTPS, or TCP probes to your endpoints from globally distributed checkers at a configurable interval (1, 5, 10, or 15 minutes). If a check fails from multiple locations, it can trigger an alerting policy. Uptime checks can validate response codes, content, and TLS certificates.
Can I use Cloud Monitoring with non-GCP resources?
Yes. The Ops Agent (the successor to the legacy Monitoring and Logging agents) can be installed on any supported Linux or Windows server, including on-premises machines and other cloud providers. You can also use the Monitoring API to write custom metrics from any source.
What is the difference between Cloud Monitoring and Cloud Trace?
Cloud Monitoring collects metrics and manages alerts for infrastructure and application health. Cloud Trace specifically tracks request latency across distributed services, showing where time is spent in each service hop. They complement each other for full observability.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.