OCI Monitoring & Alarms Guide
Monitor OCI resources with metrics, MQL queries, custom metrics, alarm configurations, dashboards, and production monitoring strategies.
Prerequisites
- OCI account with monitoring permissions
- Basic understanding of cloud monitoring concepts
Introduction to OCI Monitoring
Oracle Cloud Infrastructure Monitoring is a fully managed service that collects, aggregates, and analyzes metrics from OCI resources and custom applications. It provides real-time visibility into the health and performance of your infrastructure through metrics, dashboards, and alarms. The Monitoring service is integrated with every OCI service, meaning you get baseline metrics for compute, networking, storage, databases, and more without installing any agents.
OCI Monitoring is built around the Monitoring Query Language (MQL), a powerful syntax for querying, filtering, aggregating, and transforming metric data. MQL enables you to write sophisticated queries that combine multiple metrics, apply statistical functions, and create derived metrics for complex monitoring scenarios.
The service integrates tightly with OCI Notifications (ONS) for alarm delivery, OCI Functions for automated remediation, and OCI Logging for correlated troubleshooting. This guide covers metric collection, MQL query syntax, alarm configuration, custom metrics, dashboard creation, and production monitoring strategies.
Monitoring Service Pricing
OCI Monitoring includes 500 million ingested data points per month and 1 billion retrieved data points per month at no additional cost. Custom metric ingestion beyond the free tier is billed by data point volume; check the OCI price list for current rates. Alarms are free to create and evaluate. For many workloads, this allocation is enough to stay entirely within the free allowance.
Understanding OCI Metrics
Metrics are the foundation of OCI Monitoring. A metric is a time-series data point that represents a measurement of a resource's behavior at a specific point in time. Each metric has a namespace, name, dimensions, and a value.
Namespace: Groups related metrics by the OCI service that emits them. For example, oci_computeagent for compute instance metrics, oci_blockstore for block volume metrics, and oci_lbaas for load balancer metrics.
Metric Name: Identifies the specific measurement, such as CpuUtilization, MemoryUtilization, or NetworksBytesIn.
Dimensions: Key-value pairs that provide additional context for filtering, such as resourceId, availabilityDomain, or faultDomain. Dimensions allow you to drill down into specific resources or aggregate across groups.
Resolution: Metrics are collected at intervals specific to each service. Compute metrics have a 1-minute resolution, while some services use 5-minute intervals. Raw metrics are retained for 7 days, and aggregated (1-hour) metrics are retained for 90 days.
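The retention figures above suggest a simple rule of thumb for choosing a query interval. The sketch below is a hypothetical helper (pick_interval is not part of the OCI CLI) that encodes only the two tiers stated in this guide; the service may apply further per-interval limits, so treat it as an illustration rather than a specification.

```shell
# pick_interval <lookback-days>
# Hypothetical helper: prints the finest interval that this guide's retention
# figures allow for a lookback window (1m up to 7 days, 1h up to 90 days).
pick_interval() {
  if [ "$1" -le 7 ]; then
    echo "1m"
  elif [ "$1" -le 90 ]; then
    echo "1h"
  else
    echo "lookback of $1 days exceeds the 90-day retention" >&2
    return 1
  fi
}

pick_interval 3    # finest interval for a 3-day lookback
pick_interval 30   # finest interval for a 30-day lookback
```

The returned value can be dropped into the square brackets of an MQL query when exporting historical data.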
# List metric namespaces available in your compartment
oci monitoring metric list \
--compartment-id $C \
--query 'data[].namespace' \
--output table | sort -u
# List metrics within a namespace
oci monitoring metric list \
--compartment-id $C \
--namespace "oci_computeagent" \
--query 'data[].{name:name, namespace:namespace}' \
--output table
# Common metric namespaces:
# oci_computeagent - Compute instance metrics
# oci_blockstore - Block volume metrics
# oci_vcn - VNIC (virtual network interface) metrics
# oci_lbaas - Load balancer metrics
# oci_objectstorage - Object storage metrics
# oci_autonomous_database - Autonomous Database metrics
# oci_oke - Kubernetes Engine metrics
# oci_faas - Functions metrics
# oci_notification - Notifications metrics
Monitoring Query Language (MQL)
MQL is the query language used to retrieve and transform metric data in OCI Monitoring. It follows a structured syntax that specifies the metric name, time interval, optional dimension filters, and an aggregation function. Understanding MQL is essential for creating effective alarms and dashboards.
The basic MQL syntax is: MetricName[interval]{dimensionFilters}.statistic()
Interval: The time window for aggregation, specified in minutes (m), hours (h), or days (d). For example, [5m] aggregates data over 5-minute windows.
Dimension Filters: Optional filters that narrow the query to specific resources or groups. Specified inside curly braces with the syntax {dimension = "value"}.
Statistics: Aggregation functions applied to the data points within each interval. Common statistics include mean(), max(), min(), sum(), count(), rate(), and percentile(0.95).
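To make the three parts of the syntax concrete, here is a minimal sketch of a shell helper that assembles a query string from a metric name, interval, optional dimension filter, and statistic. The function name mql_query is an assumption for illustration and is not provided by the OCI CLI.

```shell
# mql_query <metric> <interval> <dimension-filter-or-empty> <statistic>
# Hypothetical helper: prints an MQL expression built from its parts.
mql_query() {
  if [ -n "$3" ]; then
    printf '%s[%s]{%s}.%s\n' "$1" "$2" "$3" "$4"
  else
    # No dimension filter: omit the curly braces entirely.
    printf '%s[%s].%s\n' "$1" "$2" "$4"
  fi
}

mql_query CpuUtilization 5m 'resourceId = "<instance-ocid>"' 'mean()'
mql_query CpuUtilization 5m '' 'percentile(0.95)'
```

The printed string can be passed to --query-text in the commands that follow.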
# Query CPU utilization for a specific instance (last hour)
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[5m]{resourceId = "<instance-ocid>"}.mean()'
# Query average CPU across all instances in a compartment
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[5m].mean()'
# Query memory utilization (requires monitoring agent)
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'MemoryUtilization[5m]{resourceId = "<instance-ocid>"}.mean()'
# Query 95th percentile response time for a load balancer
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_lbaas" \
--query-text 'BackendTimeFirstByte[5m]{resourceId = "<lb-ocid>"}.percentile(0.95)'
# Query block volume IOPS rate
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_blockstore" \
--query-text 'VolumeReadOps[1m]{resourceId = "<volume-ocid>"}.rate()'
# Query network ingress bytes for a VNIC
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_vcn" \
--query-text 'VnicFromNetworkBytes[5m]{resourceId = "<vnic-ocid>"}.sum()'
# Combine metrics with arithmetic (custom expression)
# Total IOPS = ReadOps + WriteOps
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_blockstore" \
--query-text 'VolumeReadOps[5m]{resourceId = "<vol-ocid>"}.rate() + VolumeWriteOps[5m]{resourceId = "<vol-ocid>"}.rate()'
Use Grouping for Fleet Monitoring
MQL supports the groupBy() function to aggregate metrics across dimensions. For example, CpuUtilization[5m].groupBy(availabilityDomain).mean() returns average CPU utilization broken down by availability domain. This is powerful for fleet-wide monitoring where you need to compare performance across groups of resources without querying each one individually.
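As a local illustration of what a grouped mean computes, the sketch below averages made-up per-instance CPU readings by availability domain with awk. The sample values and the function names are assumptions for demonstration only; nothing here calls the Monitoring service.

```shell
# Illustrative sample: "<availabilityDomain> <CpuUtilization>" per instance.
sample_points() {
  printf '%s\n' 'AD-1 40' 'AD-1 60' 'AD-2 90' 'AD-2 70' 'AD-3 20'
}

# group_mean: reads "<group> <value>" lines and prints the mean per group,
# analogous in spirit to .groupBy(availabilityDomain).mean().
group_mean() {
  awk '{ sum[$1] += $2; n[$1]++ }
       END { for (g in sum) printf "%s %.1f\n", g, sum[g] / n[g] }' | sort
}

sample_points | group_mean
```

Each output line corresponds to one stream that a grouped query would return, one per distinct dimension value.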
Configuring Alarms
Alarms evaluate MQL queries at regular intervals and trigger notifications when the query's trigger rule is met, for example when a metric exceeds a defined threshold. An alarm reports one of three statuses: OK (the trigger rule is not met), FIRING (the trigger rule has been met for the pending duration), and SUSPENDED (the alarm is suppressed). RESET is not a status but one of the notification message types, alongside OK_TO_FIRING, FIRING_TO_OK, and REPEAT.
Each alarm is associated with one or more notification destinations (ONS topics) that receive alerts when the alarm state changes. You can configure different destinations for different severity levels (CRITICAL, ERROR, WARNING, INFO) and set repeat notification intervals for ongoing issues.
# Create an alarm for high CPU utilization
oci monitoring alarm create \
--compartment-id $C \
--display-name "high-cpu-warning" \
--metric-compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[5m].mean() > 80' \
--severity "WARNING" \
--destinations '["<ops-topic-ocid>"]' \
--is-enabled true \
--body "Average CPU utilization exceeds 80% for 5 minutes" \
--pending-duration "PT5M" \
--repeat-notification-duration "PT15M"
# Create a critical alarm for CPU > 95%
oci monitoring alarm create \
--compartment-id $C \
--display-name "high-cpu-critical" \
--metric-compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[5m].mean() > 95' \
--severity "CRITICAL" \
--destinations '["<critical-topic-ocid>"]' \
--is-enabled true \
--body "CRITICAL: CPU utilization exceeds 95%" \
--pending-duration "PT3M" \
--repeat-notification-duration "PT5M"
# Create a disk space alarm
oci monitoring alarm create \
--compartment-id $C \
--display-name "low-disk-space" \
--metric-compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'DiskBytesUsed[5m].mean() / DiskBytesTotal[5m].mean() * 100 > 85' \
--severity "WARNING" \
--destinations '["<ops-topic-ocid>"]' \
--is-enabled true \
--body "Disk utilization exceeds 85%"
# Create an alarm for unhealthy load balancer backends
oci monitoring alarm create \
--compartment-id $C \
--display-name "lb-unhealthy-backends" \
--metric-compartment-id $C \
--namespace "oci_lbaas" \
--query-text 'UnHealthyBackendServers[1m]{resourceId = "<lb-ocid>"}.max() > 0' \
--severity "CRITICAL" \
--destinations '["<critical-topic-ocid>"]' \
--is-enabled true \
--body "One or more load balancer backends are unhealthy"
# List all alarms
oci monitoring alarm list \
--compartment-id $C \
--query 'data[].{"display-name":"display-name", severity:severity, "lifecycle-state":"lifecycle-state"}' \
--output table
# Get alarm status (current state)
oci monitoring alarm-status list-alarms-status \
--compartment-id $C \
--query 'data[].{"display-name":"display-name", status:status, severity:severity}' \
--output table
Avoid Alarm Fatigue
One of the biggest operational risks is alarm fatigue, where too many non-actionable alerts cause teams to ignore all alerts. Only create alarms for conditions that require human action. Use pending-duration to suppress transient spikes (e.g., require 5 minutes of sustained high CPU before alerting). Use repeat-notification-duration to limit alert frequency for ongoing issues. Review and prune unused alarms quarterly.
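The effect of a pending duration can be seen with a small local simulation. The sketch below is a simplified model, assuming one reading per minute and a plain "greater than" threshold; first_firing is a hypothetical helper, and the service's actual evaluation timing may differ.

```shell
# first_firing <threshold> <pending-minutes>
# Reads one value per line (one per minute) and prints the 1-based minute at
# which the threshold has been breached for <pending-minutes> consecutive
# minutes, or "never" if that does not happen.
first_firing() {
  awk -v t="$1" -v p="$2" '
    { if ($1 > t) run++; else run = 0 }
    run >= p && !done { print NR; done = 1 }
    END { if (!done) print "never" }'
}

# A two-minute spike with a 5-minute pending duration.
printf '%s\n' 50 95 96 50 50 50 50 50 | first_firing 80 5
# A breach sustained for five minutes.
printf '%s\n' 50 85 90 92 88 91 60 | first_firing 80 5
```

Under this model the short spike is suppressed while the sustained breach fires, which is the behavior you want from a pending duration.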
Custom Metrics
While OCI automatically collects infrastructure metrics, you can publish custom application metrics to the Monitoring service for a unified monitoring experience. Custom metrics use the same MQL query language and alarm system as built-in metrics, enabling consistent monitoring across infrastructure and application layers.
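Custom metrics are published using the PostMetricData API, which is served from the telemetry-ingestion endpoint rather than the standard telemetry endpoint, so CLI calls may need an explicit --endpoint value for your region. Each data point includes a namespace (use a custom namespace to avoid conflicts with OCI namespaces), metric name, dimensions, timestamp, value, and count. You can batch multiple data points in a single API call for efficiency.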
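Hard-coded timestamps go stale quickly, so it can help to generate the request body at publish time. The following is a minimal sketch of such a generator; metric_payload is a hypothetical helper, and the field layout simply mirrors the fields described above.

```shell
# metric_payload <namespace> <compartment-ocid> <metric-name> \
#                <dimensions-json> <value> [timestamp]
# Hypothetical helper: prints a one-datapoint PostMetricData body. The
# timestamp defaults to the current UTC time in RFC 3339 format.
metric_payload() (
  ts="${6:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  printf '[{"namespace":"%s","compartmentId":"%s","name":"%s","dimensions":%s,' \
    "$1" "$2" "$3" "$4"
  printf '"datapoints":[{"timestamp":"%s","value":%s,"count":1}]}]\n' "$ts" "$5"
)

metric_payload custom_app "<compartment-ocid>" OrdersProcessed \
  '{"service":"order-api"}' 150

# Possible use with the CLI (not run here):
# oci monitoring metric-data post \
#   --metric-data "$(metric_payload custom_app "$C" OrdersProcessed '{"service":"order-api"}' 150)"
```

The helper does no JSON escaping, so it is only suitable for simple names and values like the ones shown.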
# Post custom metric data
oci monitoring metric-data post \
--metric-data '[{
"namespace": "custom_app",
"compartmentId": "<compartment-ocid>",
"name": "OrdersProcessed",
"dimensions": {"service": "order-api", "environment": "production"},
"datapoints": [{
"timestamp": "2026-03-14T10:00:00.000Z",
"value": 150,
"count": 1
}]
}]'
# Post multiple metrics in a batch
oci monitoring metric-data post \
--metric-data '[{
"namespace": "custom_app",
"compartmentId": "<compartment-ocid>",
"name": "ResponseTimeMs",
"dimensions": {"endpoint": "/api/orders", "method": "POST"},
"datapoints": [{
"timestamp": "2026-03-14T10:00:00.000Z",
"value": 245.5,
"count": 100
}]
}, {
"namespace": "custom_app",
"compartmentId": "<compartment-ocid>",
"name": "ErrorCount",
"dimensions": {"endpoint": "/api/orders", "errorType": "timeout"},
"datapoints": [{
"timestamp": "2026-03-14T10:00:00.000Z",
"value": 3,
"count": 1
}]
}]'
# Query custom metrics
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "custom_app" \
--query-text 'OrdersProcessed[5m]{service = "order-api"}.sum()'
# Create an alarm on custom metrics
oci monitoring alarm create \
--compartment-id $C \
--display-name "high-error-rate" \
--metric-compartment-id $C \
--namespace "custom_app" \
--query-text 'ErrorCount[5m]{errorType = "timeout"}.sum() > 10' \
--severity "WARNING" \
--destinations '["<ops-topic-ocid>"]' \
--is-enabled true \
--body "Application timeout errors exceed threshold"
OCI Monitoring Agent
The OCI Monitoring Agent (the Compute Instance Monitoring plugin of Oracle Cloud Agent) collects metrics from within your compute instances and publishes them to the oci_computeagent namespace. If the plugin is disabled or not running, instance-level metrics such as CPU and memory utilization are not reported for that instance. With the plugin running, you get CPU, memory, disk I/O, and network metrics for each instance.
The monitoring agent is available as an Oracle Cloud Agent plugin and can be enabled during instance creation or added to running instances. It requires a dynamic group policy that grants the instance permission to publish metrics to the Monitoring service.
# Enable the monitoring agent on an existing instance
oci compute instance update \
--instance-id <instance-ocid> \
--agent-config '{"pluginsConfig": [{"name": "Compute Instance Monitoring", "desiredState": "ENABLED"}]}'
# Create a dynamic group for instances that can publish metrics
oci iam dynamic-group create \
--compartment-id <tenancy-ocid> \
--name "monitoring-instances" \
--description "Instances that publish custom metrics" \
--matching-rule "ALL {instance.compartment.id = '<compartment-ocid>'}"
# Create a policy allowing the dynamic group to publish metrics
oci iam policy create \
--compartment-id $C \
--name "monitoring-publish-policy" \
--description "Allow instances to publish metrics" \
--statements '["Allow dynamic-group monitoring-instances to use metrics in compartment <compartment-name> where target.metrics.namespace = '\''custom_app'\''"]'
# Verify agent status on an instance
oci compute instance get \
--instance-id <instance-ocid> \
--query 'data."agent-config"."plugins-config"'
Dashboards and Visualization
OCI Console provides built-in metric dashboards for each service, but you can also create custom dashboards that combine metrics from multiple services into a single view. Custom dashboards are useful for application-specific monitoring views that combine infrastructure, application, and business metrics.
Dashboards support multiple widget types including line charts, area charts, bar charts, and single-value displays. Each widget contains one or more MQL queries, and you can configure the time range, refresh interval, and visual appearance.
For advanced visualization, you can integrate OCI Monitoring with external tools like Grafana using the OCI Monitoring data source plugin. This provides access to Grafana's rich visualization capabilities, alerting system, and dashboard sharing features.
# Grafana integration with OCI Monitoring:
# 1. Install the Oracle Cloud Infrastructure Monitoring data source plugin
# grafana-cli plugins install oci-metrics-datasource
# 2. Configure the data source in Grafana:
# - Type: Oracle Cloud Infrastructure Metrics
# - Tenancy OCID: ocid1.tenancy...
# - User OCID: ocid1.user...
# - Private Key: (API signing key)
# - Region: us-ashburn-1
# 3. Example Grafana MQL queries:
# CpuUtilization[1m]{resourceDisplayName =~ "web-server-*"}.mean()
# MemoryUtilization[1m]{availabilityDomain = "US-ASHBURN-AD-1"}.percentile(0.95)
# HttpRequests[1m]{resourceId = "<lb-ocid>"}.sum()
# Export metric data for external analysis
oci monitoring metric-data summarize-metrics-data \
--compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[1h].mean()' \
--start-time "2026-03-01T00:00:00Z" \
--end-time "2026-03-14T00:00:00Z" \
--output json > cpu_metrics.json
Production Monitoring Strategy
An effective monitoring strategy follows the principle of layered observability, starting with infrastructure metrics and building up through application and business metrics. Here is a recommended monitoring stack for production OCI environments:
Layer 1 - Infrastructure: Monitor CPU, memory, disk, and network for all compute instances. Create alarms for sustained high utilization (>80% for 5+ minutes) and critically low resources (<10% disk free). Use compartment-level aggregate queries for fleet-wide visibility.
Layer 2 - Platform Services: Monitor load balancer health, database performance (CPU, storage, sessions), and OKE cluster metrics (node readiness, pod counts, API server latency). These alarms should correlate with infrastructure issues.
Layer 3 - Application: Publish custom metrics for request rates, error rates, response times, and queue depths. Use the RED method (Rate, Errors, Duration) for service-level monitoring. Create SLO-based alarms that measure actual user experience.
Layer 4 - Business: Track business metrics like orders processed, revenue per minute, active users, and conversion rates. These metrics help detect issues that impact the business even when infrastructure appears healthy.
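For the application layer, the "Errors" part of the RED method usually reduces to an error-rate calculation compared against an SLO target. The sketch below shows that arithmetic locally; the function names and the 1% target are assumptions chosen for illustration.

```shell
# error_rate_pct <errors> <requests>: failed requests as a percentage.
error_rate_pct() {
  awk -v e="$1" -v r="$2" 'BEGIN {
    if (r == 0) { print "0.00" } else { printf "%.2f\n", 100 * e / r }
  }'
}

# slo_breached <errors> <requests> <max-error-pct>
# Succeeds (exit 0) when the error rate is above the allowed percentage.
slo_breached() {
  awk -v e="$1" -v r="$2" -v m="$3" 'BEGIN { exit !(r > 0 && 100 * e / r > m) }'
}

error_rate_pct 3 1000
if slo_breached 30 1000 1; then
  echo "error budget exceeded"
fi
```

The same ratio can be expressed in an alarm query by dividing an error-count metric by a request-count metric, provided both are published as custom metrics.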
# Essential production alarms checklist:
# 1. Compute - High CPU
oci monitoring alarm create \
--compartment-id $C \
--display-name "compute-high-cpu" \
--metric-compartment-id $C \
--namespace "oci_computeagent" \
--query-text 'CpuUtilization[5m].groupBy(resourceId).mean() > 85' \
--severity "WARNING" \
--destinations '["<ops-topic-ocid>"]' \
--is-enabled true \
--pending-duration "PT10M" \
--body "Compute instance CPU > 85% for 10 minutes"
# 2. Database - High CPU
oci monitoring alarm create \
--compartment-id $C \
--display-name "adb-high-cpu" \
--metric-compartment-id $C \
--namespace "oci_autonomous_database" \
--query-text 'CpuUtilization[5m]{resourceId = "<adb-ocid>"}.mean() > 80' \
--severity "WARNING" \
--destinations '["<ops-topic-ocid>"]' \
--is-enabled true \
--body "Autonomous Database CPU > 80%"
# 3. Load Balancer - 5xx Errors
oci monitoring alarm create \
--compartment-id $C \
--display-name "lb-5xx-errors" \
--metric-compartment-id $C \
--namespace "oci_lbaas" \
--query-text 'HttpResponses[5m]{statusCode = "5xx"}.sum() > 50' \
--severity "CRITICAL" \
--destinations '["<critical-topic-ocid>"]' \
--is-enabled true \
--body "Load balancer returning excessive 5xx errors"
# 4. OKE - Node Not Ready
oci monitoring alarm create \
--compartment-id $C \
--display-name "oke-node-not-ready" \
--metric-compartment-id $C \
--namespace "oci_oke" \
--query-text 'node_status[1m]{status = "NotReady"}.count() > 0' \
--severity "CRITICAL" \
--destinations '["<critical-topic-ocid>"]' \
--is-enabled true \
--body "OKE cluster has nodes in NotReady state"
Correlate Metrics with Logs
When an alarm fires, the first troubleshooting step is usually to check logs. Use the OCI Logging service alongside Monitoring to store and search application and infrastructure logs. Create saved searches in Logging that correspond to your alarm conditions, so you can quickly jump from an alarm notification to the relevant log entries. The OCI Console provides a unified view that lets you overlay metric graphs with log event timelines.
Key Takeaways
- MQL (Monitoring Query Language) enables powerful metric queries with statistical functions and grouping.
- Custom metrics allow publishing application-level data points to OCI Monitoring for unified visibility.
- The Monitoring Agent plugin provides OS-level metrics including memory and disk utilization.
- A layered monitoring strategy covers infrastructure, platform, application, and business metrics.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.