
Cloud Run for Production

Guide to running Cloud Run in production covering min instances, concurrency tuning, VPC connectivity, custom domains, traffic splitting, CPU allocation, and monitoring.

CloudToolStack Team · 22 min read · Published Mar 14, 2026

Prerequisites

  • Experience building and containerizing applications (Docker)
  • Basic understanding of GCP services
  • Familiarity with HTTP-based service architecture

Why Cloud Run for Production?

Google Cloud Run is a fully managed serverless platform that runs stateless containers. You provide a container image, Cloud Run handles everything else: provisioning infrastructure, scaling from zero to thousands of instances, load balancing, TLS termination, and health checking. You pay only for the compute time your container actually uses, measured in 100ms increments.

Cloud Run occupies a sweet spot between serverless functions (Cloud Functions/Lambda) and container orchestration (GKE/EKS). It gives you the operational simplicity of serverless with the flexibility of containers: any language, any framework, any binary, as long as it listens on a port and responds to HTTP requests. There is no cluster to manage, no node pool to size, and no control plane to pay for.

However, running Cloud Run in production requires understanding its configuration options and limitations. This guide covers the critical production configurations: minimum instances to eliminate cold starts, concurrency tuning for optimal performance, VPC connectivity for private resources, custom domains, traffic splitting for deployments, CPU allocation, startup probes, and monitoring.

Cloud Run Pricing

Cloud Run charges for CPU, memory, and requests. CPU costs $0.00002400/vCPU-second, memory costs $0.00000250/GiB-second, and requests cost $0.40/million. The free tier includes 2 million requests, 180,000 vCPU-seconds, and 360,000 GiB-seconds per month. With minimum instances, you pay for idle CPU and memory at 10% of the active rate (when the instance is not processing requests). Always-on CPU allocation changes this to 100% cost regardless of request activity.
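As a back-of-envelope check, the idle cost of a warm minimum instance can be estimated directly from these rates. The sketch below assumes a single 1-vCPU, 512Mi instance billed at the ~10% idle rate for a full 30-day month; treat it as an illustration, not a billing quote.

```shell
# Estimate monthly idle cost of one warm min instance (1 vCPU, 0.5 GiB)
awk 'BEGIN {
  seconds = 30 * 24 * 3600                       # seconds in a 30-day month
  cpu = 0.000024  * 1   * seconds * 0.10         # vCPU-seconds at ~10% idle rate
  mem = 0.0000025 * 0.5 * seconds * 0.10         # GiB-seconds at ~10% idle rate
  printf "idle cost per instance: $%.2f/month\n", cpu + mem
}'
```

With always-allocated CPU the same instance bills at the full rate, roughly ten times this figure, which is why the CPU allocation decision matters for cost.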

Deploying Your First Service

A Cloud Run service consists of one or more revisions, each backed by a container image. Every deployment creates a new revision, and you can control how traffic is distributed between revisions for canary deployments and rollbacks.

bash
# Build and push a container image using Cloud Build
gcloud builds submit \
  --tag gcr.io/my-project/my-api:v1.0.0 \
  --project my-project

# Or use Artifact Registry (recommended over Container Registry)
gcloud artifacts repositories create my-repo \
  --repository-format docker \
  --location us-central1

gcloud builds submit \
  --tag us-central1-docker.pkg.dev/my-project/my-repo/my-api:v1.0.0

# Deploy to Cloud Run
gcloud run deploy my-api \
  --image us-central1-docker.pkg.dev/my-project/my-repo/my-api:v1.0.0 \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --allow-unauthenticated \
  --service-account "my-api-sa@my-project.iam.gserviceaccount.com"

# Verify the deployment
gcloud run services describe my-api \
  --region us-central1 \
  --format="table(status.url, status.traffic[].percent, status.traffic[].revisionName)"

Eliminating Cold Starts with Minimum Instances

Cold starts are the primary production concern with Cloud Run. When a new instance needs to be created (because existing instances are at capacity or all instances have scaled to zero), there is a delay while the container starts up. This delay ranges from sub-second for lightweight Go/Rust binaries to 10+ seconds for JVM-based applications with large classpaths.

Minimum instances keep a specified number of container instances warm and ready to serve requests immediately. These idle instances cost 10% of the active CPU rate (or 100% if you use always-allocated CPU). For production APIs where latency matters, minimum instances are essential.

bash
# Set minimum instances to eliminate cold starts
gcloud run services update my-api \
  --region us-central1 \
  --min-instances 2 \
  --max-instances 100

# Configure startup and liveness probes. gcloud has no dedicated probe flags;
# export the service YAML, add the probes, and apply it with `services replace`.
gcloud run services describe my-api \
  --region us-central1 \
  --format export > service.yaml

# Add under spec.template.spec.containers[0] in service.yaml:
#
#   startupProbe:
#     httpGet:
#       path: /healthz
#       port: 8080
#     initialDelaySeconds: 0
#     timeoutSeconds: 3
#     periodSeconds: 3
#     failureThreshold: 10
#   livenessProbe:
#     httpGet:
#       path: /healthz
#     initialDelaySeconds: 10
#     timeoutSeconds: 3
#     periodSeconds: 30
#     failureThreshold: 3

gcloud run services replace service.yaml --region us-central1

Reduce Cold Start Duration

Before paying for minimum instances, optimize your container startup time. Use smaller base images (distroless or Alpine), minimize dependencies, defer heavy initialization to the first request using lazy loading, and precompile assets at build time. A Go or Rust service can start in under 100ms, making cold starts imperceptible. For JVM apps, use GraalVM native image or Class Data Sharing (CDS) to reduce startup from 10s to under 1s.
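As one illustration of the image-size advice above, a multi-stage build that ships a static Go binary on a distroless base might look like the sketch below. The module path and binary name are placeholders for your own project.

```dockerfile
# Build stage: compile a static binary so the runtime image needs no libc
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: distroless static image, a few MB, near-instant startup
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
# Cloud Run injects PORT; the server should listen on it (default 8080)
ENTRYPOINT ["/app"]
```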

Concurrency Configuration

Concurrency controls how many requests a single container instance handles simultaneously. The default is 80 concurrent requests per instance. This setting dramatically affects both performance and cost: higher concurrency means fewer instances (lower cost) but requires your application to handle parallel requests efficiently.

For CPU-bound workloads (image processing, ML inference), set concurrency to 1 so each request gets dedicated CPU. For I/O-bound workloads (API proxies, database queries), set concurrency high (80-250) since most time is spent waiting for external responses.
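A quick way to sanity-check a concurrency setting is Little's law: in-flight requests ≈ request rate × average latency. The numbers below (1000 RPS, 200 ms latency, concurrency 80) are illustrative assumptions, not measurements from the article.

```shell
# Estimate instance count from Little's law: inflight = RPS x latency (seconds)
awk 'BEGIN {
  rps = 1000; latency_s = 0.200; concurrency = 80
  inflight  = rps * latency_s                                  # ~200 in flight
  instances = int((inflight + concurrency - 1) / concurrency)  # ceiling division
  printf "approx instances needed: %d\n", instances
}'
```

Compare the result against your max-instances ceiling: if the estimate approaches it, requests will queue or be rejected during peaks.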

bash
# Set concurrency based on workload type
# I/O-bound API (database queries, external API calls)
gcloud run services update my-api \
  --region us-central1 \
  --concurrency 100

# CPU-bound worker (ML inference, image processing)
gcloud run services update my-worker \
  --region us-central1 \
  --concurrency 1 \
  --cpu 4 \
  --memory 8Gi

# Balanced web application
gcloud run services update my-web \
  --region us-central1 \
  --concurrency 50 \
  --cpu 2 \
  --memory 2Gi

Concurrency Guidelines

Workload Type          | Recommended Concurrency | CPU      | Memory
REST API (I/O-bound)   | 80-250                  | 1-2 vCPU | 512Mi-2Gi
GraphQL API            | 50-100                  | 2 vCPU   | 1-4Gi
Web application (SSR)  | 20-80                   | 1-2 vCPU | 512Mi-2Gi
ML inference           | 1-4                     | 4-8 vCPU | 4-32Gi
Image/video processing | 1                       | 4 vCPU   | 4-8Gi
Background worker      | 1-10                    | 1-2 vCPU | 512Mi-4Gi

CPU Allocation Strategy

Cloud Run offers two CPU allocation modes: request-based (the default) and always allocated. In request-based mode, CPU is allocated only while a request is being processed; between requests, your container has no CPU and cannot perform background work. In always-allocated mode, CPU is available continuously, enabling background tasks, connection pooling, and in-memory caching.

bash
# Always-allocated CPU (recommended for production APIs)
gcloud run services update my-api \
  --region us-central1 \
  --no-cpu-throttling  # Disables CPU throttling, i.e. always-allocated CPU

# Or explicitly set CPU allocation
gcloud run services update my-api \
  --region us-central1 \
  --cpu 2 \
  --memory 2Gi \
  --no-cpu-throttling \
  --execution-environment gen2  # Second gen for better CPU performance

# For GPU workloads (ML inference)
gcloud run services update my-ml-service \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 8 \
  --memory 32Gi \
  --concurrency 1 \
  --no-cpu-throttling

CPU Throttling Impacts Connection Pools

With request-based CPU allocation (default), your container has no CPU between requests. This means database connection pools cannot send keepalive packets, WebSocket connections drop, and in-memory caches become stale. If your application uses connection pools (Postgres, Redis, etc.), you must use always-allocated CPU or accept that connections will be re-established for each request, increasing latency.

VPC Connectivity

By default, Cloud Run services have outbound internet access but cannot reach resources in your VPC (private Cloud SQL instances, Memorystore Redis, GKE internal services). To connect to VPC resources, you need either a VPC connector (Serverless VPC Access) or Direct VPC egress.

bash
# Option 1: Direct VPC egress (recommended, newer approach)
gcloud run services update my-api \
  --region us-central1 \
  --network shared-vpc-network \
  --subnet snet-cloud-run \
  --network-tags cloud-run-egress \
  --vpc-egress all-traffic

# Option 2: Serverless VPC Access connector (older approach, still supported)
gcloud compute networks vpc-access connectors create cloud-run-connector \
  --region us-central1 \
  --network shared-vpc-network \
  --range 10.99.0.0/28 \
  --min-instances 2 \
  --max-instances 10 \
  --machine-type e2-micro

# Attach the connector to your service
gcloud run services update my-api \
  --region us-central1 \
  --vpc-connector cloud-run-connector \
  --vpc-egress all-traffic

# Connect to Cloud SQL via the built-in Auth Proxy (Unix socket)
gcloud run services update my-api \
  --region us-central1 \
  --add-cloudsql-instances my-project:us-central1:my-database \
  --set-env-vars "DB_HOST=/cloudsql/my-project:us-central1:my-database"

# Or connect via Private IP with VPC connector
gcloud run services update my-api \
  --region us-central1 \
  --vpc-connector cloud-run-connector \
  --set-env-vars "DB_HOST=10.0.0.5,DB_PORT=5432"

Custom Domains and HTTPS

Cloud Run automatically provides a *.run.app domain with a managed TLS certificate. For production services, you will want a custom domain. Cloud Run supports custom domain mapping with automatic certificate provisioning, or you can use a global external Application Load Balancer for advanced routing and CDN integration.

bash
# Option 1: Cloud Run domain mapping (simple)
gcloud run domain-mappings create \
  --service my-api \
  --domain api.contoso.com \
  --region us-central1

# Get the DNS records to configure
gcloud run domain-mappings describe \
  --domain api.contoso.com \
  --region us-central1 \
  --format="table(resourceRecords[].type, resourceRecords[].rrdata)"

# Option 2: Global External Application Load Balancer (production)
# This provides CDN, WAF, multi-region routing, and advanced traffic management

# Create a serverless NEG for Cloud Run
gcloud compute network-endpoint-groups create my-api-neg \
  --region us-central1 \
  --network-endpoint-type serverless \
  --cloud-run-service my-api

# Create backend service
gcloud compute backend-services create my-api-backend \
  --global \
  --load-balancing-scheme EXTERNAL_MANAGED \
  --protocol HTTPS

gcloud compute backend-services add-backend my-api-backend \
  --global \
  --network-endpoint-group my-api-neg \
  --network-endpoint-group-region us-central1

# Create URL map
gcloud compute url-maps create my-api-urlmap \
  --default-service my-api-backend

# Create managed SSL certificate
gcloud compute ssl-certificates create my-api-cert \
  --domains api.contoso.com \
  --global

# Create HTTPS proxy and forwarding rule
gcloud compute target-https-proxies create my-api-https-proxy \
  --url-map my-api-urlmap \
  --ssl-certificates my-api-cert

gcloud compute forwarding-rules create my-api-lb \
  --global \
  --load-balancing-scheme EXTERNAL_MANAGED \
  --target-https-proxy my-api-https-proxy \
  --ports 443

# Enable Cloud CDN on the backend
gcloud compute backend-services update my-api-backend \
  --global \
  --enable-cdn \
  --cache-mode CACHE_ALL_STATIC

Load Balancer Benefits

Using a Global Application Load Balancer instead of direct Cloud Run domain mapping adds cost ($0.025/hour + data processing) but provides: multi-region routing with automatic failover, Cloud CDN for static asset caching, Cloud Armor WAF for DDoS protection and IP filtering, advanced URL routing (path-based, header-based), and the ability to serve multiple Cloud Run services behind a single domain.

Traffic Splitting and Canary Deployments

Cloud Run revisions enable sophisticated deployment strategies. Every deployment creates a new revision, and you can split traffic between revisions for canary releases, A/B testing, or gradual rollouts. You can also tag revisions with custom URLs for testing before routing production traffic.

bash
# Deploy a new revision without routing traffic to it
gcloud run deploy my-api \
  --image us-central1-docker.pkg.dev/my-project/my-repo/my-api:v2.0.0 \
  --region us-central1 \
  --no-traffic \
  --tag canary

# The canary revision gets a URL like: canary---my-api-xyz.a.run.app
# Test it thoroughly before routing traffic

# Route 10% of traffic to the canary
gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-tags canary=10

# Check metrics, errors, latency...
# If healthy, increase to 50%
gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-tags canary=50

# If everything looks good, route 100%
gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-latest

# If something goes wrong, instant rollback
gcloud run services update-traffic my-api \
  --region us-central1 \
  --to-revisions my-api-v1-revision=100

# View current traffic split
gcloud run services describe my-api \
  --region us-central1 \
  --format="table(status.traffic[].percent, status.traffic[].revisionName, status.traffic[].tag)"

Environment Configuration and Secrets

Production services need configuration values and secrets. Cloud Run integrates with Secret Manager for sensitive values and supports environment variables for non-sensitive configuration. Secrets can be mounted as environment variables or files.

bash
# Create secrets in Secret Manager
echo -n "my-database-password" | gcloud secrets create db-password --data-file=-
echo -n "sk-my-api-key-value" | gcloud secrets create external-api-key --data-file=-

# Grant Cloud Run service account access to secrets
gcloud secrets add-iam-policy-binding db-password \
  --member "serviceAccount:my-api-sa@my-project.iam.gserviceaccount.com" \
  --role "roles/secretmanager.secretAccessor"

gcloud secrets add-iam-policy-binding external-api-key \
  --member "serviceAccount:my-api-sa@my-project.iam.gserviceaccount.com" \
  --role "roles/secretmanager.secretAccessor"

# Deploy with secrets and environment variables.
# PORT is reserved by Cloud Run and must not be set; env-var and file-mounted
# secrets go in a single --set-secrets flag, since repeating it overwrites.
gcloud run services update my-api \
  --region us-central1 \
  --set-env-vars "NODE_ENV=production,LOG_LEVEL=info" \
  --set-secrets "DB_PASSWORD=db-password:latest,/secrets/api-key=external-api-key:latest" \
  --service-account "my-api-sa@my-project.iam.gserviceaccount.com"

# Use specific secret versions (recommended for reproducibility)
gcloud run services update my-api \
  --region us-central1 \
  --set-secrets "DB_PASSWORD=db-password:3"

Monitoring and Alerting

Cloud Run publishes metrics to Cloud Monitoring automatically, including request count, latency percentiles, instance count, CPU and memory utilization, and container startup latency. Set up dashboards and alerts for production visibility.

bash
# View recent logs
gcloud logging read 'resource.type="cloud_run_revision"
  AND resource.labels.service_name="my-api"
  AND severity>=ERROR' \
  --limit 20 \
  --format="table(timestamp, severity, textPayload)"

# Create an alert when the 5xx response rate exceeds 0.01 requests/second
# (flag-based policy creation is available via the gcloud alpha surface)
gcloud alpha monitoring policies create \
  --display-name "Cloud Run High Error Rate" \
  --condition-display-name "5xx rate > 0.01 req/s" \
  --condition-filter 'resource.type="cloud_run_revision"
    AND resource.labels.service_name="my-api"
    AND metric.type="run.googleapis.com/request_count"
    AND metric.labels.response_code_class="5xx"' \
  --condition-threshold-value 0.01 \
  --condition-threshold-duration 300s \
  --aggregation-alignment-period 60s \
  --aggregation-per-series-aligner ALIGN_RATE \
  --notification-channels "projects/my-project/notificationChannels/123"

# Create an alert for high latency (p99 > 2 seconds)
gcloud alpha monitoring policies create \
  --display-name "Cloud Run High Latency" \
  --condition-display-name "p99 latency > 2s" \
  --condition-filter 'resource.type="cloud_run_revision"
    AND resource.labels.service_name="my-api"
    AND metric.type="run.googleapis.com/request_latencies"' \
  --condition-threshold-value 2000 \
  --condition-threshold-duration 300s \
  --notification-channels "projects/my-project/notificationChannels/123"

# Cloud Run reports request traces to Cloud Trace automatically; for custom
# spans, instrument your application with OpenTelemetry or the Cloud Trace
# client libraries — there is no service-level flag to turn tracing on.

# Create a custom dashboard
gcloud monitoring dashboards create --config='{
  "displayName": "Cloud Run - my-api",
  "gridLayout": {
    "widgets": [
      {
        "title": "Request Count",
        "xyChart": {
          "dataSets": [{
            "timeSeriesQuery": {
              "timeSeriesFilter": {
                "filter": "resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count"",
                "aggregation": {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}
              }
            }
          }]
        }
      }
    ]
  }
}'

Production Deployment Checklist

Before going to production, ensure your Cloud Run service is configured correctly. Use this checklist as a starting point and customize for your requirements.

Configuration         | Development           | Production
Min instances         | 0                     | 2+ (per region)
Max instances         | 10                    | 100+ (based on load testing)
CPU allocation        | Request-based         | Always allocated
Concurrency           | Default (80)          | Tuned to workload type
Execution environment | Gen1                  | Gen2 (better performance)
VPC connectivity      | None or connector     | Direct VPC egress
Secrets               | Env vars              | Secret Manager
Domain                | *.run.app             | Custom domain + LB
Authentication        | Allow unauthenticated | Require authentication or LB + IAP
Health checks         | None                  | Startup + liveness probes
Monitoring            | Default metrics       | Custom alerts + dashboards

Request Timeout

Cloud Run has a maximum request timeout of 60 minutes (default 5 minutes). For long-running operations like file processing or report generation, either increase the timeout or use Cloud Tasks to offload work to a background Cloud Run service. Cloud Tasks provides at-least-once delivery, automatic retries, and rate limiting, making it ideal for workloads that should not run within a synchronous HTTP request.
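A sketch of that pattern, with placeholder names: raise the service timeout for moderately long requests, and create a Cloud Tasks queue that dispatches truly long work to a separate worker service.

```bash
# Raise the request timeout (seconds; maximum is 3600)
gcloud run services update my-api \
  --region us-central1 \
  --timeout 900

# Queue for offloaded work; Cloud Tasks retries failed dispatches automatically
gcloud tasks queues create report-queue \
  --location us-central1 \
  --max-dispatches-per-second 10 \
  --max-attempts 5
```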

Next Steps

With your Cloud Run service configured for production, explore these advanced patterns:

Cloud Run Jobs: For batch processing and scheduled tasks, use Cloud Run Jobs instead of Services. Jobs run to completion and exit, ideal for ETL pipelines, data migration, and periodic maintenance.

Multi-region deployment: Deploy your service in multiple regions with a global load balancer for low-latency worldwide access and regional failover.

Cloud Run with GKE: For services that need GPU access, persistent volumes, or sidecar containers, deploy Cloud Run on GKE Autopilot using Knative serving.

Event-driven architecture: Trigger Cloud Run services from Pub/Sub messages, Eventarc events, or Cloud Scheduler for event-driven processing without maintaining polling infrastructure.


Key Takeaways

  1. Minimum instances eliminate cold starts but cost 10% of the active CPU rate when idle.
  2. Concurrency should be tuned based on workload type: high for I/O-bound, low for CPU-bound.
  3. Always-allocated CPU is required for connection pools, background tasks, and in-memory caching.
  4. Direct VPC egress is the recommended approach for connecting to private resources.
  5. Traffic splitting between revisions enables canary deployments and instant rollbacks.
  6. A Global Application Load Balancer adds CDN, WAF, and multi-region routing capabilities.

Frequently Asked Questions

How do I eliminate cold starts on Cloud Run?
Set min-instances to at least 2 for production services. Additionally, optimize container startup time: use smaller base images, minimize dependencies, defer heavy initialization, and use startup probes. A well-optimized Go service starts in under 100ms.
What concurrency should I set?
For I/O-bound APIs (database queries, external calls): 80-250. For CPU-bound work (ML inference, image processing): 1-4. For web applications with server-side rendering: 20-80. Test with load testing to find the optimal value for your workload.
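To find that value empirically, run a load test at each candidate concurrency setting and compare p99 latency and instance counts between runs. For example, with the open-source `hey` tool (the URL below is a placeholder for your service):

```bash
# 60-second load test with 200 concurrent workers against the service URL
hey -z 60s -c 200 https://my-api-xyz-uc.a.run.app/healthz
```

Repeat at each concurrency setting; the sweet spot is the highest value before p99 latency degrades.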
Should I use always-allocated CPU?
Yes for production APIs. Request-based CPU allocation (default) removes CPU between requests, breaking connection pools, WebSocket connections, and in-memory caches. Always-allocated CPU costs more but provides consistent behavior.
How do I connect Cloud Run to Cloud SQL?
Use the Cloud SQL Auth Proxy built into Cloud Run (--add-cloudsql-instances flag) for public IP connections, or use VPC connector/Direct VPC egress with the database's private IP. Private IP is recommended for production.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.