Vertex AI Cost Estimator

Estimate Vertex AI costs for Gemini API tokens, custom training with GPUs, prediction endpoints, and AutoML.

Last verified: May 2026

Vertex AI Pricing Guide

Gemini Model Comparison

Gemini 1.5 Flash has the lowest input rate at $0.075/1M tokens and balances speed and cost. Gemini 2.0 Flash is also cost-effective for high-volume workloads at $0.10/1M input tokens. Gemini 1.5 Pro offers the best quality for complex tasks at $1.25-$2.50/1M input tokens, depending on context length. Prompts exceeding 128K tokens are billed at long-context pricing (2x the standard rate).
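As a rough illustration of the long-context rule, the sketch below computes monthly input-token cost from the rates quoted in this guide. The model keys and the function name are hypothetical labels, and applying the 2x multiplier uniformly to every model is an assumption taken from the sentence above rather than from a price sheet.

```python
# Input-token rates (USD per 1M tokens) as quoted in this guide.
# The dictionary keys are illustrative labels, not official model IDs.
INPUT_RATE_PER_M = {
    "gemini-2.0-flash": 0.10,
    "gemini-1.5-pro": 1.25,
    "gemini-1.5-flash": 0.075,
}
LONG_CONTEXT_THRESHOLD = 128_000  # tokens per prompt
LONG_CONTEXT_MULTIPLIER = 2.0     # assumption: the guide's "2x rates" rule, applied to all models


def monthly_input_cost(model: str, tokens_per_prompt: int, prompts_per_month: int) -> float:
    """Estimate monthly input-token cost in USD for one model."""
    rate = INPUT_RATE_PER_M[model]
    if tokens_per_prompt > LONG_CONTEXT_THRESHOLD:
        rate *= LONG_CONTEXT_MULTIPLIER
    total_tokens = tokens_per_prompt * prompts_per_month
    return total_tokens / 1_000_000 * rate


if __name__ == "__main__":
    print(f"{monthly_input_cost('gemini-1.5-pro', 50_000, 100_000):,.2f}")
    print(f"{monthly_input_cost('gemini-1.5-pro', 200_000, 1_000):,.2f}")
```

Output tokens are left out because this guide does not list output rates; a fuller estimate would add them as a separate term.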

GPU Selection Guide

T4 ($0.35/hr): inference and light training. V100 ($2.48/hr): general-purpose training. A100 40GB ($2.95/hr): large model training with high memory bandwidth. A100 80GB ($3.67/hr): for models exceeding 40GB GPU memory. H100 ($12.24/hr): latest generation, best for LLM fine-tuning and large-scale training with up to 3x throughput over A100.
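Hourly price alone does not decide which GPU is cheaper for a given job, since a faster GPU finishes sooner. The sketch below compares cost per job using the hourly rates listed above. The function names are hypothetical, and the speedup factor is an input you would measure for your own workload; 3x is only the upper bound quoted in this guide.

```python
# Hourly GPU rates (USD) as listed in this guide.
GPU_RATE_PER_HOUR = {
    "T4": 0.35,
    "V100": 2.48,
    "A100-40GB": 2.95,
    "A100-80GB": 3.67,
    "H100": 12.24,
}


def job_cost(gpu: str, baseline_hours: float, speedup: float = 1.0) -> float:
    """Cost of a job that takes `baseline_hours` on a reference GPU,
    when run on `gpu`, which is `speedup` times faster than the reference."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return GPU_RATE_PER_HOUR[gpu] * baseline_hours / speedup


def break_even_speedup(fast_gpu: str, reference_gpu: str) -> float:
    """Speedup the faster GPU needs over the reference to match its cost per job."""
    return GPU_RATE_PER_HOUR[fast_gpu] / GPU_RATE_PER_HOUR[reference_gpu]


if __name__ == "__main__":
    # A job that takes 30 hours on an A100 80GB, versus an H100 at the quoted 3x.
    print(round(job_cost("A100-80GB", 30), 2))
    print(round(job_cost("H100", 30, speedup=3.0), 2))
    print(round(break_even_speedup("H100", "A100-80GB"), 2))
```

At these list prices the H100 needs more than a 3x speedup to match the A100 80GB on cost per job, so under this sketch's assumptions its advantage is wall-clock time rather than spend.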

Training Cost Optimization

Use preemptible (Spot) VMs for roughly 60-91% savings on training jobs that can tolerate interruption. Start with smaller machine types and scale up based on GPU utilization metrics. Hyperparameter tuning jobs are good candidates for Spot capacity. Consider distributed training across multiple smaller GPUs instead of a single large GPU when that gives a better cost-performance ratio.
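To see what the 60-91% range means for a budget, the following sketch turns an on-demand estimate into a best-case and worst-case preemptible cost. The function name and the `rerun_overhead` parameter (extra compute spent redoing interrupted work) are hypothetical additions for illustration.

```python
def preemptible_cost_range(
    on_demand_cost: float,
    min_discount: float = 0.60,   # low end of the savings range quoted above
    max_discount: float = 0.91,   # high end of the savings range quoted above
    rerun_overhead: float = 0.0,  # assumption: fraction of work repeated after interruptions
) -> tuple[float, float]:
    """Return (best_case, worst_case) cost in USD for a preemptible run."""
    effective = on_demand_cost * (1 + rerun_overhead)
    best_case = effective * (1 - max_discount)
    worst_case = effective * (1 - min_discount)
    return best_case, worst_case


if __name__ == "__main__":
    best, worst = preemptible_cost_range(1_000.0, rerun_overhead=0.20)
    print(f"best ${best:,.2f}, worst ${worst:,.2f}")
```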

Prediction Autoscaling

Configure min/max replicas to balance cost and latency. Set minimum replicas to handle baseline traffic without cold starts. Where scale-to-zero is available, use it for development endpoints; otherwise undeploy idle models, because a deployed endpoint bills for its minimum replicas even with no traffic. Monitor CPU/GPU utilization to right-size machine types. Consider traffic splitting to gradually shift traffic to new model versions.
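For sizing purposes, the replica count at a given load can be approximated as below. This is a back-of-the-envelope sketch, not the platform's autoscaling algorithm (which reacts to utilization metrics); `qps_per_replica` is a hypothetical figure you would obtain from load testing.

```python
import math


def replicas_needed(qps: float, qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replicas required for `qps`, clamped to the configured min/max bounds."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    wanted = math.ceil(qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for load in (0, 45, 500):
        print(load, replicas_needed(load, qps_per_replica=20, min_replicas=1, max_replicas=5))
```

Multiplying the result by the node-hour rate gives the cost at each traffic level; the `min_replicas` value sets the floor you pay even when idle.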

How It Helps

The Vertex AI Cost Estimator helps you project monthly costs for Google Cloud's Vertex AI platform including Gemini API token usage, custom model training with GPUs, online and batch prediction endpoints, and AutoML training. Each pricing dimension varies significantly, and this tool consolidates them into a single estimate with component-level breakdowns.

Things Engineers Ask

How is Gemini API pricing structured?

The Gemini API charges per million tokens, with separate rates for input (prompt) and output (completion) tokens. Pricing varies by model: Gemini 1.5 Flash is the most economical, Gemini 1.5 Pro is mid-range, and Gemini Ultra is premium. Context caching reduces input costs for repeated prefixes. Provisioned throughput is available for guaranteed capacity.

How are custom training costs calculated?

Custom training charges per accelerator-hour (GPU or TPU) and per machine-hour (vCPU and memory). A training job running 4 A100 GPUs for 10 hours consumes 40 accelerator-hours (4 × 10), billed at the GPU rate, plus the machine-hour charges for the host VM. Preemptible training VMs offer roughly 60-91% savings but may be interrupted.
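The two billing terms can be written out as a short calculation. In this sketch the A100 40GB rate ($2.95/hr) comes from the GPU guide above, while the $4.00/hr machine rate is a placeholder assumption; substitute the rate for your actual machine type.

```python
def training_cost(
    gpu_count: int,
    hours: float,
    gpu_rate_per_hour: float,
    machine_rate_per_hour: float,
    machine_count: int = 1,
    discount: float = 0.0,  # e.g. 0.60 for a preemptible run
) -> float:
    """Accelerator-hours x GPU rate plus machine-hours x machine rate, in USD."""
    accelerator_hours = gpu_count * hours
    machine_hours = machine_count * hours
    on_demand = (accelerator_hours * gpu_rate_per_hour
                 + machine_hours * machine_rate_per_hour)
    return on_demand * (1 - discount)


if __name__ == "__main__":
    # 4 x A100 40GB for 10 hours; the machine rate is a hypothetical placeholder.
    print(round(training_cost(4, 10, gpu_rate_per_hour=2.95, machine_rate_per_hour=4.00), 2))
```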

What is the difference between online and batch prediction?

Online prediction endpoints run continuously, charging per node-hour for the deployed machine type. Batch prediction processes large datasets in bulk and charges per node-hour only during processing. Use online prediction for real-time inference and batch prediction for large-scale offline scoring.
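The billing difference is easiest to see side by side. The sketch below assumes a 730-hour month and a placeholder node rate of $0.75/hr; both function names and the rate are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def online_monthly_cost(node_rate_per_hour: float, replicas: int = 1) -> float:
    """Online endpoints bill every hour the nodes are deployed, traffic or not."""
    return node_rate_per_hour * replicas * HOURS_PER_MONTH


def batch_monthly_cost(node_rate_per_hour: float, nodes: int,
                       hours_per_run: float, runs_per_month: int) -> float:
    """Batch prediction bills node-hours only while jobs are processing."""
    return node_rate_per_hour * nodes * hours_per_run * runs_per_month


if __name__ == "__main__":
    rate = 0.75  # hypothetical node-hour rate in USD
    print(online_monthly_cost(rate))
    print(batch_monthly_cost(rate, nodes=4, hours_per_run=2, runs_per_month=30))
```

Under these placeholder numbers, a nightly two-hour batch job on four nodes costs less than one always-on online node, which is why offline scoring is usually run as batch.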

In Practice

Your team is building a document Q&A app over a 10K-document corpus. Initial design: every query sends the full document context (avg 50K tokens) to Gemini 1.5 Pro for generation. Cost projection: 100K queries/month × 50K input tokens × $1.25/M = $6,250/month. The estimator suggests two optimizations: (1) Context caching for the document context saves 75% on input tokens, about $4,700/month, leaving roughly $1,560/month. (2) Routing tier-1 questions to Gemini 1.5 Flash and falling back to Pro for complex queries cuts the remainder by a further 60%. Final cost: ~$625/month, a 90% reduction from baseline.
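The arithmetic in this scenario can be checked step by step. The sketch below uses only the figures stated above; the function name is hypothetical, and the 60% routing reduction is applied to the post-caching cost, which reproduces the roughly 90% overall reduction.

```python
def scenario() -> dict[str, float]:
    """Recompute the document Q&A cost scenario from its stated inputs."""
    queries_per_month = 100_000
    input_tokens_per_query = 50_000
    pro_rate_per_m = 1.25       # USD per 1M input tokens (Gemini 1.5 Pro)
    caching_savings = 0.75      # share of input cost removed by context caching
    routing_savings = 0.60      # further reduction from routing easy queries to Flash

    baseline = queries_per_month * input_tokens_per_query / 1_000_000 * pro_rate_per_m
    after_caching = baseline * (1 - caching_savings)
    after_routing = after_caching * (1 - routing_savings)
    return {
        "baseline": baseline,
        "after_caching": after_caching,
        "after_routing": after_routing,
        "total_reduction": 1 - after_routing / baseline,
    }


if __name__ == "__main__":
    for name, value in scenario().items():
        print(f"{name}: {value:,.2f}")
```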

Practical Applications

1. Estimating monthly Gemini API costs for a conversational AI application based on projected token volume.
2. Calculating custom training costs for a machine learning experiment using A100 or T4 GPUs.
3. Modeling online prediction endpoint costs based on expected QPS and machine type selection.
4. Comparing AutoML training costs against custom training for tabular, image, and text models.

Behind the Scenes

The estimator handles each Vertex AI service category separately: Gemini API (input tokens × input rate + output tokens × output rate, per 1M tokens by model), custom training (machine-hours × rate + accelerator-hours × rate by GPU/TPU type), online prediction (machine-hours × rate, billed continuously), and batch prediction (machine-hours × rate during processing only). It applies free tier allowances and provisioned throughput discounts where applicable.
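A component-level breakdown of this kind can be sketched as a list of quantity × rate line items grouped by category. This is an illustrative structure, not the tool's actual source: `LineItem` and `estimate` are hypothetical names, the $0.30/1M output rate is a placeholder, and free tier allowances and discounts are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    category: str      # e.g. "gemini_api", "custom_training"
    quantity: float    # millions of tokens, machine-hours, accelerator-hours, ...
    unit_rate: float   # USD per unit of `quantity`

    @property
    def cost(self) -> float:
        return self.quantity * self.unit_rate


def estimate(items: list[LineItem]) -> tuple[dict[str, float], float]:
    """Return (cost per category, grand total) in USD."""
    breakdown: dict[str, float] = {}
    for item in items:
        breakdown[item.category] = breakdown.get(item.category, 0.0) + item.cost
    return breakdown, sum(breakdown.values())


if __name__ == "__main__":
    items = [
        LineItem("gemini_api", quantity=5_000, unit_rate=0.075),   # input, M tokens
        LineItem("gemini_api", quantity=500, unit_rate=0.30),      # output, placeholder rate
        LineItem("custom_training", quantity=40, unit_rate=2.95),  # A100 40GB accelerator-hours
    ]
    breakdown, total = estimate(items)
    print(breakdown, round(total, 2))
```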

Things the Docs Don’t Tell You

TIP

At the input rates listed in this guide ($0.075 vs. $1.25 per 1M tokens), Gemini 1.5 Flash is roughly 17x cheaper than Gemini 1.5 Pro for input tokens, and it benchmarks competitively for most enterprise use cases. Always start with Flash and upgrade to Pro only when measurable quality issues appear, not before. Many teams default to Pro 'just to be safe' and spend about 17x what they need to on input tokens.

TIP

Context caching for repeated prefixes (e.g., long system prompts, retrieved RAG context) can reduce input token costs by 75%+. The minimum cached context is 32K tokens. For RAG architectures with consistent system prompts, this is a near-free optimization once enabled.
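The effect of the 32K minimum can be illustrated with a small cost function. This is a simplified sketch: it treats "32K" as 32,768 tokens, applies a flat 75% discount to the cached prefix, and ignores any cache storage or cache-write charges. The function and parameter names are hypothetical.

```python
def input_cost_with_caching(
    prefix_tokens: int,             # repeated prefix (system prompt, shared context)
    dynamic_tokens: int,            # per-request tokens that change every call
    requests: int,
    rate_per_m: float,              # USD per 1M uncached input tokens
    cached_discount: float = 0.75,  # savings quoted above for cached tokens
    min_cache_tokens: int = 32_768, # assumption: "32K" read as 32,768
) -> float:
    """Monthly input cost in USD, discounting the prefix only if it is cacheable."""
    cacheable = prefix_tokens >= min_cache_tokens
    prefix_rate = rate_per_m * (1 - cached_discount) if cacheable else rate_per_m
    prefix_cost = prefix_tokens * requests / 1_000_000 * prefix_rate
    dynamic_cost = dynamic_tokens * requests / 1_000_000 * rate_per_m
    return prefix_cost + dynamic_cost


if __name__ == "__main__":
    # 40K-token prefix qualifies for caching; a 10K-token prefix does not.
    print(round(input_cost_with_caching(40_000, 2_000, 100_000, 1.25), 2))
    print(round(input_cost_with_caching(10_000, 1_000, 1_000, 1.25), 2))
```

Prefixes below the minimum get no discount in this sketch, so short system prompts see no benefit from caching.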

TIP

Online prediction endpoints bill for the deployed machine 24/7 even at zero QPS. A common mistake is deploying a model to a multi-GPU endpoint for testing and forgetting to delete it. A single n1-standard-8 + T4 GPU left running for a month costs ~$400 with zero predictions served. Always set deletion alarms on prediction endpoints.

Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, and GCP are trademarks of their respective owners.