Estimate Vertex AI costs for Gemini API tokens, custom training with GPUs, prediction endpoints, and AutoML.
Last verified: May 2026
Gemini 1.5 Flash has the lowest listed input price at $0.075/1M tokens and balances speed and cost. Gemini 2.0 Flash is a cost-effective choice for high-volume workloads at $0.10/1M input tokens. Gemini 1.5 Pro offers the best quality for complex tasks at $1.25-$2.50/1M input tokens depending on context length. Prompts exceeding 128K tokens use long-context pricing (2x rates).
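The input-token arithmetic above can be sketched in a few lines. This is a minimal illustration, not the estimator's actual code: the model keys are hypothetical labels, the rates are the ones quoted in this section, and applying the 2x long-context multiplier uniformly to every model is an assumption. Output-token rates are not stated here, so the sketch covers input cost only.

```python
# Input rates in USD per 1M tokens, copied from the figures quoted in this section.
# The dictionary keys are illustrative labels, not official model IDs.
INPUT_RATE_PER_M = {
    "gemini-2.0-flash": 0.10,
    "gemini-1.5-pro": 1.25,
    "gemini-1.5-flash": 0.075,
}
LONG_CONTEXT_THRESHOLD = 128_000  # prompts above this use long-context pricing
LONG_CONTEXT_MULTIPLIER = 2.0     # assumption: applied the same way to every model


def input_cost(model: str, tokens_per_prompt: int, prompts: int) -> float:
    """Return the estimated input-token cost in USD for a batch of prompts."""
    rate = INPUT_RATE_PER_M[model]
    if tokens_per_prompt > LONG_CONTEXT_THRESHOLD:
        rate *= LONG_CONTEXT_MULTIPLIER
    return tokens_per_prompt * prompts * rate / 1_000_000


if __name__ == "__main__":
    # 100K prompts of 50K tokens each on Gemini 1.5 Pro (standard context).
    print(f"${input_cost('gemini-1.5-pro', 50_000, 100_000):,.2f}")
    # 10 prompts of 200K tokens each, which crosses the 128K threshold.
    print(f"${input_cost('gemini-1.5-pro', 200_000, 10):,.2f}")
```

Keeping the threshold and multiplier as named constants makes it easy to update the sketch when published rates change.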
T4 ($0.35/hr): inference and light training. V100 ($2.48/hr): general-purpose training. A100 40GB ($2.95/hr): large model training with high memory bandwidth. A100 80GB ($3.67/hr): for models exceeding 40GB GPU memory. H100 ($12.24/hr): latest generation, best for LLM fine-tuning and large-scale training with up to 3x throughput over A100.
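A faster GPU is not automatically cheaper per job, and a rough comparison makes the trade-off concrete. The sketch below uses the hourly rates listed above and treats the "up to 3x" H100 figure as a best-case speedup parameter; the `job_cost` helper and the 30-hour baseline are hypothetical, and machine-hour charges are left out.

```python
# Accelerator rates in USD per hour, as listed in this section.
GPU_RATE_PER_HOUR = {
    "T4": 0.35,
    "V100": 2.48,
    "A100_40GB": 2.95,
    "A100_80GB": 3.67,
    "H100": 12.24,
}


def job_cost(gpu: str, baseline_hours: float, speedup: float = 1.0,
             gpu_count: int = 1) -> float:
    """Accelerator cost of a job that takes baseline_hours at speedup 1.0.

    speedup is an assumed throughput multiple relative to the baseline GPU;
    real speedups depend on the model, batch size, and precision.
    """
    hours = baseline_hours / speedup
    return GPU_RATE_PER_HOUR[gpu] * gpu_count * hours


def break_even_speedup(fast_gpu: str, slow_gpu: str) -> float:
    """Speedup the faster GPU needs to match the slower GPU's cost per job."""
    return GPU_RATE_PER_HOUR[fast_gpu] / GPU_RATE_PER_HOUR[slow_gpu]


if __name__ == "__main__":
    print(f"A100 80GB, 30 h: ${job_cost('A100_80GB', 30):.2f}")
    print(f"H100 at 3x:      ${job_cost('H100', 30, speedup=3.0):.2f}")
    print(f"Break-even:      {break_even_speedup('H100', 'A100_80GB'):.2f}x")
```

With these listed rates, the H100 needs a speedup of a little over 3.3x to match the A100 80GB on accelerator cost alone, so its main benefit at 3x is shorter wall-clock time rather than a lower bill.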
Use preemptible VMs for 60-91% savings on training jobs that can tolerate interruption. Start with smaller machine types and scale up based on GPU utilization metrics. Use spot instances for hyperparameter tuning jobs. Consider distributed training across multiple smaller GPUs instead of a single large GPU for a better cost-performance ratio.
Configure min/max replicas to balance cost and latency. Set minimum replicas to handle baseline traffic without cold starts. Use scale-to-zero for development endpoints to avoid idle costs. Monitor CPU/GPU utilization to right-size machine types. Consider traffic splitting to gradually shift traffic to new model versions.
The Vertex AI Cost Estimator helps you project monthly costs for Google Cloud's Vertex AI platform including Gemini API token usage, custom model training with GPUs, online and batch prediction endpoints, and AutoML training. Each pricing dimension varies significantly, and this tool consolidates them into a single estimate with component-level breakdowns.
Gemini API charges per million tokens for input (prompt) and output (completion) separately. Pricing varies by model: Gemini 1.5 Flash is the most economical, Gemini 1.5 Pro is mid-range, and Gemini Ultra is premium. Context caching reduces input costs for repeated prefixes. Provisioned throughput is available for guaranteed capacity.
Custom training charges per accelerator-hour (GPU or TPU) and per machine-hour (vCPU and memory). A training job running 4 A100 GPUs for 10 hours is billed for 40 accelerator-hours plus the machine-hour charges. Preemptible training VMs offer savings of 60% or more but may be interrupted.
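The accelerator-hour and machine-hour split can be sketched as follows. The function and its parameters are hypothetical. The $1.00/hr machine rate in the example is a placeholder rather than a published price, and applying the preemptible discount to both components is an assumption.

```python
def training_cost(gpu_rate: float, gpu_count: int, hours: float,
                  machine_rate: float, machine_count: int = 1,
                  preemptible_discount: float = 0.0) -> float:
    """Estimate custom training cost in USD.

    gpu_rate and machine_rate are hourly prices supplied by the caller.
    preemptible_discount is a fraction (0.60 means 60% off); applying it to
    both accelerator and machine charges is an assumption of this sketch.
    """
    accelerator_hours = gpu_count * hours    # e.g. 4 GPUs x 10 h = 40
    machine_hours = machine_count * hours
    on_demand = accelerator_hours * gpu_rate + machine_hours * machine_rate
    return on_demand * (1 - preemptible_discount)


if __name__ == "__main__":
    # 4 x A100 40GB ($2.95/hr each) for 10 hours; machine rate is a placeholder.
    on_demand = training_cost(2.95, 4, 10, machine_rate=1.00)
    preemptible = training_cost(2.95, 4, 10, machine_rate=1.00,
                                preemptible_discount=0.60)
    print(f"on-demand: ${on_demand:.2f}, preemptible: ${preemptible:.2f}")
```

An interrupted preemptible job may need to repeat work from its last checkpoint, so the realized savings can be lower than the headline discount.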
Online prediction endpoints run continuously, charging per node-hour for the deployed machine type. Batch prediction processes large datasets in bulk and charges per node-hour only during processing. Use online prediction for real-time inference and batch prediction for large-scale offline scoring.
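The difference between continuous and processing-only billing is easiest to see side by side. In this sketch the $0.50 node-hour rate is a placeholder, the 730-hour month is an averaging assumption, and both helper functions are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a calendar month


def online_monthly_cost(node_rate: float, replicas: int = 1,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Online endpoints bill every deployed replica for every hour it is up."""
    return node_rate * replicas * hours


def batch_cost(node_rate: float, nodes: int, processing_hours: float) -> float:
    """Batch prediction bills node-hours only while the job is processing."""
    return node_rate * nodes * processing_hours


if __name__ == "__main__":
    rate = 0.50  # placeholder node-hour rate in USD, not a published price
    print(f"online, 1 replica, full month: ${online_monthly_cost(rate):.2f}")
    print(f"batch, 4 nodes for 3 hours:    ${batch_cost(rate, 4, 3):.2f}")
```

Under these placeholder numbers, a workload that can tolerate offline scoring costs a small fraction of an always-on endpoint, which is why the serving mode matters more than the machine type for bursty or periodic jobs.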
Your team is building a document Q&A app over a 10K-document corpus. Initial design: every query sends the full document context (avg 50K tokens) to Gemini 1.5 Pro for generation. Cost projection: 100K queries/month × 50K input tokens × $1.25/M = $6,250/month. The estimator suggests two optimizations: (1) Context caching for the document context saves 75% on input tokens, about $4,690/month, leaving roughly $1,560/month. (2) Switching to Gemini 1.5 Flash for tier-1 questions and falling back to Pro for complex queries cuts the remaining cost by a further 60%. Final cost: ~$625/month, a 90% reduction from baseline.
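The scenario's arithmetic can be reproduced step by step. This sketch simply chains the percentages stated above: it assumes the 75% caching saving applies to all input tokens and that the 60% routing reduction applies to whatever cost remains. It ignores output tokens and any cache storage fees.

```python
QUERIES_PER_MONTH = 100_000
INPUT_TOKENS_PER_QUERY = 50_000
PRO_INPUT_RATE_PER_M = 1.25  # USD per 1M input tokens, from the scenario
CACHING_SAVINGS = 0.75       # fraction saved on input tokens
ROUTING_SAVINGS = 0.60       # further fraction saved by routing to Flash


def scenario_costs() -> dict:
    """Return monthly input-token cost at each stage of the scenario."""
    total_tokens = QUERIES_PER_MONTH * INPUT_TOKENS_PER_QUERY
    baseline = total_tokens * PRO_INPUT_RATE_PER_M / 1_000_000
    after_caching = baseline * (1 - CACHING_SAVINGS)
    after_routing = after_caching * (1 - ROUTING_SAVINGS)
    return {
        "baseline": baseline,
        "after_caching": after_caching,
        "after_routing": after_routing,
        "total_reduction": 1 - after_routing / baseline,
    }


if __name__ == "__main__":
    for stage, value in scenario_costs().items():
        print(f"{stage}: {value:,.2f}")
```

Under these assumptions the final figure comes out near $625/month, a 90% reduction from the $6,250 baseline.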
The estimator handles each Vertex AI service category separately: Gemini API (input tokens × input rate + output tokens × output rate, priced per 1M tokens by model), custom training (machine-hours × rate + accelerator-hours × rate by GPU/TPU type), online prediction (machine-hours × rate, billed continuously), and batch prediction (machine-hours × rate during processing only). It applies free tier allowances and provisioned throughput discounts where applicable.
Gemini 1.5 Flash is roughly 17x cheaper than Gemini 1.5 Pro for input tokens at standard context lengths ($0.075 vs. $1.25 per 1M tokens) and benchmarks competitively for most enterprise use cases. Always start with Flash and upgrade to Pro only when measurable quality issues appear. Many teams default to Pro 'just to be safe' and pay many times what they need to.
Context caching for repeated prefixes (e.g., long system prompts, retrieved RAG context) can reduce input token costs by 75%+. The minimum cached context is 32K tokens. For RAG architectures with consistent system prompts, this is a near-free optimization once enabled.
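A rough model of the caching saving separates the repeated prefix from the part of each prompt that changes. The sketch below is an assumption-laden illustration: it applies a flat 75% discount to the cached prefix, enforces the 32K-token minimum, and ignores cache storage charges and the cost of the initial cache write. The function name and example token counts are hypothetical.

```python
MIN_CACHEABLE_TOKENS = 32_000  # minimum cached context stated in this section
CACHED_DISCOUNT = 0.75         # assumed flat discount on cached prefix tokens


def input_cost_with_cache(prefix_tokens: int, dynamic_tokens: int,
                          requests: int, rate_per_m: float) -> float:
    """Estimate input cost in USD when a shared prefix is cached.

    prefix_tokens: tokens repeated in every request (system prompt, context).
    dynamic_tokens: tokens unique to each request (the user's question).
    Prefixes below the minimum size are billed at the full rate.
    """
    if prefix_tokens >= MIN_CACHEABLE_TOKENS:
        billable_prefix = prefix_tokens * (1 - CACHED_DISCOUNT)
    else:
        billable_prefix = prefix_tokens
    billable_tokens = (billable_prefix + dynamic_tokens) * requests
    return billable_tokens * rate_per_m / 1_000_000


if __name__ == "__main__":
    # 48K-token cached prefix plus a 2K-token question, 100K requests, $1.25/1M.
    print(f"${input_cost_with_cache(48_000, 2_000, 100_000, 1.25):,.2f}")
    # A 20K-token prefix is under the minimum, so nothing is discounted.
    print(f"${input_cost_with_cache(20_000, 2_000, 1_000, 1.25):,.2f}")
```

Because only the prefix is discounted, the overall saving approaches 75% only when the repeated prefix dominates the prompt.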
Online prediction endpoints bill for the deployed machine 24/7 even at zero QPS. A common mistake is deploying a model to a GPU endpoint for testing and forgetting to delete it. A single n1-standard-8 + T4 GPU left running for a month costs ~$400 with zero predictions served. Always set a reminder or alert to delete test prediction endpoints.
Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, and GCP are trademarks of their respective owners.