GCP Gemini & Vertex AI Guide
Guide to Gemini and Vertex AI covering the Gemini API, multimodal capabilities, grounding, agents, fine-tuning, model evaluation, and production deployment.
Prerequisites
- GCP project with Vertex AI API enabled
- Basic understanding of LLMs and prompt engineering
- Familiarity with GCP IAM and service accounts
Introduction to Gemini and Vertex AI
Google Cloud's AI platform centers on two pillars: Gemini, Google's most capable family of multimodal foundation models, and Vertex AI, the managed ML platform that hosts these models alongside tools for fine-tuning, evaluation, deployment, and orchestration. Together they provide a comprehensive stack for building AI applications on Google Cloud.
Gemini models are natively multimodal: a single model processes text, images, audio, video, and code without separate encoding pipelines. The family spans Flash models (fast and cost-effective) and Pro models (strongest reasoning and largest context). Vertex AI hosts these models and adds enterprise features: VPC Service Controls for network isolation, Customer-Managed Encryption Keys (CMEK), IAM-based access control, and integration with BigQuery, Cloud Storage, and other GCP services.
This guide covers the full Gemini on Vertex AI workflow: accessing the API, using multimodal capabilities, grounding responses with your data, building agents, fine-tuning models, evaluating quality, and deploying to production with proper security and monitoring.
Gemini API vs. Vertex AI API
Google offers Gemini through two APIs: the Gemini API (via Google AI Studio, using API keys, similar to OpenAI's direct API) and the Vertex AI API (using GCP project authentication and IAM). For production workloads, always use the Vertex AI API: it provides enterprise security, VPC Service Controls, data residency guarantees, and SLA coverage. The Gemini API is designed for prototyping and personal projects.
Setting Up Vertex AI
Vertex AI is enabled per GCP project. You need to enable the Vertex AI API, set up appropriate IAM permissions, and optionally configure a VPC Service Controls perimeter for production environments.
# Enable required APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable bigquery.googleapis.com
# Set your project and region
gcloud config set project my-ai-project
gcloud config set ai/region us-central1
# Grant Vertex AI user role
gcloud projects add-iam-policy-binding my-ai-project \
--member="user:developer@company.com" \
--role="roles/aiplatform.user"
# For service accounts (CI/CD, applications)
gcloud iam service-accounts create vertex-ai-app \
--display-name="Vertex AI Application"
gcloud projects add-iam-policy-binding my-ai-project \
--member="serviceAccount:vertex-ai-app@my-ai-project.iam.gserviceaccount.com" \
--role="roles/aiplatform.user"
# List available Gemini models
gcloud ai models list \
--region=us-central1 \
--filter="displayName~gemini" \
--format="table(displayName, name)"
# Test with a simple generateContent request via the REST API
# (gcloud ai endpoints predict expects a deployed endpoint ID, not a publisher model,
# so use the publisher model's generateContent method directly)
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://us-central1-aiplatform.googleapis.com/v1/projects/my-ai-project/locations/us-central1/publishers/google/models/gemini-2.0-flash:generateContent" \
-d '{
"contents": [{
"role": "user",
"parts": [{"text": "Explain Kubernetes pod affinity in one paragraph."}]
}]
}'
Calling the Gemini API with Python
The Vertex AI SDK for Python provides a high-level interface for interacting with Gemini models. It handles authentication, retry logic, and response parsing. The GenerativeModel class is the primary entry point.
import vertexai
from vertexai.generative_models import GenerativeModel, Part, SafetySetting, HarmCategory, HarmBlockThreshold
# Initialize Vertex AI
vertexai.init(project="my-ai-project", location="us-central1")
# Create a model instance
model = GenerativeModel(
model_name="gemini-2.0-flash",
system_instruction="You are a cloud infrastructure expert. Provide concise, technical answers with code examples when appropriate.",
)
# Simple text generation
response = model.generate_content(
"What are the best practices for organizing GCP projects and folders?",
generation_config={
"temperature": 0.3,
"top_p": 0.95,
"top_k": 40,
"max_output_tokens": 1024,
},
)
print(response.text)
print(f"Tokens: {response.usage_metadata.prompt_token_count} in, "
f"{response.usage_metadata.candidates_token_count} out")
# Multi-turn conversation using Chat
chat = model.start_chat(history=[])
response1 = chat.send_message("How do I set up a GKE cluster with Workload Identity?")
print(response1.text)
response2 = chat.send_message("Now show me how to deploy a pod that uses that Workload Identity.")
print(response2.text)
# Streaming response
response = model.generate_content(
"Write a Terraform module for a GCP Cloud Run service with a custom domain.",
stream=True,
)
for chunk in response:
print(chunk.text, end="", flush=True)
Multimodal Capabilities
Gemini's native multimodal understanding is its key differentiator. You can send images, PDFs, audio files, and even video alongside text prompts. The model processes all modalities together, enabling use cases like visual question answering, document analysis, code screenshot interpretation, and video summarization.
from vertexai.generative_models import GenerativeModel, Part, Image
model = GenerativeModel("gemini-2.0-flash")
# Analyze an architecture diagram from Cloud Storage
image_part = Part.from_uri(
uri="gs://my-bucket/architecture-diagrams/vpc-topology.png",
mime_type="image/png",
)
response = model.generate_content([
image_part,
"Analyze this network architecture diagram. Identify potential security issues, "
"suggest improvements, and estimate the monthly cost of the shown infrastructure."
])
print(response.text)
# Analyze a PDF document
pdf_part = Part.from_uri(
uri="gs://my-bucket/documents/incident-report.pdf",
mime_type="application/pdf",
)
response = model.generate_content([
pdf_part,
"Summarize this incident report. Extract: root cause, impact, timeline, "
"and remediation steps. Format as a structured summary."
])
print(response.text)
# Analyze video content
video_part = Part.from_uri(
uri="gs://my-bucket/recordings/deployment-demo.mp4",
mime_type="video/mp4",
)
response = model.generate_content([
video_part,
"Watch this deployment walkthrough video. Create step-by-step documentation "
"that someone could follow to reproduce this deployment."
])
print(response.text)
# Multiple images: compare before/after
before = Part.from_uri("gs://my-bucket/dashboards/before.png", "image/png")
after = Part.from_uri("gs://my-bucket/dashboards/after.png", "image/png")
response = model.generate_content([
"Compare these two monitoring dashboard screenshots. ",
"Before:", before,
"After:", after,
"What metrics changed? Are any changes concerning?"
])
print(response.text)
Multimodal Token Costs
Images are tokenized based on their resolution. A 1024x1024 image costs approximately 258 tokens. Video is processed at 1 frame per second, so a 60-second video uses about 15,000 tokens. Audio is tokenized at approximately 32 tokens per second. Always consider these costs when designing multimodal pipelines, and resize images to the minimum resolution needed for your use case.
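A quick sanity check before sending a request is to apply these rule-of-thumb rates. The sketch below uses only the figures stated above (~258 tokens per image, ~250 per second of video, ~32 per second of audio); for exact counts, the SDK offers a count_tokens method.

```python
# Rough multimodal token estimator using the rule-of-thumb rates above.
# For exact counts, call model.count_tokens(...) before sending the request.
IMAGE_TOKENS = 258          # ~258 tokens per 1024x1024 image
VIDEO_TOKENS_PER_SEC = 250  # 60s of video ~ 15,000 tokens
AUDIO_TOKENS_PER_SEC = 32   # ~32 tokens per second of audio

def estimate_multimodal_tokens(images=0, video_seconds=0, audio_seconds=0, text_tokens=0):
    """Approximate prompt token count for a mixed-modality request."""
    return (images * IMAGE_TOKENS
            + video_seconds * VIDEO_TOKENS_PER_SEC
            + audio_seconds * AUDIO_TOKENS_PER_SEC
            + text_tokens)

# Two screenshots, a 60-second clip, and ~200 tokens of prompt text:
print(estimate_multimodal_tokens(images=2, video_seconds=60, text_tokens=200))  # 15716
```

Running an estimate like this on a representative prompt often reveals that resizing or trimming media cuts costs far more than tuning text prompts.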
Grounding and RAG
Grounding connects Gemini to external data sources so responses are based on current, factual information rather than the model's training data. Vertex AI supports multiple grounding strategies: Google Search grounding (for up-to-date web information), Vertex AI Search grounding (for your own document corpus), and custom RAG with your own vector store.
from vertexai.generative_models import GenerativeModel, Tool, grounding
model = GenerativeModel("gemini-2.0-flash")
# Grounding with Google Search
search_tool = Tool.from_google_search_retrieval(
grounding.GoogleSearchRetrieval()
)
response = model.generate_content(
"What are the latest GCP pricing changes announced this month?",
tools=[search_tool],
)
print(response.text)
# Check grounding metadata
if response.candidates[0].grounding_metadata:
for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
print(f"Source: {chunk.web.title} - {chunk.web.uri}")
# Grounding with Vertex AI Search (your own data)
from vertexai.preview.generative_models import grounding as grounding_preview
# First, create a Vertex AI Search data store and app
# (This is done in the console or via the Discovery Engine API)
search_tool = Tool.from_retrieval(
grounding_preview.Retrieval(
grounding_preview.VertexAISearch(
datastore="projects/my-ai-project/locations/global/collections/default_collection/dataStores/my-docs-datastore"
)
)
)
response = model.generate_content(
"What is our SLA for the payments API?",
tools=[search_tool],
)
print(response.text)
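The custom Vector Search example below consumes precomputed embeddings from gs://my-bucket/embeddings/. As a minimal, hedged sketch (the helper name and file layout are illustrative), you can embed your document chunks with a Vertex AI embedding model and write them in the JSON-lines shape Vector Search ingests ({"id": ..., "embedding": [...]}):

```python
import json

def to_index_records(ids, vectors):
    """Format (id, vector) pairs as Vector Search JSON lines."""
    return [json.dumps({"id": i, "embedding": list(v)}) for i, v in zip(ids, vectors)]

# Producing the vectors requires GCP credentials, for example:
#   from vertexai.language_models import TextEmbeddingModel
#   emb_model = TextEmbeddingModel.from_pretrained("text-embedding-005")
#   vectors = [e.values for e in emb_model.get_embeddings(chunks)]
# Write the records to a file and upload it under gs://my-bucket/embeddings/.

print(to_index_records(["doc-1"], [[0.1, 0.2, 0.3]])[0])
```

Keep chunk IDs stable across re-indexing runs so you can map search results back to source documents.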
# Custom RAG with Vector Search
from google.cloud import aiplatform
# Create an index for vector search
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
display_name="docs-vector-index",
contents_delta_uri="gs://my-bucket/embeddings/",
dimensions=768,
approximate_neighbors_count=10,
distance_measure_type="DOT_PRODUCT_DISTANCE",
)
# Create an endpoint and deploy the index
my_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
display_name="docs-search-endpoint",
public_endpoint_enabled=True,
)
my_endpoint.deploy_index(
index=my_index,
deployed_index_id="docs-deployed-index",
machine_type="e2-standard-2",
min_replica_count=1,
max_replica_count=2,
)
Building Agents with Vertex AI
Vertex AI Agent Builder lets you create AI agents that can use tools, access data stores, and maintain conversation context. Agents combine the reasoning capabilities of Gemini with the ability to take actions through function calling.
from vertexai.generative_models import GenerativeModel, FunctionDeclaration, Tool
# Define tools the agent can use
get_instances = FunctionDeclaration(
name="get_gce_instances",
description="Lists Google Compute Engine instances in a project",
parameters={
"type": "object",
"properties": {
"project_id": {
"type": "string",
"description": "GCP project ID"
},
"zone": {
"type": "string",
"description": "Compute zone (e.g., us-central1-a)"
},
"status": {
"type": "string",
"enum": ["RUNNING", "STOPPED", "TERMINATED"],
"description": "Filter by instance status"
}
},
"required": ["project_id"]
}
)
get_metrics = FunctionDeclaration(
name="get_monitoring_metrics",
description="Retrieves Cloud Monitoring metrics for a resource",
parameters={
"type": "object",
"properties": {
"project_id": {"type": "string"},
"metric_type": {
"type": "string",
"description": "Metric type (e.g., compute.googleapis.com/instance/cpu/utilization)"
},
"resource_name": {"type": "string"},
"duration_minutes": {"type": "integer", "default": 60}
},
"required": ["project_id", "metric_type", "resource_name"]
}
)
# Create a tool with both functions
ops_tool = Tool(function_declarations=[get_instances, get_metrics])
# Initialize the model with tools
model = GenerativeModel(
"gemini-2.0-flash",
system_instruction="You are a GCP operations assistant. Use the provided tools to look up real infrastructure data before answering questions.",
tools=[ops_tool],
)
chat = model.start_chat()
# The model will decide when to call tools
response = chat.send_message(
"What instances are running in project 'prod-web' and what is their CPU usage?"
)
# Check for function calls in the response
for candidate in response.candidates:
for part in candidate.content.parts:
if part.function_call:
fn_name = part.function_call.name
fn_args = dict(part.function_call.args)
print(f"Agent wants to call: {fn_name}({fn_args})")
# Execute the function and return results
# (Your implementation here)
if fn_name == "get_gce_instances":
result = {
"instances": [
{"name": "web-server-1", "status": "RUNNING", "zone": "us-central1-a"},
{"name": "web-server-2", "status": "RUNNING", "zone": "us-central1-b"},
]
}
# Send function response back to the model
from vertexai.generative_models import Part
response = chat.send_message(
Part.from_function_response(
name=fn_name,
response={"result": result}
)
)
print(response.text)
Fine-Tuning Gemini Models
Vertex AI supports supervised fine-tuning of Gemini models to improve performance on domain-specific tasks. Fine-tuning adjusts the model's weights using your labeled training data, resulting in better accuracy, more consistent formatting, and reduced prompt engineering effort. Fine-tuning is available for Gemini 1.5 Flash and Pro models.
from vertexai.preview.tuning import sft
# Prepare training data in JSONL format
# Each line: {"contents": [{"role": "user", "parts": [{"text": "..."}]}, {"role": "model", "parts": [{"text": "..."}]}]}
# (an optional "systemInstruction" field can carry the system prompt; check the tuning docs for your model version)
# Upload training data to Cloud Storage
# gsutil cp training_data.jsonl gs://my-bucket/fine-tuning/training_data.jsonl
# Start a supervised fine-tuning job
tuning_job = sft.train(
source_model="gemini-1.5-flash-002",
train_dataset="gs://my-bucket/fine-tuning/training_data.jsonl",
validation_dataset="gs://my-bucket/fine-tuning/validation_data.jsonl",
epochs=3,
adapter_size=4, # LoRA rank: 1, 4, 8, or 16
learning_rate_multiplier=1.0,
tuned_model_display_name="support-classifier-v1",
)
# Monitor the job
print(f"Job: {tuning_job.resource_name}")
print(f"Status: {tuning_job.state}")
# Wait for completion (can take hours)
tuning_job.wait()
# Use the fine-tuned model
tuned_model = GenerativeModel(tuning_job.tuned_model_endpoint_name)
response = tuned_model.generate_content("Classify this support ticket: ...")
print(response.text)
Fine-Tuning Data Requirements
Effective fine-tuning requires at least 100 high-quality training examples (500+ recommended). The training data must be representative of your actual use case. Poor-quality or biased training data will degrade model performance. Always include a validation dataset (10-20% of training data) to monitor for overfitting. Fine-tuning costs approximately $4/hour for Flash and $8/hour for Pro models.
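A hedged sketch of preparing that data: hold out a validation split and write both JSONL files before uploading to Cloud Storage. The split fraction and the per-line schema here are assumptions to adapt; verify the exact schema against the current tuning documentation for your model version.

```python
import json
import random

def make_example(user_text, model_text):
    """One supervised tuning example in a Gemini-style 'contents' shape.
    (Assumed schema -- verify against the tuning docs for your model version.)"""
    return {"contents": [
        {"role": "user", "parts": [{"text": user_text}]},
        {"role": "model", "parts": [{"text": model_text}]},
    ]}

def split_and_write(examples, train_path, val_path, validation_fraction=0.15, seed=42):
    """Shuffle, hold out a validation split, and write both JSONL files."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    splits = {train_path: shuffled[:cut], val_path: shuffled[cut:]}
    for path, subset in splits.items():
        with open(path, "w") as f:
            for ex in subset:
                f.write(json.dumps(ex) + "\n")
    return len(splits[train_path]), len(splits[val_path])

examples = [make_example(f"Classify ticket {i}", "billing") for i in range(100)]
n_train, n_val = split_and_write(examples, "training_data.jsonl", "validation_data.jsonl")
print(n_train, n_val)  # 85 15
```

The resulting files can then be uploaded with gsutil and referenced as train_dataset and validation_dataset in the tuning job.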
Model Evaluation
Vertex AI provides built-in evaluation tools to measure model quality across different metrics depending on your task type. For generation tasks, it measures ROUGE, BLEU, and fluency. For classification, it measures precision, recall, and F1. For summarization, it provides extractive and abstractive quality scores. You can also run custom evaluations using LLM-as-judge patterns.
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
# Define an evaluation task
eval_task = EvalTask(
dataset=[
{
"prompt": "What is the maximum number of VPCs per GCP project?",
"reference": "The default VPC quota is 15 per project, but this can be increased by requesting a quota increase.",
},
{
"prompt": "How do you enable VPC Flow Logs?",
"reference": "Enable VPC Flow Logs on a subnet by setting the --enable-flow-logs flag when creating or updating the subnet using gcloud compute networks subnets update.",
},
],
metrics=[
"rouge_l_sum",
"bleu",
MetricPromptTemplateExamples.Pointwise.FLUENCY,
MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
],
experiment="model-quality-eval",
)
# Run evaluation against a model
result = eval_task.evaluate(
model=GenerativeModel("gemini-2.0-flash"),
)
print(result.summary_metrics)
print(result.metrics_table)
Safety Settings and Content Filtering
Gemini includes configurable safety filters that block content across categories: harassment, hate speech, sexually explicit content, and dangerous content. Each category can be set to block at different threshold levels. For enterprise applications, you should configure these based on your use case and compliance requirements.
from vertexai.generative_models import (
GenerativeModel,
SafetySetting,
HarmCategory,
HarmBlockThreshold,
)
# Configure safety settings
safety_settings = [
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HARASSMENT,
threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
),
]
model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
"Explain network penetration testing methodology",
safety_settings=safety_settings,
)
# Check if response was blocked
if response.candidates[0].finish_reason.name == "SAFETY":
print("Response blocked by safety filter")
for rating in response.candidates[0].safety_ratings:
print(f" {rating.category.name}: {rating.probability.name}")
else:
print(response.text)
Monitoring and Cost Optimization
Vertex AI publishes metrics to Cloud Monitoring and logs to Cloud Logging. Monitor prediction latency, error rates, and token usage to optimize performance and costs.
# View Vertex AI prediction metrics
gcloud monitoring metrics list \
--filter="metric.type = starts_with(\"aiplatform.googleapis.com/prediction\")"
# Create an alert for high latency
gcloud monitoring policies create \
--display-name="Gemini High Latency Alert" \
--condition-display-name="Prediction latency > 5s" \
--condition-filter='resource.type="aiplatform.googleapis.com/Endpoint" AND metric.type="aiplatform.googleapis.com/prediction/online/prediction_latencies"' \
--condition-threshold-value=5000 \
--condition-threshold-duration=300s \
--notification-channels="projects/my-project/notificationChannels/123"
# Query usage logs in Cloud Logging
gcloud logging read 'resource.type="aiplatform.googleapis.com/Endpoint"
AND jsonPayload.modelId="gemini-2.0-flash"' \
--limit=10 \
--format="table(timestamp, jsonPayload.inputTokenCount, jsonPayload.outputTokenCount)"
# Export usage data to BigQuery for cost analysis
gcloud logging sinks create vertex-ai-usage-sink \
bigquery.googleapis.com/projects/my-project/datasets/ai_usage \
--log-filter='resource.type="aiplatform.googleapis.com/Endpoint"'
Model Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Gemini 2.0 Flash | $0.10 | $0.40 | Most tasks, best price/performance |
| Gemini 1.5 Pro | $1.25 | $5.00 | Complex reasoning, large context |
| Gemini 1.5 Flash | $0.075 | $0.30 | High volume, simple tasks |
| text-embedding-005 | $0.00001 | N/A | Embeddings (extremely cheap) |
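The table translates directly into a small cost calculator for comparing models on a given workload. The prices below are the ones listed above; verify against current pricing before relying on them.

```python
# Per-1M-token prices from the table above -- verify against current pricing.
PRICING = {
    "gemini-2.0-flash": {"input": 0.10,  "output": 0.40},
    "gemini-1.5-pro":   {"input": 1.25,  "output": 5.00},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 requests at 2,000 input / 500 output tokens each:
flash_total = 10_000 * request_cost("gemini-2.0-flash", 2_000, 500)
pro_total = 10_000 * request_cost("gemini-1.5-pro", 2_000, 500)
print(f"Flash: ${flash_total:.2f}, Pro: ${pro_total:.2f}")  # Flash: $4.00, Pro: $50.00
```

Running this kind of estimate against real token counts from usage_metadata makes the Flash-vs-Pro trade-off concrete for your workload.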
Context Caching for Cost Reduction
Gemini supports context caching, which stores frequently used context (system prompts, large documents) and reuses it across requests. Cached input tokens cost 75% less than regular input tokens. This is particularly effective for RAG applications where the same document context is used across many user queries. Set a minimum cache TTL of 5 minutes.
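The saving is easy to quantify. Assuming the 75% discount on cached input tokens described above (and ignoring cache storage fees, which add a small time-based cost), a sketch:

```python
def caching_comparison(context_tokens, query_tokens, requests, input_price_per_m=0.10):
    """Input-token cost with and without context caching.
    Assumes cached tokens bill at 25% of the normal input rate (the 75%
    discount above); cache storage fees are ignored for simplicity."""
    uncached = requests * (context_tokens + query_tokens) * input_price_per_m / 1e6
    cached = requests * (context_tokens * 0.25 + query_tokens) * input_price_per_m / 1e6
    return uncached, cached

# A 50k-token document context reused across 1,000 queries at Flash input pricing:
uncached, cached = caching_comparison(context_tokens=50_000, query_tokens=200, requests=1_000)
print(f"${uncached:.2f} without caching vs ${cached:.2f} with caching")
```

The larger the shared context and the more requests reuse it, the closer the saving approaches the full 75%; in the SDK, cached content is managed through the preview caching module.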
Next Steps
Start by exploring Vertex AI Studio in the GCP console, which provides an interactive playground for testing Gemini models with different prompts, parameters, and grounding configurations. Then build a simple application using the Python SDK, add grounding with Vertex AI Search for RAG, and implement proper monitoring before moving to production.
For advanced use cases, explore the Model Garden which provides access to over 150 open-source and third-party models (Llama, Mistral, Stable Diffusion) that you can deploy on Vertex AI alongside Gemini. Consider using Vertex AI Pipelines for ML workflows and Vertex AI Experiments for systematic model evaluation.
Key Takeaways
1. Gemini models are natively multimodal, processing text, images, audio, video, and PDFs in a single model.
2. Use the Vertex AI API (not the Gemini API) for production workloads to get enterprise security and SLA coverage.
3. Grounding connects Gemini to Google Search or your own data via Vertex AI Search for factual responses.
4. Function calling enables agents that can interact with external systems and APIs.
5. Fine-tuning with LoRA adapters requires as few as 100 examples and runs on Vertex AI infrastructure.
6. Context caching reduces costs by 75% for frequently used system prompts and documents.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.