
AWS Bedrock: Building AI Applications

Comprehensive guide to Amazon Bedrock covering model access, agents, knowledge bases, RAG, guardrails, fine-tuning, and production architecture patterns.

CloudToolStack Team · 25 min read · Published Mar 14, 2026


Introduction to AWS Bedrock

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. Rather than training models from scratch or managing GPU infrastructure, Bedrock lets you build generative AI applications by selecting a model, customizing it with your data, and integrating it into your application using familiar AWS tools. Bedrock supports models from Anthropic (Claude), Amazon (Titan), Meta (Llama), Mistral, Cohere, Stability AI, and AI21 Labs.

What makes Bedrock compelling for enterprise teams is its integration with the broader AWS ecosystem. Your data never leaves your AWS account, models are accessed through private VPC endpoints, and you can use IAM policies to control who can invoke which models. Bedrock also provides built-in features for responsible AI, including Guardrails to filter harmful content, and Knowledge Bases for retrieval-augmented generation (RAG) that ground model responses in your own data.

This guide covers everything you need to go from zero to production with Bedrock: enabling model access, invoking models programmatically, building conversational agents, setting up knowledge bases for RAG, configuring guardrails, and optimizing costs. Every section includes working CLI commands and code samples you can run immediately.

Bedrock Pricing Model

Bedrock uses pay-per-use pricing based on input and output tokens. There are no upfront commitments or minimum fees. For example, Claude 3.5 Sonnet costs $3.00 per million input tokens and $15.00 per million output tokens. You can also purchase Provisioned Throughput for predictable workloads at a discount. Always check the current pricing page since model costs change frequently.
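To make the per-token math concrete, here is a quick sketch using the Claude 3.5 Sonnet rates quoted above (always confirm current rates on the pricing page before budgeting):

```python
# Back-of-envelope cost model for on-demand Bedrock pricing.
# The rates are the Claude 3.5 Sonnet figures quoted above, per million tokens.

SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single on-demand invocation."""
    return (input_tokens / 1_000_000) * SONNET_INPUT_PER_M + \
           (output_tokens / 1_000_000) * SONNET_OUTPUT_PER_M

# A typical RAG request: ~4K tokens of retrieved context, a 500-token answer
print(f"${request_cost(4_000, 500):.4f}")  # $0.0195
```

At that rate, a million such requests per month is roughly $19,500, which is why model selection and prompt size matter so much at scale.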

Enabling Model Access

Before you can invoke any foundation model, you must explicitly request access in the Bedrock console. This is a one-time setup per model per region. AWS requires this step because some model providers have specific usage policies and terms of service you must accept.

Navigate to the Amazon Bedrock console, select "Model access" from the left sidebar, and click "Manage model access." Check the boxes for the models you want to use and submit the request. Most models are approved instantly, but some (like certain Anthropic or Meta models) may require a brief review period.

bash
# List all available foundation models
aws bedrock list-foundation-models \
  --query 'modelSummaries[].{Provider:providerName, Model:modelId, Input:inputModalities, Output:outputModalities}' \
  --output table

# List models that are active (not deprecated) in this region
aws bedrock list-foundation-models \
  --query 'modelSummaries[?modelLifecycle.status==`ACTIVE`].{Model:modelId, Provider:providerName}' \
  --output table

# Get details about a specific model
aws bedrock get-foundation-model \
  --model-identifier anthropic.claude-3-5-sonnet-20241022-v2:0

Region Availability

Not all models are available in every AWS region. Claude models are generally available in us-east-1, us-west-2, and eu-west-1. Check the Bedrock documentation for the latest region-model availability matrix. If you need a specific model, ensure you are working in a supported region before building your application.
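One way to check availability programmatically is to query each candidate region for the model ID you need. A small sketch (the `client_factory` parameter exists only to make the helper testable without AWS credentials):

```python
def regions_with_model(model_id, regions, client_factory=None):
    """Return the subset of regions where model_id appears in the model list."""
    if client_factory is None:
        import boto3  # deferred so the helper imports without AWS setup
        client_factory = lambda r: boto3.client("bedrock", region_name=r)
    available = []
    for region in regions:
        try:
            models = client_factory(region).list_foundation_models()["modelSummaries"]
            if any(m["modelId"] == model_id for m in models):
                available.append(region)
        except Exception as exc:  # the region may not offer Bedrock at all
            print(f"{region}: {exc}")
    return available

# regions_with_model("anthropic.claude-3-5-sonnet-20241022-v2:0",
#                    ["us-east-1", "us-west-2", "eu-west-1"])
```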

Invoking Models with the API

Bedrock provides two primary APIs for model invocation: InvokeModel, which requires each provider's native request format, and Converse, which offers a unified conversational interface. The Converse API is the recommended approach because it works identically across all models, handles message formatting automatically, and supports tool use (function calling) natively.

bash
# Simple invocation using the Converse API (recommended)
aws bedrock-runtime converse \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --messages '[{
    "role": "user",
    "content": [{"text": "Explain the difference between S3 and EBS in two sentences."}]
  }]' \
  --inference-config '{"maxTokens": 256, "temperature": 0.3}'

# Stream responses for real-time output
aws bedrock-runtime converse-stream \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --messages '[{
    "role": "user",
    "content": [{"text": "Write a Python function to list all S3 buckets."}]
  }]' \
  --inference-config '{"maxTokens": 1024, "temperature": 0.2}'

Python SDK Integration

python
import boto3
import json

# Initialize the Bedrock Runtime client
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Using the Converse API (recommended for all models)
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[
        {
            "role": "user",
            "content": [
                {"text": "What are the top 3 AWS services for building serverless applications?"}
            ],
        }
    ],
    inferenceConfig={
        "maxTokens": 512,
        "temperature": 0.3,
        "topP": 0.9,
    },
    system=[{"text": "You are a helpful AWS solutions architect. Be concise and specific."}],
)

# Extract the response text
output_message = response["output"]["message"]
print(output_message["content"][0]["text"])

# Check token usage
usage = response["usage"]
print(f"Input tokens: {usage['inputTokens']}, Output tokens: {usage['outputTokens']}")

Use the Converse API, Not InvokeModel

The older InvokeModel API requires model-specific request/response formatting (each provider has a different JSON schema). The Converse API uses a unified format across all models, making it easy to switch between providers without changing your code. Always prefer Converse for new projects.
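To see the difference, compare the request shapes for the same one-turn prompt. The first body is the Anthropic Messages format that InvokeModel requires for Claude models; a Titan or Llama model would need a completely different schema:

```python
# InvokeModel: you hand-build the provider's native JSON schema.
anthropic_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}],
}
# bedrock.invoke_model(modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
#                      body=json.dumps(anthropic_body))

# Converse: one message shape for every model.
converse_messages = [{"role": "user", "content": [{"text": "Hello"}]}]
# bedrock.converse(modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
#                  messages=converse_messages,
#                  inferenceConfig={"maxTokens": 256})
```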

Building Conversational Agents

Bedrock Agents extend foundation models with the ability to take actions. An agent can break down a user request into steps, call external APIs or Lambda functions to gather information, and synthesize a final response. This is essential for building applications that go beyond simple question-answering, such as customer service bots that can look up orders, IT assistants that can query monitoring systems, or data analysts that can run SQL queries.

Agents use a technique called ReAct (Reasoning and Acting) where the model reasons about what to do, executes an action, observes the result, and repeats until the task is complete. You define the available actions as Action Groups, each backed by a Lambda function or an OpenAPI schema.

python
import boto3
import json

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Create an agent
response = bedrock_agent.create_agent(
    agentName="cloud-ops-assistant",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",
    foundationModel="anthropic.claude-3-5-sonnet-20241022-v2:0",
    instruction="""You are a cloud operations assistant. You help engineers
    check the status of their AWS resources, look up CloudWatch metrics,
    and provide recommendations for cost optimization. Always verify
    information before making recommendations.""",
    idleSessionTTLInSeconds=600,
)

agent_id = response["agent"]["agentId"]
print(f"Agent created: {agent_id}")

# Define an action group with an OpenAPI schema
bedrock_agent.create_agent_action_group(
    agentId=agent_id,
    agentVersion="DRAFT",
    actionGroupName="resource-lookup",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-east-1:123456789012:function:resource-lookup"
    },
    apiSchema={
        "payload": json.dumps({
            "openapi": "3.0.0",
            "info": {"title": "Resource Lookup API", "version": "1.0"},
            "paths": {
                "/instances": {
                    "get": {
                        "summary": "List EC2 instances",
                        "operationId": "listInstances",
                        "parameters": [
                            {
                                "name": "state",
                                "in": "query",
                                "schema": {"type": "string"},
                                "description": "Filter by instance state (running, stopped)"
                            }
                        ],
                        "responses": {"200": {"description": "List of instances"}}
                    }
                }
            }
        })
    },
)

# Prepare and create an agent alias for invocation
# Prepare the agent (asynchronous - poll until the status is PREPARED)
import time

bedrock_agent.prepare_agent(agentId=agent_id)
while bedrock_agent.get_agent(agentId=agent_id)["agent"]["agentStatus"] != "PREPARED":
    time.sleep(5)

# Create an agent alias for invocation
bedrock_agent.create_agent_alias(
    agentId=agent_id,
    agentAliasName="production"
)
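Once the alias exists, the agent is invoked through the bedrock-agent-runtime service, which returns an event stream of answer chunks. A minimal sketch (the `agentAliasId` comes from the `create_agent_alias` response; the `client` parameter exists only to make the helper testable):

```python
import uuid

def ask_agent(agent_id, alias_id, question, client=None):
    """Invoke a prepared agent alias and assemble the streamed answer."""
    if client is None:
        import boto3  # deferred so the helper imports without AWS setup
        client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=str(uuid.uuid4()),  # reuse the same id to keep conversation state
        inputText=question,
    )
    # The response is an event stream; "chunk" events carry UTF-8 text bytes
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )
```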

Knowledge Bases and RAG

Retrieval-Augmented Generation (RAG) is the most important pattern for enterprise AI applications. Instead of relying solely on a model's training data, RAG retrieves relevant documents from your own data sources and includes them in the prompt. This grounds the model's responses in your actual documentation, reducing hallucinations and ensuring answers reflect your organization's specific context.

Bedrock Knowledge Bases automate the entire RAG pipeline: ingesting documents from S3, chunking them into appropriate segments, generating vector embeddings, storing them in a vector database, and retrieving relevant chunks at query time. Supported vector stores include Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL (with pgvector), Pinecone, and Redis Enterprise Cloud.

python
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Create a knowledge base with OpenSearch Serverless
response = bedrock_agent.create_knowledge_base(
    name="company-docs-kb",
    description="Internal documentation and runbooks",
    roleArn="arn:aws:iam::123456789012:role/BedrockKBRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:123456789012:collection/abc123",
            "fieldMapping": {
                "metadataField": "metadata",
                "textField": "text",
                "vectorField": "vector",
            },
            "vectorIndexName": "company-docs-index",
        },
    },
)

kb_id = response["knowledgeBase"]["knowledgeBaseId"]

# Add an S3 data source
ds_response = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="docs-s3-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::company-documentation",
            "inclusionPrefixes": ["runbooks/", "architecture/", "guides/"],
        },
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,
                "overlapPercentage": 15,
            },
        }
    },
)

# Start ingestion (sync documents into the vector store)
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_response["dataSource"]["dataSourceId"],
)

Querying the Knowledge Base

python
# Query the knowledge base directly (retrieve + generate)
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our disaster recovery procedure for the payments service?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID",
                }
            },
        },
    },
)

print(response["output"]["text"])

# Print source citations
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        source = ref["location"]["s3Location"]["uri"]
        print(f"  Source: {source}")

Chunking Strategy Matters

The chunking strategy significantly impacts RAG quality. Fixed-size chunking (300-500 tokens with 10-20% overlap) works for general documents. For structured documents like API references or runbooks, use semantic chunking which splits on natural boundaries like headings and paragraphs. Bedrock also supports hierarchical chunking that creates parent-child relationships between chunks for better context retrieval.
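For reference, the semantic and hierarchical strategies mentioned above are also configured through `create_data_source`'s `vectorIngestionConfiguration`. The field names below follow my reading of the CreateDataSource API; verify them against the current boto3 documentation before use:

```python
# Semantic chunking: split on meaning shifts rather than fixed token counts
semantic_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # cap per chunk
            "bufferSize": 1,                      # sentences of surrounding context
            "breakpointPercentileThreshold": 95,  # split only at large semantic shifts
        },
    }
}

# Hierarchical chunking: small child chunks are embedded for retrieval,
# larger parent chunks are returned for context
hierarchical_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
            "overlapTokens": 60,
        },
    }
}
```

Either dict can be passed as the `vectorIngestionConfiguration` argument in place of the fixed-size example shown earlier.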

Guardrails for Responsible AI

Bedrock Guardrails let you implement safeguards for your generative AI applications. Guardrails sit between the user and the model, filtering both inputs (prompts) and outputs (responses) based on configurable policies. You can block harmful content, filter sensitive information like PII, enforce topic boundaries, and apply custom word filters.

Guardrails are essential for production applications where you need to ensure the model does not discuss off-topic subjects, reveal sensitive data, or generate inappropriate content. You define a guardrail once and apply it to any model invocation.

python
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create a guardrail
response = bedrock.create_guardrail(
    name="customer-service-guardrail",
    description="Guardrail for customer-facing AI assistant",
    # Content filters for harmful categories
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "SEXUAL", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "MISCONDUCT", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    # Block specific topics
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "competitor-discussion",
                "definition": "Discussions comparing our products to competitors or recommending competitor products",
                "examples": [
                    "Is CompetitorX better than our product?",
                    "Should I switch to CompetitorY?",
                ],
                "type": "DENY",
            },
            {
                "name": "financial-advice",
                "definition": "Providing specific financial, investment, or tax advice",
                "examples": [
                    "Should I invest in stocks?",
                    "What tax deductions can I claim?",
                ],
                "type": "DENY",
            },
        ]
    },
    # Filter PII from inputs and outputs
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
            {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
        ]
    },
    # Custom blocked words
    wordPolicyConfig={
        "wordsConfig": [
            {"text": "internal-secret-project"},
        ],
        "managedWordListsConfig": [
            {"type": "PROFANITY"},
        ],
    },
    blockedInputMessaging="I cannot process this request. Please rephrase without sensitive information.",
    blockedOutputMessaging="I cannot provide a response to this query. Please ask something else.",
)

guardrail_id = response["guardrailId"]

# Use the guardrail with model invocation
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[
        {"role": "user", "content": [{"text": "Help me with my account question"}]}
    ],
    guardrailConfig={
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": "DRAFT",
    },
)
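Guardrails can also be evaluated standalone through the ApplyGuardrail API, which is handy for unit-testing your policies against known-bad inputs without spending model tokens. A sketch (the `client` parameter exists only to make the helper testable):

```python
def check_input(guardrail_id, text, client=None):
    """Run a guardrail against arbitrary text without invoking a model."""
    if client is None:
        import boto3  # deferred so the helper imports without AWS setup
        client = boto3.client("bedrock-runtime", region_name="us-east-1")
    result = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="INPUT",  # use "OUTPUT" to test model responses instead
        content=[{"text": {"text": text}}],
    )
    return result["action"]  # "NONE" or "GUARDRAIL_INTERVENED"

# check_input(guardrail_id, "My SSN is 123-45-6789")  # expect an intervention
```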

Model Customization and Fine-Tuning

While foundation models are powerful out of the box, you can improve their performance on domain-specific tasks through customization. Bedrock supports two customization approaches: fine-tuning (training the model on your labeled data to adjust its weights) and continued pre-training (exposing the model to unlabeled domain-specific text to expand its knowledge).

Fine-tuning is best when you need the model to follow a specific output format, adopt a particular writing style, or improve accuracy on a narrow task. Continued pre-training is useful when the model lacks knowledge about your domain (for example, proprietary technical terminology or internal processes).

python
# Prepare training data in JSONL format
# Each line: {"prompt": "...", "completion": "..."}
# Upload to S3: s3://my-bucket/training-data/fine-tune.jsonl

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create a fine-tuning job
response = bedrock.create_model_customization_job(
    jobName="support-classifier-v1",
    customModelName="support-ticket-classifier",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={
        "s3Uri": "s3://my-bucket/training-data/fine-tune.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://my-bucket/model-output/"
    },
    hyperParameters={
        "epochCount": "3",
        "batchSize": "8",
        "learningRate": "0.00001",
        "learningRateWarmupSteps": "10",
    },
)

job_arn = response["jobArn"]
print(f"Fine-tuning job started: {job_arn}")

# Monitor the job
status = bedrock.get_model_customization_job(jobIdentifier=job_arn)
print(f"Status: {status['status']}")  # InProgress, Completed, Failed

# Once complete, create a provisioned throughput for the custom model
# (Required to use custom models - on-demand is not available)
bedrock.create_provisioned_model_throughput(
    provisionedModelName="support-classifier-pt",
    modelId="arn:aws:bedrock:us-east-1:123456789012:custom-model/support-ticket-classifier",
    modelUnits=1,
    commitmentDuration="OneMonth",  # or SixMonths for a discount
)

Fine-Tuning Costs

Fine-tuning incurs training costs based on the number of tokens processed and the model used. Custom models also require Provisioned Throughput to invoke (no on-demand pricing), starting at approximately $1,800/month for one model unit. Fine-tuning is most cost-effective when you have a high-volume use case where improved accuracy justifies the infrastructure cost. For many use cases, prompt engineering or RAG with a Knowledge Base is more cost-effective.
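A rough way to sanity-check that tradeoff is a break-even calculation: at what monthly volume does the fixed Provisioned Throughput cost undercut prompting an on-demand model? The per-request figure below is an illustrative assumption, not a quoted price:

```python
# Break-even sketch: fixed provisioned cost vs. per-request on-demand cost.
# Both figures are illustrative assumptions - plug in your own numbers.

PT_MONTHLY = 1800.0                  # approx. one model unit, 1-month commitment
ONDEMAND_COST_PER_REQUEST = 0.0195   # e.g. ~4K tokens in / 500 out on Sonnet

def breakeven_requests(pt_monthly=PT_MONTHLY, per_request=ONDEMAND_COST_PER_REQUEST):
    """Monthly request volume above which the provisioned custom model wins."""
    return pt_monthly / per_request

print(round(breakeven_requests()))  # 92308
```

Below roughly 90K requests per month in this scenario, prompt engineering or RAG on an on-demand model is the cheaper path.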

Tool Use and Function Calling

Tool use (also called function calling) allows models to request the execution of external functions during a conversation. You define the available tools with their parameters, the model decides when and how to call them based on the conversation context, and your application executes the function and returns the result. This is the foundation for building AI assistants that can interact with real systems.

python
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Define available tools
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_ec2_instances",
                "description": "Retrieves a list of EC2 instances with their current state",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "region": {
                                "type": "string",
                                "description": "AWS region (e.g., us-east-1)"
                            },
                            "state": {
                                "type": "string",
                                "enum": ["running", "stopped", "terminated"],
                                "description": "Filter by instance state"
                            }
                        },
                        "required": ["region"]
                    }
                }
            }
        },
        {
            "toolSpec": {
                "name": "get_cloudwatch_metric",
                "description": "Gets CloudWatch metric statistics for a resource",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "namespace": {"type": "string"},
                            "metric_name": {"type": "string"},
                            "instance_id": {"type": "string"},
                            "period_hours": {"type": "integer", "default": 1}
                        },
                        "required": ["namespace", "metric_name", "instance_id"]
                    }
                }
            }
        }
    ]
}

# Invoke with tool definitions
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[
        {
            "role": "user",
            "content": [{"text": "What is the CPU utilization of instance i-0abc123def456 in us-east-1?"}]
        }
    ],
    toolConfig=tool_config,
)

# Check if the model wants to use a tool
stop_reason = response["stopReason"]
if stop_reason == "tool_use":
    tool_block = response["output"]["message"]["content"]
    for block in tool_block:
        if "toolUse" in block:
            tool_name = block["toolUse"]["name"]
            tool_input = block["toolUse"]["input"]
            tool_use_id = block["toolUse"]["toolUseId"]
            print(f"Model wants to call: {tool_name}({tool_input})")

            # Execute the tool and return results
            # (Your implementation here)
            tool_result = {"cpu_utilization": 42.5, "period": "last_hour"}

            # Send tool result back to the model
            follow_up = bedrock.converse(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
                messages=[
                    {"role": "user", "content": [{"text": "What is the CPU utilization of instance i-0abc123def456?"}]},
                    {"role": "assistant", "content": tool_block},
                    {
                        "role": "user",
                        "content": [
                            {
                                "toolResult": {
                                    "toolUseId": tool_use_id,
                                    "content": [{"json": tool_result}],
                                }
                            }
                        ],
                    },
                ],
                toolConfig=tool_config,
            )
            print(follow_up["output"]["message"]["content"][0]["text"])

IAM Permissions for Bedrock

Bedrock uses standard IAM policies for access control. You need separate permissions for model invocation (runtime), model management (control plane), and agent/knowledge base operations. Note that the Converse and ConverseStream APIs authorize through the bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream actions; there are no separate Converse IAM actions. A well-designed permission model is critical because Bedrock can process sensitive data and generate content that your users will see.

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockModelInvocation",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-*",
        "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
      ]
    },
    {
      "Sid": "BedrockKnowledgeBase",
      "Effect": "Allow",
      "Action": [
        "bedrock:Retrieve",
        "bedrock:RetrieveAndGenerate"
      ],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:knowledge-base/*"
    },
    {
      "Sid": "BedrockAgentInvocation",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeAgent"
      ],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:agent-alias/*"
    },
    {
      "Sid": "BedrockGuardrails",
      "Effect": "Allow",
      "Action": [
        "bedrock:ApplyGuardrail"
      ],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:guardrail/*"
    }
  ]
}

Use Resource-Level Permissions

Restrict model access to specific models using resource ARNs rather than wildcards. This prevents users from accidentally invoking expensive models. For example, you might allow developers to use Claude Haiku for testing but restrict Claude Opus to production workloads with a separate role.

Monitoring and Observability

Monitoring Bedrock usage is essential for cost management and performance optimization. Bedrock publishes metrics to CloudWatch, and you can enable model invocation logging to capture full request and response payloads for debugging and audit purposes.

bash
# Enable model invocation logging
aws bedrock put-model-invocation-logging-configuration \
  --logging-config '{
    "cloudWatchConfig": {
      "logGroupName": "/aws/bedrock/model-invocations",
      "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
      "largeDataDeliveryS3Config": {
        "bucketName": "bedrock-invocation-logs",
        "keyPrefix": "large-payloads/"
      }
    },
    "textDataDeliveryEnabled": true,
    "imageDataDeliveryEnabled": false,
    "embeddingDataDeliveryEnabled": false
  }'

# View Bedrock CloudWatch metrics (the date flags below are BSD/macOS syntax;
# on GNU/Linux use: date -u -d '24 hours ago' '+%Y-%m-%dT%H:%M:%S')
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name Invocations \
  --dimensions Name=ModelId,Value=anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --start-time $(date -u -v-24H '+%Y-%m-%dT%H:%M:%S') \
  --end-time $(date -u '+%Y-%m-%dT%H:%M:%S') \
  --period 3600 \
  --statistics Sum \
  --output table

# Alarm on sustained high invocation latency (Bedrock publishes no cost
# metric; use AWS Budgets for spend alerts)
aws cloudwatch put-metric-alarm \
  --alarm-name bedrock-latency-alarm \
  --alarm-description "Alert when Bedrock invocation latency is elevated" \
  --namespace AWS/Bedrock \
  --metric-name InvocationLatency \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 5000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:bedrock-alerts"

# Query invocation logs with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name "/aws/bedrock/model-invocations" \
  --start-time $(date -u -v-1H '+%s') \
  --end-time $(date -u '+%s') \
  --query-string 'fields @timestamp, modelId, inputTokenCount, outputTokenCount
    | stats sum(inputTokenCount) as totalInput, sum(outputTokenCount) as totalOutput by modelId
    | sort totalOutput desc'

Cost Optimization Strategies

Bedrock costs can grow quickly in production, especially with large context windows and high request volumes. Understanding the pricing model and applying optimization strategies early can reduce your AI spend by 50-80% without sacrificing quality.

Model Selection by Use Case

Use Case                      | Recommended Model | Why
Simple classification/routing | Claude Haiku      | Fast, cheap ($0.25/$1.25 per M tokens)
General Q&A, summarization    | Claude Sonnet     | Best quality/cost ratio
Complex reasoning, coding     | Claude Opus       | Highest capability, use selectively
Embeddings                    | Titan Embed v2    | Low cost, good quality ($0.02 per M tokens)
High-volume, simple tasks     | Titan Text Lite   | Lowest cost, adequate for simple tasks

Batch Inference for Non-Real-Time Workloads

If you have large datasets to process (document summarization, classification, extraction), use Bedrock Batch Inference. Submit jobs with thousands of prompts in a single request, and Bedrock processes them asynchronously at up to 50% lower cost than real-time invocations. Results are delivered to S3 when complete.
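A sketch of submitting such a job through the CreateModelInvocationJob API: each input line is a JSON record pairing a `recordId` with the model's native request body (batch jobs use the InvokeModel format, not Converse). The bucket paths, job name, and role ARN below are placeholders:

```python
import json

def batch_record(record_id, prompt, max_tokens=512):
    """One JSONL line for a Claude batch job: recordId plus the native body."""
    return json.dumps({
        "recordId": record_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user",
                          "content": [{"type": "text", "text": prompt}]}],
        },
    })

def submit_batch(job_name, model_id, input_s3_uri, output_s3_uri, role_arn,
                 client=None):
    """Submit the job; results are written to output_s3_uri when complete."""
    if client is None:
        import boto3  # deferred so the helper imports without AWS setup
        client = boto3.client("bedrock", region_name="us-east-1")
    resp = client.create_model_invocation_job(
        jobName=job_name,
        modelId=model_id,
        roleArn=role_arn,
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return resp["jobArn"]
```

Write one `batch_record(...)` line per prompt to the input file in S3, then poll `get_model_invocation_job` on the returned ARN until the job completes.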

Production Architecture Patterns

Building production AI applications with Bedrock requires careful architecture decisions around reliability, security, and scalability. Here are the key patterns to follow.

VPC endpoints: Use a VPC endpoint for Bedrock to keep all traffic within the AWS network. This eliminates internet exposure for model invocations and is required by many compliance frameworks.

Caching: Implement response caching for common queries. Bedrock does not provide built-in caching, so use ElastiCache (Redis) or DynamoDB to cache responses keyed by a hash of the prompt and model parameters.
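The caching pattern can be sketched with a content-addressed key: hash the model ID, messages, and inference parameters together so identical requests hit the cache. The in-memory dict below stands in for ElastiCache or DynamoDB, and `invoke` is whatever function performs the real Converse call:

```python
import hashlib
import json

_cache = {}  # swap for ElastiCache (Redis) or DynamoDB in production

def cache_key(model_id, messages, params):
    """Deterministic key: SHA-256 of model + canonicalized prompt + params."""
    payload = json.dumps({"m": model_id, "msgs": messages, "p": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_converse(invoke, model_id, messages, params):
    """Return a cached response when one exists, else call invoke and store it."""
    key = cache_key(model_id, messages, params)
    if key not in _cache:
        _cache[key] = invoke(model_id, messages, params)
    return _cache[key]
```

Note that temperature belongs in the key: the same prompt at different temperatures should not share a cache entry. For user-facing chat, cache only deterministic, low-temperature queries.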

Rate limiting: Bedrock has per-model, per-region throttling limits. Implement client-side rate limiting and exponential backoff. For critical workloads, request quota increases through AWS Support or use Provisioned Throughput.

Multi-region failover: Deploy your application in multiple regions with a fallback model configuration. If us-east-1 throttles, automatically failover to us-west-2. Use Route 53 health checks to detect region-level issues.

python
import boto3
import hashlib
import json
from botocore.config import Config

# Configure retry behavior
bedrock_config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    read_timeout=120,
)

# Multi-region client setup
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
clients = {
    region: boto3.client("bedrock-runtime", region_name=region, config=bedrock_config)
    for region in REGIONS
}

def invoke_with_failover(messages, model_id, max_tokens=1024):
    """Invoke Bedrock with automatic region failover."""
    last_error = None
    for region in REGIONS:
        try:
            response = clients[region].converse(
                modelId=model_id,
                messages=messages,
                inferenceConfig={"maxTokens": max_tokens},
            )
            return response
        except clients[region].exceptions.ThrottlingException as e:
            last_error = e
            print(f"Throttled in {region}, trying next region...")
            continue
        except Exception as e:
            last_error = e
            print(f"Error in {region}: {e}")
            continue
    raise last_error

Common Pitfalls and Best Practices

After deploying dozens of Bedrock applications, several patterns emerge that separate successful deployments from problematic ones. Here are the most important lessons learned.

Do not put sensitive data in prompts without guardrails. If your application passes user-provided data to the model, implement input validation and use Guardrails to filter PII. A user might accidentally paste credit card numbers or passwords into a chat interface.

Monitor token usage, not just request counts. A single request with a 100K-token context window costs more than 100 requests with 1K-token contexts. Track both input and output tokens per request to understand your true cost profile.

Use system prompts effectively. A well-crafted system prompt reduces output token usage by making the model more concise and focused. Include explicit format instructions, length constraints, and domain context in the system prompt.

Test with multiple models. Bedrock's unified Converse API makes it trivial to switch models. Benchmark your use case across Claude, Titan, Llama, and Mistral to find the best quality/cost tradeoff for your specific task.

Implement proper error handling. Bedrock can return throttling errors, model timeout errors, and validation errors. Always implement exponential backoff with jitter for throttling, and set appropriate read timeouts for long-running generations.

Data Residency

Bedrock processes data in the region where you make the API call. If you have data residency requirements (GDPR, data sovereignty), ensure you invoke models in a compliant region. Bedrock does not use your data to train or improve foundation models, but you should still review the data processing terms for each model provider.

Next Steps

You now have a solid foundation for building AI applications on AWS Bedrock. The key concepts to internalize are: use the Converse API for model invocation, Knowledge Bases for RAG, Guardrails for safety, and Agents for action-taking capabilities. Start with a simple use case (document Q&A is the most common), prove value with a prototype, and incrementally add sophistication.

For your next steps, explore the Bedrock Playground in the console to experiment with different models and prompts interactively. Build a simple RAG application using a Knowledge Base backed by your own documentation. Then add Guardrails to make it production-ready.

Related guides: IAM Best Practices: Securing Your AWS Account · Lambda Performance Tuning Guide · AI Services Across Clouds: Bedrock vs Azure OpenAI vs Vertex AI

Key Takeaways

  1. Bedrock provides a unified API (Converse) for accessing multiple foundation models from different providers.
  2. Knowledge Bases automate the RAG pipeline: ingestion, chunking, embedding, and retrieval.
  3. Guardrails provide content filtering, PII detection, and topic restrictions independent of the model.
  4. Agents combine model reasoning with tool use for action-taking AI assistants.
  5. Fine-tuning requires Provisioned Throughput and is best reserved for high-volume, domain-specific use cases.
  6. Multi-region deployment with failover is essential for production reliability.

Frequently Asked Questions

What is the difference between Bedrock and SageMaker?
Bedrock is for consuming pre-built foundation models via API without managing infrastructure. SageMaker is for training, fine-tuning, and deploying custom ML models with full control over the training pipeline and hosting infrastructure. Use Bedrock for generative AI applications and SageMaker for custom ML workloads.
Which Bedrock model should I start with?
Start with Claude 3.5 Sonnet for general-purpose tasks. It offers the best quality-to-cost ratio. Use Haiku for simple classification and routing tasks where speed and cost matter most. Use the Converse API so you can easily switch models later.
Does Bedrock use my data for training?
No. AWS Bedrock does not use your inputs or outputs to train or improve foundation models. Your data stays within your AWS account and is encrypted in transit and at rest.
How do I reduce Bedrock costs?
Use the cheapest model that meets your quality requirements (Haiku for simple tasks, Sonnet for complex ones). Implement response caching for common queries. Use Batch Inference for non-real-time workloads at up to 50% discount. Minimize prompt sizes by removing unnecessary context.
Can I use Bedrock with a VPC endpoint?
Yes. Create a VPC endpoint for the bedrock-runtime service to keep all model invocation traffic within your VPC. This is recommended for production deployments and required by many compliance frameworks.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.