
Azure OpenAI Service Guide

Complete guide to Azure OpenAI Service covering deployment, GPT-4o, embeddings, RAG with Azure AI Search, content filtering, private networking, and cost optimization.

CloudToolStack Team · 24 min read · Published Mar 14, 2026

Prerequisites

  • Azure subscription with Azure OpenAI access approved
  • Basic understanding of LLMs and prompt engineering
  • Familiarity with Azure resource management

Introduction to Azure OpenAI Service

Azure OpenAI Service provides REST API access to OpenAI's powerful language models including GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, DALL-E 3, Whisper, and text embedding models. Unlike using OpenAI directly, Azure OpenAI runs within Azure's infrastructure, providing enterprise features: private networking via VNet integration, managed identity authentication, content filtering, regional data residency, and compliance with standards like SOC 2, HIPAA, and ISO 27001.

For organizations that already use Azure, Azure OpenAI is the natural choice for AI workloads because it integrates with existing identity management (Entra ID), networking (Private Endpoints), monitoring (Azure Monitor), and security controls. Your data stays within your Azure boundary and is not used to train or improve OpenAI models.

This guide covers everything from provisioning your first Azure OpenAI resource to building production RAG applications: resource creation, model deployment, API usage, embeddings, content filtering, RAG with Azure AI Search, prompt engineering patterns, monitoring, and cost optimization.

Access Requirements

Azure OpenAI requires an approved Azure subscription. Access is granted through an application process at aka.ms/oai/access. Once approved, you can create Azure OpenAI resources in supported regions. Approval typically takes 1-5 business days. Enterprise customers with Microsoft field engagement often get expedited access.

Creating an Azure OpenAI Resource

An Azure OpenAI resource is a container within your Azure subscription that hosts your model deployments. Each resource has a unique endpoint URL, supports multiple model deployments, and has its own rate limits. You can create multiple resources in different regions for redundancy and to increase aggregate throughput.

bash
# Create a resource group
az group create \
  --name "rg-openai" \
  --location "eastus"

# Create the Azure OpenAI resource
az cognitiveservices account create \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --kind "OpenAI" \
  --sku "S0" \
  --location "eastus" \
  --custom-domain "openai-contoso-eastus" \
  --tags "Environment=production" "Team=platform"

# Get the endpoint and keys
ENDPOINT=$(az cognitiveservices account show \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --query "properties.endpoint" -o tsv)

KEY=$(az cognitiveservices account keys list \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --query "key1" -o tsv)

echo "Endpoint: $ENDPOINT"

# Deploy GPT-4o model
az cognitiveservices account deployment create \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --deployment-name "gpt-4o" \
  --model-name "gpt-4o" \
  --model-version "2024-11-20" \
  --model-format "OpenAI" \
  --sku-name "Standard" \
  --sku-capacity 80

# Deploy an embedding model
az cognitiveservices account deployment create \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --deployment-name "text-embedding-3-large" \
  --model-name "text-embedding-3-large" \
  --model-version "1" \
  --model-format "OpenAI" \
  --sku-name "Standard" \
  --sku-capacity 120

# List all deployments
az cognitiveservices account deployment list \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --query "[].{Name:name, Model:properties.model.name, Version:properties.model.version, Capacity:sku.capacity}" \
  -o table

Regional Model Availability

Not all models are available in every Azure region. GPT-4o is available in East US, West US, Sweden Central, and a few other regions. Check the Azure OpenAI model availability documentation for the latest matrix. For production workloads, deploy resources in multiple regions to enable failover and increase total throughput capacity.

Using the Chat Completions API

The Chat Completions API is the primary interface for interacting with GPT models. It accepts a series of messages (system, user, assistant) and returns a model-generated response. The API is compatible with OpenAI's API, so you can use the OpenAI Python SDK with Azure-specific configuration.

python
from openai import AzureOpenAI

# Initialize the client with Azure configuration
client = AzureOpenAI(
    api_key="your-api-key",  # Or use DefaultAzureCredential
    api_version="2024-10-21",
    azure_endpoint="https://openai-contoso-eastus.openai.azure.com"
)

# Simple chat completion
response = client.chat.completions.create(
    model="gpt-4o",  # This is the deployment name, not the model name
    messages=[
        {
            "role": "system",
            "content": "You are a cloud architect assistant. Provide concise, actionable advice."
        },
        {
            "role": "user",
            "content": "What is the best way to set up disaster recovery for an Azure SQL Database?"
        }
    ],
    temperature=0.3,
    max_tokens=1024,
    top_p=0.95,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")

# Using Entra ID authentication (recommended for production).
# A token provider refreshes tokens automatically; a static azure_ad_token
# expires after about an hour and breaks long-running applications.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21",
    azure_endpoint="https://openai-contoso-eastus.openai.azure.com"
)

# Streaming response for real-time output
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain Azure Virtual WAN in 5 bullet points."}
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Use Entra ID Authentication

API key authentication works but is not recommended for production. Use Entra ID (Azure AD) authentication with managed identities. Assign the "Cognitive Services OpenAI User" role to the managed identity of your application. This eliminates API key management, provides automatic credential rotation, and integrates with Azure's RBAC and audit logging.

Embeddings and Vector Search

Embeddings convert text into high-dimensional vectors that capture semantic meaning. Similar texts produce vectors that are close together in vector space, enabling semantic search, clustering, and recommendation systems. Azure OpenAI offers the text-embedding-3-large model (3,072 dimensions) and text-embedding-3-small (1,536 dimensions).

python
# Generate embeddings
def get_embeddings(texts, model="text-embedding-3-large"):
    """Generate embeddings for a list of texts."""
    response = client.embeddings.create(
        model=model,
        input=texts,
        dimensions=1536,  # Reduce dimensions for cost/speed (optional)
    )
    return [item.embedding for item in response.data]

# Example: Embed documents
documents = [
    "Azure Kubernetes Service provides managed Kubernetes clusters",
    "Azure Functions is a serverless compute platform",
    "Azure SQL Database is a managed relational database service",
    "Azure Cosmos DB is a globally distributed NoSQL database",
]

embeddings = get_embeddings(documents)
print(f"Generated {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions")

# Calculate cosine similarity between query and documents
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = "I need a database that works across multiple regions"
query_embedding = get_embeddings([query])[0]

similarities = [
    (doc, cosine_similarity(query_embedding, emb))
    for doc, emb in zip(documents, embeddings)
]

# Sort by similarity (most relevant first)
for doc, score in sorted(similarities, key=lambda x: x[1], reverse=True):
    print(f"  {score:.4f}: {doc}")

Building RAG with Azure AI Search

Retrieval-Augmented Generation (RAG) combines vector search with LLM generation. Azure AI Search (formerly Azure Cognitive Search) is the recommended vector store for Azure OpenAI RAG applications. It supports hybrid search (combining vector similarity with traditional keyword search), semantic ranking, and integrated vectorization that automatically generates embeddings during indexing.

python
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticSearch,
    SemanticField,
    SemanticPrioritizedFields,
)
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Create a search index with vector fields
index_client = SearchIndexClient(
    endpoint="https://search-contoso.search.windows.net",
    credential=credential,
)

index = SearchIndex(
    name="knowledge-base",
    fields=[
        SearchField(name="id", type=SearchFieldDataType.String, key=True),
        SearchField(name="content", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="title", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="source", type=SearchFieldDataType.String, filterable=True),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="vector-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
        profiles=[
            VectorSearchProfile(
                name="vector-profile",
                algorithm_configuration_name="hnsw-config",
            )
        ],
    ),
    semantic_search=SemanticSearch(
        configurations=[
            SemanticConfiguration(
                name="semantic-config",
                prioritized_fields=SemanticPrioritizedFields(
                    content_fields=[SemanticField(field_name="content")],
                    title_field=SemanticField(field_name="title"),
                ),
            )
        ]
    ),
)

index_client.create_or_update_index(index)

# Index documents with embeddings
search_client = SearchClient(
    endpoint="https://search-contoso.search.windows.net",
    index_name="knowledge-base",
    credential=credential,
)

documents_to_index = []
for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
    documents_to_index.append({
        "id": str(i),
        "content": doc,
        "title": f"Document {i}",
        "source": "internal-docs",
        "content_vector": embedding,
    })

search_client.upload_documents(documents_to_index)

# RAG query: retrieve relevant documents and generate answer
def rag_query(question):
    # Step 1: Embed the question
    query_vector = get_embeddings([question])[0]

    # Step 2: Hybrid search (vector + keyword)
    results = search_client.search(
        search_text=question,
        vector_queries=[{
            "vector": query_vector,
            "k_nearest_neighbors": 5,
            "fields": "content_vector",
        }],
        query_type="semantic",
        semantic_configuration_name="semantic-config",
        top=5,
    )

    # Step 3: Build context from search results
    context = "\n\n".join([r["content"] for r in results])

    # Step 4: Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"""Answer the user's question based on the following context.
If the context doesn't contain the answer, say so.

Context:
{context}"""
            },
            {"role": "user", "content": question}
        ],
        temperature=0.1,
        max_tokens=1024,
    )

    return response.choices[0].message.content

answer = rag_query("Which Azure service supports global distribution?")
print(answer)
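
The example above indexes short, single-sentence documents. Real knowledge bases contain long documents that should be split into overlapping chunks before embedding, so each vector represents one focused passage. A minimal character-based chunker (the 1,000/200 sizes are illustrative defaults, not an Azure requirement; production pipelines often chunk by tokens or by document structure instead):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks of roughly chunk_size characters.
    Overlap preserves context that would otherwise be cut at chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then embedded and indexed as its own search document
chunks = chunk_text("A" * 2500, chunk_size=1000, overlap=200)
print(len(chunks))  # 4 chunks, starting at offsets 0, 800, 1600, 2400
```

At query time, retrieval returns the most relevant chunks rather than whole documents, which keeps the prompt context small and focused.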

On Your Data Feature

Azure OpenAI also provides an "On Your Data" feature that handles the RAG pipeline automatically. You connect an Azure AI Search index, and the API handles retrieval and citation generation. This is faster to set up but gives you less control over the retrieval and prompting logic. Use the custom RAG approach above for production applications where you need fine-grained control.
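
As a sketch of what an "On Your Data" request looks like: the data source is passed in the request body alongside the normal chat parameters (with the Python SDK this goes through `extra_body`). The field names below follow the `azure_search` data source schema, but the schema has changed across API versions, so verify them against the version you target:

```python
def build_on_your_data_body(search_endpoint, index_name, semantic_config=None):
    """Build the extra request body for an 'On Your Data' chat completion
    backed by an Azure AI Search index. Field names follow the azure_search
    data source schema; verify against your API version."""
    parameters = {
        "endpoint": search_endpoint,
        "index_name": index_name,
        # Managed identity avoids storing a search admin key in the app
        "authentication": {"type": "system_assigned_managed_identity"},
    }
    if semantic_config:
        parameters["query_type"] = "semantic"
        parameters["semantic_configuration"] = semantic_config
    return {"data_sources": [{"type": "azure_search", "parameters": parameters}]}

body = build_on_your_data_body(
    "https://search-contoso.search.windows.net",
    "knowledge-base",
    semantic_config="semantic-config",
)

# Passed alongside the normal chat parameters, e.g.:
# client.chat.completions.create(model="gpt-4o", messages=[...], extra_body=body)
```

The response then includes citations for the retrieved passages, which is the main convenience over the hand-rolled pipeline above.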

Content Filtering

Azure OpenAI includes built-in content filtering that evaluates both inputs and outputs for harmful content categories: hate, sexual, violence, and self-harm. Each category has configurable severity thresholds (low, medium, high). You can also configure jailbreak detection, protected material detection, and custom blocklists.

bash
# Create a custom content filter configuration
az cognitiveservices account deployment update \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --deployment-name "gpt-4o" \
  --content-filter "custom-filter-v1"

# Content filter configuration (via REST API)
# POST https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/
#   providers/Microsoft.CognitiveServices/accounts/{account}/
#   raiPolicies/{policyName}?api-version=2024-10-01

# Example filter configuration JSON:
# {
#   "properties": {
#     "basePolicyName": "Microsoft.DefaultV2",
#     "contentFilters": [
#       {"name": "hate", "blocking": true, "severity": "medium", "source": "prompt"},
#       {"name": "hate", "blocking": true, "severity": "medium", "source": "completion"},
#       {"name": "sexual", "blocking": true, "severity": "high", "source": "prompt"},
#       {"name": "sexual", "blocking": true, "severity": "high", "source": "completion"},
#       {"name": "violence", "blocking": true, "severity": "medium", "source": "prompt"},
#       {"name": "violence", "blocking": true, "severity": "medium", "source": "completion"},
#       {"name": "selfharm", "blocking": true, "severity": "medium", "source": "prompt"},
#       {"name": "selfharm", "blocking": true, "severity": "medium", "source": "completion"},
#       {"name": "jailbreak", "blocking": true, "source": "prompt"},
#       {"name": "protected_material_text", "blocking": true, "source": "completion"}
#     ]
#   }
# }
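
When a prompt or completion is blocked, the service returns an HTTP 400 whose error payload carries a content_filter code plus per-category results, and applications should catch this and degrade gracefully rather than surface a raw error. A sketch of inspecting such a payload (the exact field names can vary by API version, so treat this shape as illustrative):

```python
def blocked_categories(error_body):
    """Return the content filter categories that triggered a block, given the
    JSON error body of a 400 'content_filter' response. Field names are
    illustrative; verify against your API version."""
    error = error_body.get("error", {})
    if error.get("code") != "content_filter":
        return []
    results = error.get("innererror", {}).get("content_filter_result", {})
    return sorted(
        category for category, result in results.items()
        if isinstance(result, dict) and result.get("filtered")
    )

# Example payload in the shape returned for a filtered prompt
sample = {
    "error": {
        "code": "content_filter",
        "message": "The response was filtered",
        "innererror": {
            "code": "ResponsibleAIPolicyViolation",
            "content_filter_result": {
                "hate": {"filtered": False, "severity": "safe"},
                "violence": {"filtered": True, "severity": "medium"},
            },
        },
    }
}

print(blocked_categories(sample))  # ['violence']
```

In a chat application, a non-empty result would typically map to a user-facing message like "this request was blocked by our content policy" instead of a retry.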

Private Networking

For production deployments, Azure OpenAI should be accessed through Private Endpoints, ensuring all traffic stays within your Azure VNet and never traverses the public internet. This is a compliance requirement for many organizations and a security best practice regardless.

bash
# Disable public network access
az cognitiveservices account update \
  --name "openai-contoso-eastus" \
  --resource-group "rg-openai" \
  --public-network-access "Disabled"

# Create a Private Endpoint
az network private-endpoint create \
  --name "pe-openai-eastus" \
  --resource-group "rg-openai" \
  --vnet-name "vnet-app-eastus" \
  --subnet "snet-private-endpoints" \
  --private-connection-resource-id $(az cognitiveservices account show \
    --name "openai-contoso-eastus" \
    --resource-group "rg-openai" \
    --query id -o tsv) \
  --group-ids "account" \
  --connection-name "openai-connection"

# Create Private DNS zone for name resolution
az network private-dns zone create \
  --resource-group "rg-openai" \
  --name "privatelink.openai.azure.com"

az network private-dns link vnet create \
  --resource-group "rg-openai" \
  --zone-name "privatelink.openai.azure.com" \
  --name "openai-dns-link" \
  --virtual-network "vnet-app-eastus" \
  --registration-enabled false

az network private-endpoint dns-zone-group create \
  --resource-group "rg-openai" \
  --endpoint-name "pe-openai-eastus" \
  --name "openai-dns-group" \
  --private-dns-zone "privatelink.openai.azure.com" \
  --zone-name "openai"

Monitoring and Cost Management

Azure OpenAI publishes metrics to Azure Monitor including request count, token usage, latency, and HTTP errors. Enable diagnostic logging to capture detailed per-request information including prompts and completions (be careful with PII in logs).

bash
# Enable diagnostic logging
az monitor diagnostic-settings create \
  --name "openai-diagnostics" \
  --resource $(az cognitiveservices account show \
    --name "openai-contoso-eastus" \
    --resource-group "rg-openai" \
    --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show \
    --workspace-name "law-central-eastus" \
    --resource-group "rg-management" \
    --query id -o tsv) \
  --logs '[
    {"category": "Audit", "enabled": true},
    {"category": "RequestResponse", "enabled": true},
    {"category": "Trace", "enabled": true}
  ]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'

# View token usage metrics
# (date -u -v-24H is BSD/macOS syntax; on GNU/Linux use: date -u -d '24 hours ago')
az monitor metrics list \
  --resource $(az cognitiveservices account show \
    --name "openai-contoso-eastus" \
    --resource-group "rg-openai" \
    --query id -o tsv) \
  --metric "TokenTransaction" \
  --interval "PT1H" \
  --start-time $(date -u -v-24H '+%Y-%m-%dT%H:%M:%SZ') \
  --end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
  --aggregation "Total" \
  --dimension "ModelDeploymentName"

# KQL query for token usage analysis (in Log Analytics)
# AzureDiagnostics
# | where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
# | where Category == "RequestResponse"
# | extend model = tostring(parse_json(properties_s).modelDeploymentName)
# | extend promptTokens = toint(parse_json(properties_s).promptTokens)
# | extend completionTokens = toint(parse_json(properties_s).completionTokens)
# | summarize TotalPromptTokens=sum(promptTokens),
#             TotalCompletionTokens=sum(completionTokens),
#             RequestCount=count()
#   by model, bin(TimeGenerated, 1h)
# | order by TimeGenerated desc

Cost Comparison

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | General purpose, best value |
| GPT-4o mini | $0.15 | $0.60 | Simple tasks, high volume |
| GPT-4 Turbo | $10.00 | $30.00 | Legacy, use GPT-4o instead |
| text-embedding-3-large | $0.13 | N/A | High-quality embeddings |
| text-embedding-3-small | $0.02 | N/A | Cost-effective embeddings |

Provisioned Throughput Units (PTUs)

For predictable, high-volume workloads, consider Provisioned Throughput Units instead of pay-per-token. PTUs provide guaranteed throughput capacity and can be 40-60% cheaper than pay-per-token at high volumes. Each PTU provides a fixed number of tokens per minute regardless of demand. You need at least 50 PTUs per deployment.
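
A quick way to sanity-check the PTU decision is to compare monthly costs at your expected volume. The token prices below come from the table above; the PTU hourly rate is a placeholder assumption, since actual PTU pricing depends on model, region, and commitment term — substitute your negotiated rate:

```python
def paygo_monthly_cost(prompt_tokens, completion_tokens,
                       input_price=2.50, output_price=10.00):
    """Pay-per-token monthly cost in USD (prices per 1M tokens; GPT-4o rates)."""
    return (prompt_tokens / 1e6) * input_price + (completion_tokens / 1e6) * output_price

def ptu_monthly_cost(ptus, hourly_rate_per_ptu=1.00, hours=730):
    """PTU monthly cost. hourly_rate_per_ptu is a PLACEHOLDER assumption --
    real PTU pricing varies by model, region, and commitment term."""
    return ptus * hourly_rate_per_ptu * hours

# Example: 2B prompt tokens and 500M completion tokens per month on GPT-4o,
# versus the minimum 50-PTU deployment at the assumed rate
paygo = paygo_monthly_cost(2_000_000_000, 500_000_000)
ptu = ptu_monthly_cost(50)
print(f"Pay-per-token: ${paygo:,.0f}/month, 50 PTUs: ${ptu:,.0f}/month")
```

The break-even point is wherever these two curves cross for your real rates; below it, stay on pay-per-token.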

Production Architecture

A production Azure OpenAI deployment requires careful architecture for reliability, security, and cost management. Here are the key patterns.

Multi-region deployment: Deploy Azure OpenAI resources in at least two regions. Use Azure API Management or a custom load balancer to route traffic. Implement circuit breaker patterns to failover when a region is throttled or unhealthy.

API Management gateway: Place Azure API Management in front of Azure OpenAI to handle authentication, rate limiting, request/response logging, caching, and multi-backend load balancing. APIM can distribute requests across multiple OpenAI resources to multiply your effective quota.

Response caching: Cache responses for identical or semantically similar prompts using Azure Cache for Redis. This reduces costs and latency for frequently asked questions.

python
import hashlib
import json
import redis
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential

# Redis cache for response caching (Azure Cache for Redis requires the access key)
cache = redis.Redis(
    host="redis-contoso.redis.cache.windows.net",
    port=6380,
    ssl=True,
    password="<redis-access-key>",
)

# Multi-region failover client
REGIONS = {
    "eastus": "https://openai-contoso-eastus.openai.azure.com",
    "westus": "https://openai-contoso-westus.openai.azure.com",
}

def get_cached_completion(messages, model="gpt-4o", temperature=0, max_tokens=1024):
    """Get completion with caching and multi-region failover."""

    # Generate cache key from messages
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model, "temp": temperature}).encode()
    ).hexdigest()

    # Check cache first
    cached = cache.get(f"openai:{cache_key}")
    if cached:
        return json.loads(cached)

    # Try each region
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")

    for region, endpoint in REGIONS.items():
        try:
            client = AzureOpenAI(
                azure_ad_token=token.token,
                api_version="2024-10-21",
                azure_endpoint=endpoint,
            )
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
            )
            result = {
                "content": response.choices[0].message.content,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                },
                "region": region,
            }
            # Cache for 1 hour (only cache deterministic responses)
            if temperature == 0:
                cache.setex(f"openai:{cache_key}", 3600, json.dumps(result))
            return result
        except Exception as e:
            print(f"Region {region} failed: {e}")
            continue

    raise Exception("All regions failed")

Next Steps

You now have a comprehensive understanding of Azure OpenAI Service. Start by creating a resource and deploying GPT-4o, then build a simple chat application. Once that works, add RAG with Azure AI Search to ground responses in your own data, and implement content filtering and private networking for production readiness.

Explore Azure AI Studio for a visual interface to experiment with models, build prompt flows, and evaluate model performance. For complex AI applications, look into Semantic Kernel (Microsoft's orchestration framework) which provides abstractions for chaining prompts, using tools, and managing conversation memory.


Key Takeaways

  1. Azure OpenAI provides OpenAI models (GPT-4o, embeddings) with enterprise Azure security and compliance.
  2. Use Entra ID authentication with managed identities instead of API keys for production.
  3. Azure AI Search with hybrid search (vector + keyword + semantic ranking) provides best-in-class RAG.
  4. Content filtering evaluates both inputs and outputs for harmful content with configurable thresholds.
  5. Private Endpoints keep all traffic within your VNet for compliance requirements.
  6. Multi-region deployment with API Management provides failover and increased aggregate throughput.

Frequently Asked Questions

What is the difference between Azure OpenAI and OpenAI direct?
Azure OpenAI runs on Azure infrastructure with enterprise features: private networking, Entra ID authentication, content filtering, data residency, and compliance certifications (SOC 2, HIPAA). The API is compatible with the OpenAI Python SDK. Data is not used for model training.
How do I get access to Azure OpenAI?
Apply at aka.ms/oai/access. Approval typically takes 1-5 business days. You need a valid Azure subscription. Enterprise customers with Microsoft field engagement often get expedited access.
Should I use PTU or pay-per-token?
Use pay-per-token for development and variable workloads. PTU (Provisioned Throughput Units) provides guaranteed capacity at lower per-token cost for high-volume production workloads. PTU requires a minimum commitment of 50 units.
How do I implement RAG with Azure OpenAI?
The recommended approach is Azure AI Search with vector fields, semantic ranking, and hybrid search. Index your documents with embeddings, then query AI Search at inference time to retrieve relevant chunks. Pass the chunks as context in the system prompt.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.