Event-Driven Architecture on AWS, Azure, and GCP: Patterns That Scale
Compare EventBridge, Event Grid, and Eventarc with practical patterns for order processing, real-time analytics, and cross-service communication.
Why Event-Driven Architecture Is Winning
Synchronous request-response architectures hit a wall as systems grow. Service A calls Service B, which calls Service C, which calls Service D. If Service D is slow, every service in the chain is slow. If Service D is down, every service in the chain is either down or degraded. Add retry logic and the system becomes a cascading failure generator. You end up with distributed monolith behavior -- all the complexity of microservices with none of the independence.
Event-driven architecture breaks this coupling. Instead of Service A calling Service B directly, Service A emits an event ("order placed"), and any number of downstream services can react to that event independently. Service B processes the payment. Service C updates inventory. Service D sends a confirmation email. None of them know about each other, and none of them are blocked by the others. If Service D is slow, emails go out late, but the order is still processed and inventory is still updated.
This guide covers the practical implementation of event-driven patterns on AWS, Azure, and GCP. We compare the native event routing services (EventBridge, Event Grid, Eventarc), the pub/sub messaging systems, and the real patterns that work for order processing, real-time analytics, and cross-service communication. The focus is on decisions that matter in production: when to use choreography versus orchestration, how to handle failures, and what the actual costs look like.
AWS EventBridge: The Event Router
Amazon EventBridge is the central nervous system for event-driven architectures on AWS. It receives events from AWS services (over 90 sources generate EventBridge events natively), SaaS integrations, and your own applications, then routes them to targets based on pattern-matching rules. Each event bus supports up to 300 rules by default (a soft quota that can be raised), and each rule can route to up to five targets.
EventBridge's strength is its content-based routing. Rules match on any field in the event JSON, including nested fields, using exact match, prefix match, numeric comparison, and exists/not-exists conditions. This means you can create sophisticated routing logic without writing code. For example, a single rule can match events where the source is "orders-service", the detail-type is "OrderPlaced", the detail.amount is greater than 1000, and the detail.region is "us-east-1" -- routing high-value US orders to a special processing pipeline.
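That high-value-order rule can be written as a standard EventBridge event pattern. As a rough sketch, the Python below pairs the real pattern syntax with a toy matcher -- a simplified stand-in for EventBridge's own matching engine that covers only exact-match lists and `numeric` greater-than conditions:

```python
# The pattern below is real EventBridge syntax; match() is a toy
# illustration of content-based routing, not the AWS implementation.

pattern = {
    "source": ["orders-service"],
    "detail-type": ["OrderPlaced"],
    "detail": {
        "amount": [{"numeric": [">", 1000]}],
        "region": ["us-east-1"],
    },
}

def match(pattern, event):
    for key, condition in pattern.items():
        if isinstance(condition, dict):            # nested pattern
            if not match(condition, event.get(key, {})):
                return False
        else:                                      # list of allowed values
            value = event.get(key)
            ok = False
            for alt in condition:
                if isinstance(alt, dict) and "numeric" in alt:
                    op, bound = alt["numeric"]
                    ok = ok or (op == ">" and isinstance(value, (int, float))
                                and value > bound)
                else:
                    ok = ok or value == alt
            if not ok:
                return False
    return True

event = {
    "source": "orders-service",
    "detail-type": "OrderPlaced",
    "detail": {"amount": 1500, "region": "us-east-1"},
}
print(match(pattern, event))  # True: routed to the high-value pipeline
```

A $500 order, or one from another region, fails the pattern and is simply not routed to that target.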
EventBridge pricing is straightforward: $1.00 per million events published to a custom event bus. AWS service events published to the default event bus are free. Schema discovery and the schema registry are free. Event replay (replaying historical events for reprocessing) costs $0.10 per million events replayed. For most applications, EventBridge costs are negligible compared to the compute costs of processing events.
EventBridge Pipes
EventBridge Pipes is a newer feature that connects event sources to targets with optional filtering, enrichment, and transformation steps in between. The source can be SQS, Kinesis, DynamoDB Streams, Kafka, or MQ. The target can be any EventBridge target. The enrichment step can call a Lambda function, Step Functions, API Gateway, or API Destination to add data to the event before it reaches the target.
Pipes are most useful for connecting data sources that do not natively publish to EventBridge. For example, a DynamoDB Stream that captures changes to an orders table can feed into an EventBridge Pipe that enriches each change event with customer data from another service, then routes the enriched event to different targets based on the order status. Without Pipes, you would write a Lambda function to read from the DynamoDB Stream, fetch enrichment data, and publish to EventBridge -- Pipes eliminates that glue code.
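The glue Lambda that a Pipe replaces usually looks something like the sketch below. The record shape follows DynamoDB Streams' attribute-value format, but the customer lookup and field names are hypothetical stand-ins, and a real handler would publish the result with boto3's events client:

```python
# Sketch of the glue code an EventBridge Pipe can replace. The
# customer lookup is a stand-in for a call to another service.

def fetch_customer(customer_id):
    # Hypothetical enrichment call to a customer service.
    return {"id": customer_id, "tier": "gold"}

def enrich(stream_record):
    order = stream_record["dynamodb"]["NewImage"]
    customer = fetch_customer(order["customerId"]["S"])
    return {
        "Source": "orders-service",
        "DetailType": "OrderChanged",
        "Detail": {
            "orderId": order["orderId"]["S"],
            "status": order["status"]["S"],
            "customerTier": customer["tier"],  # the enrichment step
        },
    }

record = {
    "dynamodb": {
        "NewImage": {
            "orderId": {"S": "o-1"},
            "status": {"S": "PLACED"},
            "customerId": {"S": "c-9"},
        }
    }
}
print(enrich(record)["Detail"]["customerTier"])  # gold
```

With a Pipe, the filtering, enrichment call, and EventBridge publish are configuration instead of code like this.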
EventBridge Scheduler
EventBridge Scheduler (separate from EventBridge rules with schedule expressions) supports one-time and recurring schedules with at-least-once delivery, automatic retries, and dead-letter queues. It handles millions of scheduled events and is the right choice for future-dated actions: sending a reminder email 24 hours after signup, triggering a payment retry 3 days after failure, or running a cleanup job every night at 2 AM.
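For the signup-reminder case, the one-time schedule expression can be computed like this. The `at(...)` format is EventBridge Scheduler's one-time syntax; the signup timestamp is illustrative, and the actual `create_schedule` API call is omitted:

```python
from datetime import datetime, timedelta, timezone

# EventBridge Scheduler one-time schedules use an at() expression,
# interpreted in the schedule's configured timezone. This computes
# "24 hours after signup" for an illustrative signup time.
signup_time = datetime(2025, 3, 1, 9, 30, tzinfo=timezone.utc)
remind_at = signup_time + timedelta(hours=24)
expression = f"at({remind_at.strftime('%Y-%m-%dT%H:%M:%S')})"
print(expression)  # at(2025-03-02T09:30:00)
```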
Azure Event Grid: Native Event Routing for Azure
Azure Event Grid is Microsoft's event routing service, conceptually similar to EventBridge but with some architectural differences. Event Grid uses a topic-subscription model. Event sources publish to topics (system topics for Azure service events, custom topics for your applications), and subscriptions define which events to deliver to which endpoints based on filtering criteria.
Event Grid's filtering supports subject filtering (prefix and suffix matching on the subject field), event type filtering, and advanced filtering on event data fields. The advanced filtering is less flexible than EventBridge's pattern matching -- it supports string equals, string begins with, string ends with, number comparisons, boolean, and null checks, but not regex or nested field matching.
Pricing is competitive: the first 100,000 operations per month are free, then $0.60 per million operations. An "operation" is an event publish or delivery attempt. For most applications, Event Grid costs are minimal. The free tier alone covers many development and low-traffic production workloads.
Event Grid Namespaces (MQTT and Pull Delivery)
Event Grid Namespaces is a newer capability that adds MQTT broker support and pull-based delivery. The MQTT support is significant for IoT workloads -- devices can publish MQTT messages directly to Event Grid without a separate MQTT broker, and Event Grid routes them to Azure services. Pull delivery lets consumers request events on their own schedule rather than receiving push notifications, which is useful for batch processing and consumers that cannot expose public HTTP endpoints.
GCP Eventarc: Unified Event Delivery
Eventarc is Google Cloud's event delivery service. It routes events from over 130 Google Cloud sources (via Cloud Audit Logs and direct event publishing) to Cloud Run, Cloud Functions, GKE, and Workflows targets. Eventarc uses the CloudEvents specification as its standard event format, which makes it the most standards-compliant option of the three.
Eventarc's event sources fall into three categories: direct events (published by services like Cloud Storage, Pub/Sub, and Firebase), Cloud Audit Log events (any audited API call in GCP can be an event source), and third-party events (via Eventarc's integration with partner event providers). The Cloud Audit Log integration is particularly powerful -- it means any action in GCP can trigger an event without the service explicitly supporting Eventarc. If someone creates a VM, modifies a firewall rule, or grants an IAM permission, you can capture and react to that event.
Eventarc itself is free. You pay for the underlying transport (Pub/Sub) and the compute targets. This makes it the cheapest option for event routing at low volumes, since Pub/Sub's free tier covers the first 10 GB of messages per month.
Pub/Sub Messaging: The Foundation Layer
Event routing services like EventBridge, Event Grid, and Eventarc handle the routing and filtering of events. But for reliable, high-throughput messaging between services, you typically need a dedicated pub/sub messaging system underneath.
AWS: SQS + SNS
The classic AWS pattern is SNS (fan-out) plus SQS (queuing). SNS publishes a message to a topic, which delivers copies to multiple SQS queue subscribers. Each consumer reads from its own SQS queue at its own pace. Failed messages go to dead-letter queues after a configurable number of retry attempts. This pattern provides exactly the decoupling that event-driven architecture requires: publishers and consumers are completely independent.
SQS Standard queues deliver at least once with best-effort ordering, handling virtually unlimited throughput. SQS FIFO queues guarantee exactly-once processing and strict ordering within a message group, but are limited to 3,000 messages per second (with batching). For most event-driven workloads, Standard queues are the right choice -- design your consumers to be idempotent (processing the same event twice produces the same result) and the at-least-once delivery is not a problem.
Azure: Service Bus
Azure Service Bus is a full-featured enterprise message broker supporting queues, topics with subscriptions, and sessions. It supports both at-least-once and at-most-once delivery, message ordering, message deferral, scheduled delivery, and transactions that span multiple operations. Service Bus Premium tier runs on dedicated infrastructure and supports messages up to 100 MB.
For event-driven architectures, Service Bus topics with subscriptions are the equivalent of SNS+SQS. Each subscription can have filter rules that select a subset of messages, so different consumers receive only the events they care about. The difference from SNS+SQS is that Service Bus handles both the fan-out and the queuing in a single service, which reduces the number of components to manage.
GCP: Pub/Sub
Google Cloud Pub/Sub is the simplest messaging system of the three. Topics publish messages, subscriptions receive them. Each subscription receives a copy of every message (fan-out), and each message within a subscription is delivered to one consumer (load balancing). Pub/Sub scales automatically, supports messages up to 10 MB, and retains unacknowledged messages for up to 7 days.
Pub/Sub's standout feature is ordering keys. Messages published with the same ordering key are delivered in order to the subscriber. Messages with different ordering keys (or no ordering key) may be delivered out of order. This lets you maintain per-entity ordering (all events for order #12345 arrive in sequence) without sacrificing throughput across entities.
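A toy sketch of the semantics: interleaved publishes stay in order within each ordering key, while order across keys is unspecified. This simulates the guarantee locally -- it is not the Pub/Sub client API, which requires enabling message ordering on the publisher and passing an `ordering_key` per message:

```python
from collections import defaultdict

# Local simulation of ordering-key semantics: delivery order is
# guaranteed only among messages that share a key.
published = [
    ("order-12345", "OrderPlaced"),
    ("order-67890", "OrderPlaced"),
    ("order-12345", "PaymentProcessed"),
    ("order-67890", "OrderShipped"),
    ("order-12345", "OrderShipped"),
]

delivered = defaultdict(list)
for key, event in published:
    delivered[key].append(event)  # per-key sequence is preserved

print(delivered["order-12345"])
# ['OrderPlaced', 'PaymentProcessed', 'OrderShipped']
```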
Choreography vs. Orchestration
This is the most important architectural decision in event-driven systems, and getting it wrong creates systems that are either too rigid or too chaotic.
Choreography: Decentralized Coordination
In choreography, each service reacts to events independently. There is no central coordinator. The "order placed" event triggers the payment service, which emits "payment processed," which triggers the fulfillment service, which emits "order shipped," which triggers the notification service. Each service knows only about the events it consumes and produces.
Choreography works well for loosely coupled workflows with a small number of steps (3 to 5). It breaks down when workflows get complex, because the implicit flow is hard to understand, debug, and modify. When an order fails halfway through a 10-step choreographed workflow, figuring out which step failed and what compensating actions to take requires reading logs from 10 different services. There is no single place that shows the current state of the workflow.
Orchestration: Centralized Coordination
In orchestration, a central coordinator (a state machine or workflow engine) manages the process. AWS Step Functions, Azure Durable Functions, and GCP Workflows are the cloud-native orchestration tools. The orchestrator calls each service in sequence (or parallel), handles retries, manages state, and makes decisions based on service responses.
Orchestration is better for complex workflows with branching logic, error handling, and compensation steps. The workflow is visible in one place -- you can see the current state of every active execution, which step failed, and what happened before and after. The tradeoff is tighter coupling: the orchestrator knows about all participating services, and changes to the workflow require updating the orchestrator definition.
The pragmatic approach
Use choreography for simple, independent reactions to events (sending notifications, updating caches, publishing analytics). Use orchestration for business-critical workflows with multiple steps, error handling, and rollback requirements (order processing, user onboarding, data pipelines). Most production systems use both patterns for different parts of the application.
Real Pattern: Order Processing
Order processing is the canonical example of event-driven architecture. Here is how it typically works in production, using a hybrid choreography/orchestration approach.
The order service receives a new order and publishes an "OrderPlaced" event to EventBridge (or Event Grid, or Pub/Sub). This is the entry point. An orchestrator (Step Functions, Durable Functions, or Workflows) subscribes to this event and manages the core order workflow:
- Validate the order (check inventory, validate addresses, verify customer status)
- Process payment (call payment service, handle authorization and capture)
- Reserve inventory (decrement stock, handle out-of-stock scenarios)
- Create shipment (generate shipping label, schedule pickup)
- Update order status to "confirmed"
If any step fails, the orchestrator executes compensation steps: refund payment, release inventory reservation, cancel shipment. This is the saga pattern, and it is dramatically easier to implement with an orchestrator than with pure choreography.
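The compensation logic is easy to sketch. Below is a minimal saga runner -- the step names follow the workflow above, but the step and compensation bodies are stand-ins, not a Step Functions definition:

```python
# Minimal saga sketch: run steps in order; on the first failure, run
# the compensations for already-completed steps in reverse order.

log = []

def process_payment(order):   log.append("charge")
def reserve_inventory(order): log.append("reserve")
def create_shipment(order):   raise RuntimeError("no carrier available")

def refund_payment(order):    log.append("refund")
def release_inventory(order): log.append("release")
def cancel_shipment(order):   log.append("cancel-shipment")

steps = [
    (process_payment, refund_payment),
    (reserve_inventory, release_inventory),
    (create_shipment, cancel_shipment),
]

def run_saga(steps, order):
    completed = []
    for action, compensate in steps:
        try:
            action(order)
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):  # compensate in reverse
                undo(order)
            order["status"] = "failed"
            return order
    order["status"] = "confirmed"
    return order

result = run_saga(steps, {"id": "o-1"})
print(result["status"], log)  # failed ['charge', 'reserve', 'release', 'refund']
```

Note that only completed steps are compensated: the failed shipment step never ran to completion, so `cancel_shipment` is never called.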
Meanwhile, choreographed side effects run in parallel. The "OrderPlaced" event also triggers: an analytics service that updates real-time dashboards, a notification service that sends the order confirmation email, a fraud detection service that scores the transaction, and a recommendation engine that updates the customer's purchase history. These are fire-and-forget operations -- if the analytics update fails, the order still processes successfully.
Real Pattern: Real-Time Analytics Pipeline
Event-driven architecture is ideal for real-time analytics because events naturally represent the data points you want to analyze. A typical pattern on AWS:
- Application events (page views, clicks, API calls) are published to Kinesis Data Streams
- A Kinesis Data Firehose delivery stream buffers events and writes them to S3 in Parquet format every 60 seconds
- EventBridge rules route specific high-value events (purchases, signups, errors) to Lambda functions for real-time processing
- Lambda functions update DynamoDB counters (real-time dashboards) and publish aggregated metrics to CloudWatch
- Athena queries the S3 data lake for ad-hoc historical analysis
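The real-time counter step in the pipeline above can be pictured as a tumbling-window aggregation. A toy sketch (the window size and event shapes are illustrative; in the pipeline above this would be a Lambda updating DynamoDB items):

```python
from collections import Counter

# Toy tumbling-window aggregation: bucket events into 60-second
# windows and count per event type, like the dashboard counters above.
events = [
    {"type": "purchase", "ts": 5},
    {"type": "page_view", "ts": 30},
    {"type": "purchase", "ts": 65},
]

windows = Counter()
for e in events:
    window_start = (e["ts"] // 60) * 60   # 0, 60, 120, ...
    windows[(window_start, e["type"])] += 1

print(windows[(0, "purchase")], windows[(60, "purchase")])  # 1 1
```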
On GCP, the equivalent pattern uses Pub/Sub for ingestion, Dataflow for stream processing, BigQuery for the data warehouse (with streaming inserts for near-real-time), and Looker or Looker Studio (formerly Data Studio) for dashboards. GCP's advantage here is that BigQuery can handle both the real-time streaming inserts and the historical analytical queries in a single system, whereas the AWS pattern requires Kinesis + S3 + Athena as separate components.
On Azure, Event Hubs handles high-throughput ingestion (millions of events per second), Stream Analytics processes events in real-time using SQL-like queries, and the results flow to Cosmos DB (for real-time serving), Azure Synapse (for analytical queries), and Power BI (for dashboards).
Failure Handling: The Hard Part
Event-driven systems are easy to build for the happy path and surprisingly hard to build for failure scenarios. Every event-driven system needs answers to these questions:
What Happens When a Consumer Fails?
Configure dead-letter queues (DLQs) on every queue and subscription. When a message fails processing after the configured retry count (typically 3 to 5 attempts with exponential backoff), it moves to the DLQ. Monitor DLQ depth as a critical metric -- a growing DLQ means events are being lost. Build tooling to inspect DLQ messages, fix the consumer bug, and replay the messages.
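The retry-then-DLQ flow looks roughly like this sketch (delays are kept tiny so the example runs fast; real queue services implement this for you via redrive policies):

```python
import time

# Sketch of retry-then-DLQ handling: retry with exponential backoff,
# then park the message in a dead-letter queue for inspection/replay.

def consume(message, handler, max_attempts=3, base_delay=0.01, dlq=None):
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception:
            if attempt == max_attempts:
                dlq.append(message)          # dead-letter after retries
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # backoff

dlq = []

def always_fails(msg):
    raise ValueError("poison message")

consume({"id": "evt-1"}, always_fails, dlq=dlq)
print(len(dlq))  # 1
```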
What About Duplicate Events?
At-least-once delivery means consumers may process the same event twice. Design every consumer to be idempotent. For database writes, use upserts or conditional writes (DynamoDB's ConditionExpression, SQL's INSERT ON CONFLICT). For API calls, include an idempotency key. For financial transactions, check whether the transaction ID has already been processed before executing it. Idempotency is not optional in event-driven systems -- it is a hard requirement.
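The transaction-ID check can be sketched in a few lines. In production the "seen" set would be a conditional write to a database rather than in-process memory, but the shape is the same:

```python
# Idempotent consumer sketch: a processed-ID check makes redelivery
# a no-op, so at-least-once delivery cannot double-charge.

balance = {"total": 0}
seen = set()

def apply_payment(event):
    if event["transactionId"] in seen:   # duplicate delivery
        return
    seen.add(event["transactionId"])
    balance["total"] += event["amount"]

event = {"transactionId": "tx-42", "amount": 100}
apply_payment(event)
apply_payment(event)  # at-least-once redelivery of the same event
print(balance["total"])  # 100, not 200
```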
What About Event Ordering?
Most event systems do not guarantee global ordering. SQS Standard, Pub/Sub (without ordering keys), and Event Grid deliver events in approximate order. If your business logic requires strict ordering (process payment before shipping, apply discount before calculating total), either use a FIFO queue/ordered subscription for that specific flow, or design your consumers to handle out-of-order events using version numbers or timestamps.
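The version-number approach can be sketched like this: an event is applied only if it is newer than the state already recorded, so a late-arriving older event is ignored rather than regressing the state:

```python
# Out-of-order handling via version numbers: apply an event only if
# its version is newer than the current state's version.

state = {"orderId": "o-1", "status": "placed", "version": 1}

def apply_event(state, event):
    if event["version"] <= state["version"]:
        return state                      # stale event: ignore
    return {**state, "status": event["status"], "version": event["version"]}

state = apply_event(state, {"status": "shipped", "version": 3})
state = apply_event(state, {"status": "paid", "version": 2})  # arrives late
print(state["status"], state["version"])  # shipped 3
```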
The poison message problem
A poison message is an event that consistently fails processing due to bad data, a bug, or an unhandled edge case. Without a DLQ, a poison message blocks the entire queue -- the consumer retries it forever, and all subsequent messages pile up behind it. Always configure DLQs with alerting. A single poison message in a high-throughput system can cause minutes of downstream delay before anyone notices.
Cost Comparison
For a system processing 10 million events per month with fan-out to 3 consumers:
- AWS (EventBridge + SQS): EventBridge at $10/month (10M events) + 3 SQS queues at approximately $4/month each (10M messages per queue) = approximately $22/month
- AWS (SNS + SQS): SNS at $5/month (10M publishes) + 3 SQS queues at approximately $4/month each = approximately $17/month
- Azure (Event Grid + Service Bus): Event Grid at approximately $6/month (10M operations) + Service Bus Basic at approximately $5/month = approximately $11/month
- GCP (Eventarc + Pub/Sub): Eventarc free + Pub/Sub at approximately $4/month (30M message deliveries across 3 subscriptions) = approximately $4/month
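The AWS figures above follow directly from the quoted per-unit rates. A quick check of the arithmetic (the ~$4 per SQS queue is this article's approximation, carried through as-is):

```python
# Reproducing the AWS estimates from the per-unit rates quoted above
# (10M events/month, fan-out to 3 consumers).
events_millions = 10
consumers = 3

eventbridge = 1.00 * events_millions      # $1.00 per million events
sns = 0.50 * events_millions              # $0.50 per million publishes
sqs_per_queue = 4.00                      # approximate monthly cost per queue

aws_eventbridge_total = eventbridge + consumers * sqs_per_queue
aws_sns_total = sns + consumers * sqs_per_queue
print(aws_eventbridge_total, aws_sns_total)  # 22.0 17.0
```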
GCP is the cheapest for messaging at moderate volumes. Azure is competitive. AWS is the most expensive but offers the most sophisticated routing and filtering capabilities. At these price points, the cost difference is unlikely to drive your provider choice -- pick the platform where your team has expertise.
Recommendations
After building event-driven systems across all three clouds, here is what I recommend.
Start simple. If you are building your first event-driven system, use SNS+SQS on AWS, Service Bus topics on Azure, or Pub/Sub on GCP. These are the foundational patterns, and you should be comfortable with them before adding EventBridge rules, Event Grid advanced filtering, or Eventarc triggers. The basics of publishing events, consuming from queues, handling failures with DLQs, and designing idempotent consumers are the same across all platforms.
Add EventBridge (or Event Grid, or Eventarc) when you need content-based routing, multi-target fan-out with filtering, or integration with SaaS event sources. These services add a routing layer on top of the messaging foundation, and they are most valuable when you have many consumer services that each care about different subsets of events.
Use orchestration for business-critical workflows with more than three steps. Step Functions Express Workflows handle high-throughput, short-duration workflows (up to 5 minutes) at $1.00 per million requests plus a small duration charge. Standard Workflows handle long-running workflows (up to one year) with exactly-once execution at $0.025 per 1,000 state transitions. The visibility and error handling capabilities of an orchestrator save more in debugging time than they cost in service fees.
Monitor everything. Track event publication rates, consumer lag (how far behind consumers are), DLQ depth, event processing latency, and failed delivery counts. In a synchronous system, failures are immediately visible as HTTP 500 errors. In an event-driven system, failures can be silent -- events pile up in DLQs, consumers fall behind, and nobody notices until a customer complains. Alerting on these metrics is not optional.
Start with events, evolve to event sourcing
Event-driven architecture and event sourcing are different things. Event-driven means services communicate via events. Event sourcing means the event log is the source of truth for application state. Start with event-driven communication between services. Consider event sourcing only for specific domains where the audit trail and temporal queries justify the additional complexity -- financial transactions, inventory management, and compliance-heavy systems.
Written by CloudToolStack Team
Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.
Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.