
Well-Architected Framework

Overview of the AWS Well-Architected Framework pillars and how to apply them.

CloudToolStack Team · 24 min read · Published Feb 22, 2026

Prerequisites

  • Basic AWS services knowledge
  • Understanding of cloud architecture concepts

What Is the Well-Architected Framework?

The AWS Well-Architected Framework is a set of best practices and design principles that help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Developed from AWS's experience reviewing thousands of customer architectures, it provides a consistent and repeatable approach to evaluating your systems against industry best practices and identifying areas for improvement.

The framework is organized into six pillars, each focusing on a different aspect of cloud architecture. Together, they provide a comprehensive methodology for designing, building, and operating workloads in the cloud. The pillars are not independent; decisions in one pillar affect outcomes in others. For example, choosing a highly available architecture (Reliability pillar) increases costs (Cost Optimization pillar) and adds operational complexity (Operational Excellence pillar). The Well-Architected Framework helps you make these trade-offs explicitly and intentionally.

AWS provides the Well-Architected Tool in the console to formally review your workloads against these pillars. The tool guides you through a series of questions, records your answers, identifies high-risk issues, and generates an improvement plan with prioritized actions.

Free Architecture Reviews

AWS offers free Well-Architected reviews through the Well-Architected Tool in the AWS Console. Additionally, AWS Partner Network (APN) partners certified as Well-Architected Partners can conduct reviews. Completing a review through an APN partner may qualify you for AWS credits (typically $300-$5,000 depending on the program). Reviews are non-binding and confidential. They are designed to help you improve, not to audit you.

Pillar 1: Operational Excellence

Operational Excellence focuses on running and monitoring systems to deliver business value and continually improving processes and procedures. The goal is not just keeping the lights on, but building a culture of continuous improvement where teams learn from operational events and incorporate those learnings into their processes.

This pillar is often the most overlooked because its benefits are difficult to quantify in advance. However, organizations that invest in operational excellence consistently recover from incidents faster, deploy more frequently with fewer failures, and maintain higher system availability over time.

Key Practices

  • Infrastructure as Code: Define all infrastructure using CloudFormation, CDK, or Terraform. Never make manual console changes to production environments. IaC enables version control, peer review, automated testing, and repeatable deployments.
  • Observability: Implement the three pillars of observability: metrics (CloudWatch), logs (CloudWatch Logs with structured JSON), and traces (X-Ray). These three data types together provide complete visibility into system behavior.
  • Runbooks and playbooks: Document procedures for common operational events and failure scenarios. Automate runbooks using Systems Manager Automation. A runbook that exists only in someone's head is useless when that person is unavailable.
  • Small, reversible changes: Deploy frequently in small increments. Use feature flags to decouple deployment from release. Implement canary deployments to test changes with a small percentage of traffic before full rollout.
  • Anticipate failure: Conduct regular game days and chaos engineering exercises. Use AWS Fault Injection Service to simulate AZ failures, instance termination, and API errors in a controlled environment.
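The observability bullet above calls for structured JSON logs rather than free text. A minimal Python sketch of a JSON log formatter; the field names (`order_id`, `latency_ms`) are illustrative, not a CloudWatch requirement:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so CloudWatch Logs Insights can
    filter on fields instead of regex-matching free text."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
            # Merge any structured fields attached via logging's `extra` kwarg
            **getattr(record, "extra_fields", {}),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Each log line is now a queryable JSON document
log.info("order created", extra={"extra_fields": {"order_id": "o-123", "latency_ms": 42}})
```

With logs in this shape, a Logs Insights query like `filter latency_ms > 100` works directly instead of parsing message strings.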
ssm-automation-runbook.yaml
# Systems Manager Automation runbook for incident response
description: Automated response to high CPU alarm
schemaVersion: '0.3'
assumeRole: '{{AutomationAssumeRole}}'
parameters:
  InstanceId:
    type: String
    description: EC2 instance with high CPU
  AutomationAssumeRole:
    type: String
    default: arn:aws:iam::123456789012:role/SSMAutomationRole
mainSteps:
  - name: captureMetrics
    action: aws:executeAwsApi
    inputs:
      Service: cloudwatch
      Api: GetMetricStatistics
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: '{{InstanceId}}'
      # GetMetricStatistics needs a non-empty window. Shown literally here
      # for brevity; in practice compute a StartTime offset (e.g., with an
      # aws:executeScript step) rather than passing the same timestamp twice.
      StartTime: '{{global:DATE_TIME}}'
      EndTime: '{{global:DATE_TIME}}'
      Period: 300
      Statistics:
        - Average
        - Maximum

  - name: describeInstance
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: DescribeInstances
      InstanceIds:
        - '{{InstanceId}}'
    outputs:
      - Name: VolumeId
        Selector: '$.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId'
        Type: String

  - name: createSnapshot
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: CreateSnapshot
      VolumeId: '{{describeInstance.VolumeId}}'
      Description: 'Diagnostic snapshot for high CPU incident'
  - name: notifyTeam
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: arn:aws:sns:us-east-1:123456789012:ops-alerts
      Message: 'High CPU detected on {{InstanceId}}. Diagnostic snapshot created.'

Game Days Build Confidence

Regularly conduct game days where you simulate failures in a controlled environment. Start with tabletop exercises where the team discusses “what would we do if...” scenarios. Then progress to live failure injection using AWS Fault Injection Service (FIS). Game days build team confidence, validate runbooks, verify monitoring coverage, and identify gaps in your recovery procedures before real incidents occur.
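Before injecting real faults with FIS, the same discipline can be practiced in-process. A toy sketch, assuming a made-up `flaky_dependency` downstream call: the "game day" forces a 100% failure rate and verifies the service degrades gracefully instead of erroring out:

```python
import random

class DependencyError(Exception):
    pass

def flaky_dependency(fail_rate=0.0):
    """Simulated downstream call; fail_rate lets a drill inject failures."""
    if random.random() < fail_rate:
        raise DependencyError("injected failure")
    return {"status": "ok"}

def handle_request(fail_rate=0.0):
    """Serve from the dependency, degrading to a cached default on failure."""
    try:
        return flaky_dependency(fail_rate)
    except DependencyError:
        return {"status": "degraded", "source": "fallback-cache"}

# Game-day drill: force total dependency failure, verify graceful degradation
assert handle_request(fail_rate=1.0)["status"] == "degraded"
assert handle_request(fail_rate=0.0)["status"] == "ok"
```

FIS does the same thing at the infrastructure layer (terminating instances, blackholing an AZ); the point of either exercise is proving the fallback path actually runs.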


Pillar 2: Security

The Security pillar covers protecting information, systems, and assets while delivering business value through risk assessment and mitigation strategies. AWS operates on the shared responsibility model: AWS secures the cloud infrastructure (hardware, software, networking, and facilities), and you secure everything you deploy on it (data, configuration, identity management, and application security).

Security is not a feature to add later. It must be embedded into every layer of your architecture from the beginning. The cost of retrofitting security is always higher than building it in from the start.

Security Design Principles

  • Implement a strong identity foundation: Centralize identity management with IAM Identity Center. Use IAM roles with least privilege for all workloads. Enforce MFA everywhere.
  • Enable traceability: Enable CloudTrail in all regions and accounts. Log everything, retain logs for your compliance period, and set up alerts for sensitive events.
  • Apply security at all layers: Defense in depth: VPC security groups, NACLs, WAF at the edge, encryption at rest and in transit, application-level authentication.
  • Automate security best practices: Use Security Hub for continuous compliance. Automate remediation with EventBridge and Lambda. Use Config rules for continuous monitoring.
  • Protect data in transit and at rest: Encrypt everything with KMS. Use TLS 1.2+ for all network communication. Classify data and apply controls appropriate to its sensitivity.
  • Prepare for security events: Pre-provision incident response IAM roles. Set up forensic investigation capabilities. Practice incident response procedures regularly.
security-baseline.sh
# Enable GuardDuty for threat detection
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Enable Security Hub with foundational standards
aws securityhub enable-security-hub \
  --enable-default-standards

# Enable AWS Config for continuous compliance monitoring
aws configservice put-configuration-recorder \
  --configuration-recorder '{
    "name": "default",
    "roleARN": "arn:aws:iam::123456789012:role/aws-config-role",
    "recordingGroup": {
      "allSupported": true,
      "includeGlobalResourceTypes": true
    }
  }'

# Config also needs a delivery channel (an S3 bucket you own) before
# recording can start -- replace my-config-bucket with your logging bucket
aws configservice put-delivery-channel \
  --delivery-channel '{"name": "default", "s3BucketName": "my-config-bucket"}'
aws configservice start-configuration-recorder \
  --configuration-recorder-name default

# Enable IAM Access Analyzer
aws accessanalyzer create-analyzer \
  --analyzer-name account-analyzer \
  --type ACCOUNT

# Enable Macie for sensitive data discovery
aws macie2 enable-macie
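The "automate security best practices" principle above (Security Hub finding, EventBridge rule, Lambda remediation) can be sketched as a handler. A minimal, hedged example: the event shape is a trimmed Security Hub finding as delivered via EventBridge, and the optional `s3_client` parameter is an illustrative pattern for keeping the handler testable, not an AWS convention:

```python
def remediate_public_bucket(event, s3_client=None):
    """EventBridge -> Lambda remediation sketch: block public access on the
    S3 bucket named in a Security Hub finding. Pass a boto3 S3 client in
    production; with s3_client=None the handler only returns its plan."""
    finding = event["detail"]["findings"][0]
    bucket = finding["Resources"][0]["Id"].split(":::")[-1]
    if s3_client is not None:
        s3_client.put_public_access_block(
            Bucket=bucket,
            PublicAccessBlockConfiguration={
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        )
    return {"bucket": bucket, "remediation": "put_public_access_block"}

# Trimmed sample of a Security Hub finding event
sample_event = {
    "detail": {"findings": [{"Resources": [{"Id": "arn:aws:s3:::my-public-bucket"}]}]}
}
print(remediate_public_bucket(sample_event))
# -> {'bucket': 'my-public-bucket', 'remediation': 'put_public_access_block'}
```

Wiring this up means an EventBridge rule matching Security Hub findings for the relevant control, with this Lambda as the target.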

Pillar 3: Reliability

Reliability ensures a workload performs its intended function correctly and consistently when it is expected to. This includes the ability to operate and test the workload through its total lifecycle, recover from infrastructure or service disruptions, and dynamically acquire computing resources to meet demand.

The key insight of the Reliability pillar is that failures are inevitable. You cannot prevent all failures, but you can design systems that tolerate them gracefully. The goal is not zero failures but rather fast detection, automatic recovery, and minimal impact on users.

Reliability Design Principles

  • Automatically recover from failure: Use health checks, auto-scaling, and automated recovery mechanisms to detect and recover from failures without human intervention.
  • Test recovery procedures: Regularly test your backup and recovery processes. An untested backup is not a backup.
  • Scale horizontally: Distribute workloads across multiple small resources rather than relying on a single large resource. This reduces the impact of any single failure.
  • Stop guessing capacity: Use auto-scaling to match capacity to demand. Over-provisioning wastes money; under-provisioning causes outages.
  • Manage change through automation: Use IaC and CI/CD pipelines for all infrastructure changes. Automated changes are repeatable, testable, and auditable.
reliability-auto-scaling.yaml
# Multi-AZ auto-scaling with health checks
AutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 10
    DesiredCapacity: 2
    VPCZoneIdentifier:
      - !Ref PrivateSubnetA
      - !Ref PrivateSubnetB
      - !Ref PrivateSubnetC
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref LaunchTemplate
      Version: !GetAtt LaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref TargetGroup
    # Use mixed instances for cost optimization and resilience
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandBaseCapacity: 2
        OnDemandPercentageAboveBaseCapacity: 30
        SpotAllocationStrategy: capacity-optimized
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref LaunchTemplate
          Version: !GetAtt LaunchTemplate.LatestVersionNumber
        Overrides:
          - InstanceType: m7g.large
          - InstanceType: m6g.large
          - InstanceType: m7i.large
          - InstanceType: c7g.large

ScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref AutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60
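Auto scaling recovers from capacity failures; at the application layer, "automatically recover from failure" usually means retrying transient errors. A sketch of capped exponential backoff with full jitter, the general pattern AWS SDKs apply to throttling and transient faults (parameter names here are illustrative):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0):
    """Call fn, retrying on any exception with capped exponential backoff
    and full jitter so concurrent retriers do not synchronize."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Simulated dependency that fails twice, then succeeds
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

assert retry_with_backoff(sometimes_fails) == "ok"
assert calls["n"] == 3  # two failures absorbed, third attempt succeeded
```

The jitter matters: without it, every client that failed at the same moment retries at the same moment, turning one blip into a thundering herd.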

Multi-AZ and Multi-Region Strategies

Strategy | RTO | RPO | Cost | Complexity
Multi-AZ (Active-Active) | Near zero | Near zero | Low-Medium | Low
Backup & Restore (cross-region) | Hours | Hours | Very Low | Low
Pilot Light (cross-region) | 10-30 minutes | Minutes | Low | Medium
Warm Standby (cross-region) | Minutes | Seconds | Medium | Medium-High
Multi-Region Active-Active | Near zero | Near zero | High (2x+) | Very High
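Choosing among the cross-region strategies in the table is mechanical once RTO/RPO targets are explicit. A sketch that encodes rough upper bounds (in seconds, my own approximations of the table's ranges) and picks the simplest strategy that meets the targets; Multi-AZ is omitted because it does not protect against a region-wide event:

```python
# (name, rto_seconds, rpo_seconds) -- ordered cheapest/simplest first;
# the numeric bounds are illustrative readings of the comparison table
STRATEGIES = [
    ("Backup & Restore", 8 * 3600, 8 * 3600),
    ("Pilot Light", 30 * 60, 10 * 60),
    ("Warm Standby", 5 * 60, 30),
    ("Multi-Region Active-Active", 1, 1),
]

def choose_dr_strategy(rto_target_s, rpo_target_s):
    """Return the first (cheapest) strategy whose RTO and RPO bounds
    both fit inside the workload's targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target_s and rpo <= rpo_target_s:
            return name
    return "Multi-Region Active-Active"

print(choose_dr_strategy(4 * 3600, 4 * 3600))  # -> Pilot Light
print(choose_dr_strategy(60, 60))              # -> Multi-Region Active-Active
```

The useful part is the forcing function: the business has to state RTO/RPO numbers before the architecture (and its cost) can be justified.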

Test Your DR Plan

A disaster recovery plan that has never been tested is just a hope. Schedule quarterly DR drills where you actually fail over to your secondary region. Track the time it takes to detect the failure, execute the failover, validate the secondary environment, and fail back. Many teams discover their DR plans have critical gaps only during real incidents. Do not let that be you.


Pillar 4: Performance Efficiency

Performance Efficiency focuses on using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes and technologies evolve. The cloud gives you access to advanced technologies (serverless compute, managed databases, global content delivery) that would be impractical to implement on-premises.

Performance Design Principles

  • Democratize advanced technologies: Use managed services (Aurora, ElastiCache, OpenSearch) instead of self-managing databases. Let AWS handle the undifferentiated heavy lifting.
  • Go global in minutes: Deploy in multiple regions with CloudFront, Global Accelerator, and Route 53 latency-based routing to reduce latency for global users.
  • Use serverless architectures: Evaluate Lambda, Fargate, Aurora Serverless, and other serverless services before provisioning dedicated infrastructure. Serverless removes capacity planning entirely.
  • Experiment more often: Use the cloud's pay-as-you-go model to test different instance types, database engines, and architectures without long-term commitment.
  • Consider mechanical sympathy: Understand how technology works and choose the right approach for your workload. Use the right tool for the job: DynamoDB for key-value, ElastiCache for caching, SQS for queuing.

Caching Strategy

Layer | Service | What to Cache | Typical TTL
Edge / CDN | CloudFront | Static assets, API responses | Minutes to days
Application | ElastiCache (Redis/Memcached) | Session data, computed results, API responses | Seconds to hours
Database | DAX (DynamoDB Accelerator) | Frequently read DynamoDB items | 5 minutes by default (item and query caches)
DNS | Route 53 Resolver cache | DNS query results | Based on record TTL
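All of the application-layer rows in the table follow the same cache-aside pattern: check the cache, fall through to the origin on a miss, store the result with a TTL. A minimal in-process sketch of that pattern (ElastiCache applies the same logic over the network):

```python
import time

class TTLCache:
    """Minimal cache-aside store: entries expire after ttl seconds."""
    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}  # key -> (value, stored_at)

    def get_or_load(self, key, loader):
        hit = self._data.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]                       # fresh hit
        value = loader(key)                     # miss or expired: go to origin
        self._data[key] = (value, time.monotonic())
        return value

# Track how often the (simulated) database is actually hit
loads = []
def load_from_db(key):
    loads.append(key)
    return f"row-for-{key}"

cache = TTLCache(ttl=60)
assert cache.get_or_load("user:1", load_from_db) == "row-for-user:1"
assert cache.get_or_load("user:1", load_from_db) == "row-for-user:1"
assert loads == ["user:1"]  # second call served from cache
```

Picking the TTL is the real design decision: it is the maximum staleness you are willing to serve in exchange for load taken off the origin.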

Pillar 5: Cost Optimization

Cost Optimization is about running systems to deliver business value at the lowest price point. This does not mean choosing the cheapest option for every resource. It means eliminating waste, selecting the right pricing models, right-sizing resources, and ensuring that every dollar of cloud spending delivers business value.

The cloud fundamentally changes the cost model from capital expenditure (CapEx) to operational expenditure (OpEx). This shift creates both opportunity (pay only for what you use) and risk (costs can spiral without governance). The Cost Optimization pillar provides a framework for managing this effectively.

Cost Optimization Practices

  • Implement cloud financial management: Establish a FinOps practice. Assign cost ownership to teams through tagging and separate accounts. Make costs visible and accountable.
  • Adopt a consumption model: Pay for what you use, not what you might need. Use auto-scaling, serverless, and on-demand pricing for variable workloads.
  • Measure overall efficiency: Track cost per business outcome (cost per transaction, cost per user, cost per GB processed) rather than just total spend.
  • Stop spending on undifferentiated heavy lifting: Use managed services instead of self-managing infrastructure. The operational cost of self-managed services often exceeds the managed service premium.
  • Analyze and attribute expenditure: Use Cost Explorer, AWS Budgets, and Cost and Usage Reports to understand where money goes and hold teams accountable.
cost-governance.sh
# Create budget with alerts at 80% and 100%
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "MonthlyTotal",
    "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}]
    },
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}]
    }
  ]'

# Enable Cost Anomaly Detection
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'
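The "measure overall efficiency" practice above, cost per business outcome, is a simple aggregation once spend is tagged. A sketch assuming hypothetical cost rows of the shape Cost and Usage Reports can be reduced to (`cost_usd` plus a `team` tag) and a per-team outcome count:

```python
def cost_per_outcome(cost_rows, transactions):
    """Attribute tagged spend to teams and divide by each team's business
    outcome (e.g., transactions served) to get unit cost."""
    spend = {}
    for row in cost_rows:
        team = row.get("tags", {}).get("team", "untagged")
        spend[team] = spend.get(team, 0.0) + row["cost_usd"]
    return {
        team: round(total / transactions.get(team, 1), 4)
        for team, total in spend.items()
    }

rows = [
    {"cost_usd": 120.0, "tags": {"team": "payments"}},
    {"cost_usd": 80.0, "tags": {"team": "payments"}},
    {"cost_usd": 50.0, "tags": {"team": "search"}},
    {"cost_usd": 10.0},  # untagged spend surfaces as its own bucket
]
print(cost_per_outcome(rows, {"payments": 100000, "search": 25000}))
# payments and search both land at $0.002 per transaction
```

Note the `untagged` bucket: surfacing unattributed spend explicitly is what makes a tagging policy enforceable.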

Cost Governance From Day One

Without governance, cloud costs will grow unchecked. Implement tagging strategies for cost allocation, set up AWS Budgets with automated alerts, and enable the Cost Anomaly Detection service before your first production workload goes live. The cost of retrofitting cost governance is much higher than building it in from the start, both in actual cloud spend and in organizational change management.


Pillar 6: Sustainability

The newest pillar, Sustainability, focuses on minimizing the environmental impact of running cloud workloads. As cloud adoption grows, so does the energy consumption of data centers. Amazon reports matching 100% of the electricity consumed across its operations with renewable energy as of 2023, but you can further reduce your workload's environmental footprint through efficient resource usage, right-sizing, and architectural decisions.

Sustainability Design Principles

  • Understand your impact: Use the AWS Customer Carbon Footprint Tool to track your carbon emissions. Establish a sustainability baseline and set reduction targets.
  • Establish sustainability goals: Define metrics for sustainability (carbon per transaction, energy per user) alongside traditional metrics.
  • Maximize utilization: Right-size instances, use auto-scaling to avoid idle resources, and use managed/serverless services that share infrastructure efficiently.
  • Adopt efficient technologies: Use Graviton (ARM) processors, which AWS states use up to 60% less energy than comparable instances for the same performance. Choose purpose-built databases over general-purpose ones.
  • Reduce downstream impact: Compress data, optimize API payloads, and use CDN caching to reduce network traffic and energy consumption.
graviton-task-definition.json
{
  "family": "web-app",
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app"
        }
      }
    }
  ]
}

Graviton for Sustainability and Cost

Graviton (ARM64) processors deliver up to 40% better price-performance than comparable x86 instances while consuming up to 60% less energy for the same workload. Migrating to Graviton is one of the highest-impact actions you can take for both sustainability and cost optimization. Most Linux-based applications run without changes on ARM64. Use multi-arch Docker images to support both architectures during migration.
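The "reduce downstream impact" principle can be checked with a few lines: compressing a typical repetitive JSON list payload before it crosses the network cuts bytes transferred substantially. A self-contained sketch using Python's stdlib gzip (the payload shape is invented for illustration):

```python
import gzip
import json

# A repetitive API payload, typical of list endpoints
payload = json.dumps(
    [{"id": i, "status": "active", "region": "us-east-1"} for i in range(500)]
).encode()

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} B -> {len(compressed)} B ({ratio:.0%} of original)")
assert len(compressed) < len(payload) / 2  # repetitive JSON compresses well
```

Fewer bytes on the wire means less energy in transit and at the edge, and it is usually a latency win as well; most HTTP clients negotiate gzip transparently via `Accept-Encoding`.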


Conducting a Well-Architected Review

A Well-Architected review is a structured conversation about your workload's architecture. It is not an audit or a compliance check; it is a collaborative process to identify risks and improvement opportunities. Here is how to conduct an effective review:

Before the Review

  • Identify the workload to review (start with your most critical production workload)
  • Gather architecture diagrams, deployment documentation, and operational runbooks
  • Identify the right stakeholders (architects, developers, operations, security)
  • Create the workload in the Well-Architected Tool

During the Review

  • Walk through each pillar's questions with the team
  • Answer honestly. The value comes from identifying real gaps, not from claiming everything is perfect
  • Focus on high-risk issues first (items the tool marks as HRI)
  • Document specific improvement actions, not just findings

After the Review

  • Prioritize improvements based on risk and effort
  • Create tickets for each improvement action
  • Schedule follow-up reviews (quarterly recommended)
  • Track the count of open high- and medium-risk issues and your improvement rate over time

Well-Architected Lenses

In addition to the six core pillars, AWS provides Well-Architected Lenses that extend the framework with domain-specific guidance. Lenses are reviewed alongside the core pillars and provide targeted best practices for specific technology domains or industry verticals.

Lens | Focus Area | Key Topics
Serverless Applications | Lambda, API Gateway, Step Functions | Cold starts, event-driven architecture, async patterns
SaaS | Multi-tenant architectures | Tenant isolation, noisy neighbor, metering
Machine Learning | ML workloads on AWS | Training pipelines, inference optimization, data management
Data Analytics | Data lakes and analytics | Data governance, query optimization, real-time analytics
Container Build | ECS, EKS, container CI/CD | Image management, orchestration, security scanning
Financial Services | Industry-specific compliance | Regulatory compliance, data residency, audit trails

Common Anti-Patterns Across Pillars

Certain architectural mistakes appear repeatedly across organizations. Recognizing these anti-patterns helps you avoid them proactively:

  • Manual changes to production: Console clicks instead of IaC lead to configuration drift, undocumented changes, and unreproducible environments.
  • Single point of failure: Single-AZ deployments, non-redundant databases, or architectural dependencies on a single service that has no failover.
  • Security as an afterthought: Deploying first and “hardening later” creates a security debt that grows exponentially. Build security in from day one.
  • Ignoring costs until the bill arrives: Not setting budgets, not tagging resources, and not monitoring spending until costs are already out of control.
  • Over-engineering for scale you do not have: Building multi-region active-active for a workload that serves 100 users. Right-size your architecture for your actual requirements.
  • Not testing recovery: Having backups but never testing restoration. Having a DR plan but never failing over. Having runbooks but never practicing them.

Key Takeaways

The Well-Architected Framework is not a one-time checklist but an ongoing practice. Conduct regular reviews as your workloads evolve; quarterly is ideal. Use the AWS Well-Architected Tool to track improvements over time. Start with the pillars most relevant to your current challenges but do not ignore any of them completely. Every architectural decision involves trade-offs between pillars, and the framework helps you make those trade-offs explicitly and intentionally. The greatest value comes not from achieving a perfect score but from the conversations the review process generates and the continuous improvement mindset it builds.


Key Takeaways

  1. The framework has six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.
  2. Use the Well-Architected Tool in the AWS Console for free workload reviews.
  3. Trade-offs between pillars are inevitable, so understand and document your decisions.
  4. Regular reviews catch architecture drift and incorporate new AWS best practices.
  5. Well-Architected Lenses provide domain-specific guidance for SaaS, serverless, and more.
  6. The framework is a conversation tool, not a compliance checklist.

Frequently Asked Questions

What are the six pillars of the AWS Well-Architected Framework?
The six pillars are: Operational Excellence (run and monitor systems), Security (protect data and systems), Reliability (recover from failures), Performance Efficiency (use resources efficiently), Cost Optimization (avoid unnecessary costs), and Sustainability (minimize environmental impact).
How do I perform a Well-Architected Review?
Use the AWS Well-Architected Tool in the Console. Create a workload, answer questions for each pillar, and receive a report of high-risk issues and improvement recommendations. Reviews should be done quarterly or before major releases.
Is the Well-Architected Framework mandatory?
No, it is a set of best practices and recommendations, not a compliance requirement. However, following it significantly improves your architecture quality and is often expected during AWS solution reviews and partner validations.
What are Well-Architected Lenses?
Lenses are extensions that add domain-specific questions and best practices. Available lenses include Serverless, SaaS, Machine Learning, Data Analytics, IoT, and more. Custom lenses can be created for organization-specific requirements.
How does the Well-Architected Framework differ from the AWS CAF?
The Well-Architected Framework focuses on technical architecture best practices for workloads. The Cloud Adoption Framework (CAF) is broader, covering organizational change, governance, people, and business perspectives for cloud migration.

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.