Compare data lake architectures and services across AWS, Azure, GCP, and OCI.
| Feature | Category | AWS | Azure | GCP | OCI |
|---|---|---|---|---|---|
| Data Lake Service | Core Platform | AWS Lake Formation + S3 | Azure Data Lake Storage Gen2 (ADLS Gen2) | BigQuery + Cloud Storage (BigLake) | OCI Data Lake (Lakehouse) + Object Storage |
| Underlying Storage | Core Platform | Amazon S3 with unlimited scalability | Azure Blob Storage with hierarchical namespace (HNS) | Cloud Storage (GCS) buckets with BigLake tables | OCI Object Storage with tiered storage classes |
| Data Catalog | Core Platform | AWS Glue Data Catalog (Hive-compatible metastore) | Microsoft Purview (formerly Azure Purview) | Dataplex with Data Catalog for metadata management | OCI Data Catalog with business glossary and harvesting |
| Pricing Model | Core Platform | S3 storage + Lake Formation (no extra charge) + query costs | ADLS Gen2 storage + analytics service costs | Cloud Storage + BigQuery analysis pricing | Object Storage + Data Flow / Autonomous Data Warehouse pricing |
| Open Table Format Support | Core Platform | Apache Iceberg (native), Hudi, Delta Lake via Glue/EMR | Delta Lake (native via Databricks/Synapse), Iceberg preview | Apache Iceberg (BigLake native), Hudi, Delta via Dataproc | Apache Iceberg and Delta Lake via Data Flow (Spark) |
| File Formats | Storage & Formats | Parquet, ORC, Avro, JSON, CSV; Glue crawlers auto-detect | Parquet, ORC, Avro, JSON, CSV, Delta; auto-schema detection | Parquet, ORC, Avro, JSON, CSV; BigQuery native format | Parquet, ORC, Avro, JSON, CSV via Data Flow Spark |
| Storage Tiering | Storage & Formats | S3 Standard, IA, Glacier, Deep Archive with lifecycle rules | Hot, Cool, Cold, Archive tiers with lifecycle management | Standard, Nearline, Coldline, Archive with lifecycle rules | Standard, Infrequent Access, Archive with lifecycle policies |
| Data Partitioning | Storage & Formats | Hive-style partitioning; Glue partition indexes | Hive-style partitions; Synapse partition pruning | BigQuery native partitioning + clustering; Hive on GCS | Hive-style partitioning via Data Flow Spark |
| Compression | Storage & Formats | Snappy, GZIP, LZO, ZSTD; automatic with Glue ETL | Snappy, GZIP, LZ4, ZSTD; configurable in pipelines | Snappy, GZIP, ZSTD; automatic BigQuery compression | Snappy, GZIP, LZ4, ZSTD via Spark configuration |
| Max Object Size | Storage & Formats | 5 TB per S3 object (multipart upload) | ~190.7 TiB per block blob (~4.75 TiB with older service versions) | 5 TiB per GCS object (including composite objects) | 10 TB per object (multipart upload) |
| Serverless Query Engine | Processing & Analytics | Athena (Presto/Trino) for ad-hoc SQL queries on S3 | Synapse Serverless SQL for ad-hoc queries on ADLS | BigQuery serverless SQL; BigLake for external data | Autonomous Data Warehouse for serverless SQL queries |
| Spark Processing | Processing & Analytics | EMR (managed Hadoop/Spark), Glue Spark ETL jobs | Synapse Spark pools, Azure Databricks, HDInsight | Dataproc (managed Spark/Hadoop), Dataproc Serverless | OCI Data Flow (fully managed Apache Spark) |
| Streaming Ingestion | Processing & Analytics | Amazon Data Firehose (formerly Kinesis Data Firehose) to S3; Kinesis Data Streams + Lambda | Event Hubs Capture to ADLS; Stream Analytics | Pub/Sub to BigQuery/GCS; Dataflow streaming | OCI Streaming (Kafka-compatible) to Object Storage |
| ETL / ELT | Processing & Analytics | Glue ETL (Spark), Glue DataBrew (visual), Step Functions | Data Factory, Synapse Pipelines, Mapping Data Flows | Dataflow (Apache Beam), Dataproc, BigQuery SQL transforms | Data Integration, Data Flow, GoldenGate for replication |
| ML Integration | Processing & Analytics | SageMaker reads directly from S3 data lake | Azure ML reads from ADLS Gen2; Synapse ML integration | Vertex AI reads from BigQuery and GCS; BigQuery ML | OCI Data Science reads from Object Storage and ADW |
| Fine-Grained Access Control | Governance & Security | Lake Formation column/row/cell-level permissions | Purview access policies; Synapse column-level security | BigLake column/row-level security; Dataplex policies | Data Catalog policies; ADW column masking and privileges |
| Data Lineage | Governance & Security | Lake Formation data lineage (preview); Glue lineage | Microsoft Purview data lineage across services | Dataplex lineage; Data Catalog lineage integration | Data Catalog lineage via harvesting and custom metadata |
| Data Quality | Governance & Security | Glue Data Quality rules and recommendations | Purview data quality (preview); Azure Data Factory DQ | Dataplex Data Quality auto-profiling and rules | Data Integration data quality tasks; custom validation |
| Encryption | Governance & Security | SSE-S3, SSE-KMS, SSE-C; client-side encryption option | Microsoft-managed or customer-managed keys via Key Vault | Google-managed or CMEK via Cloud KMS | Oracle-managed or customer-managed keys via OCI Vault |
| Audit & Compliance | Governance & Security | CloudTrail + S3 access logs; Lake Formation audit logging | Azure Monitor diagnostic logs; Purview audit trail | Cloud Audit Logs; BigQuery audit logs; VPC Service Controls | OCI Audit service; Object Storage access logs |
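
The Data Partitioning row lists Hive-style partitioning for all four clouds. As a minimal sketch of what that layout means in practice, the snippet below builds and parses object keys of the form `prefix/key=value/.../file`. The helper names (`hive_partition_path`, `parse_hive_partitions`) and the `sales/orders` prefix are hypothetical and chosen only for illustration; they are not part of any cloud SDK.

```python
from datetime import date


def hive_partition_path(prefix: str, partitions: dict, filename: str) -> str:
    """Build a Hive-style object key: prefix/key1=value1/key2=value2/filename."""
    segments = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([prefix.strip("/"), *segments, filename])


def parse_hive_partitions(object_key: str) -> dict:
    """Recover partition columns from the key=value segments of an object key."""
    found = {}
    for segment in object_key.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            found[key] = value
    return found


# Hypothetical dataset prefix and file name, used only for this example.
day = date(2024, 3, 15)
key = hive_partition_path(
    "sales/orders",
    {"year": day.year, "month": f"{day.month:02d}", "day": f"{day.day:02d}"},
    "part-0000.parquet",
)
print(key)  # sales/orders/year=2024/month=03/day=15/part-0000.parquet
print(parse_hive_partitions(key))
```

Engines that understand this layout can skip whole prefixes when a query filters on `year`, `month`, or `day`, which is the partition pruning the table refers to.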
Cloud data lake architectures build on object storage (Amazon S3, Azure Blob Storage/ADLS Gen2, Cloud Storage, OCI Object Storage) combined with analytics engines, metadata catalogs, and governance layers. Each cloud takes a different approach: AWS uses Lake Formation for centralized governance, Azure pairs Synapse with ADLS Gen2 for integrated analytics, GCP uses BigLake for unified storage access, and OCI offers Data Lakehouse for combined lake and warehouse workloads. This comparison covers storage foundations, governance models, query engines, table formats (Iceberg, Delta Lake, Hudi), and cost structures across all four clouds.
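
Because every lake here sits on an object store, the same dataset path is usually addressed through a different URI scheme on each cloud. The sketch below shows the forms commonly used by Spark and Hadoop connectors; it is an assumption-laden illustration, not something stated in the comparison itself. The function name `lake_uri` and the bucket, account, and namespace values are hypothetical. Open-source Spark typically uses `s3a://` rather than `s3://`, so adjust the AWS case to your engine.

```python
def lake_uri(cloud: str, bucket: str, path: str, account: str = "") -> str:
    """Return an object-store URI for a dataset path on the given cloud.

    `account` is the storage account name on Azure and the Object Storage
    namespace on OCI; it is ignored for AWS and GCP.
    """
    path = path.lstrip("/")
    if cloud == "aws":
        return f"s3://{bucket}/{path}"
    if cloud == "azure":
        # ADLS Gen2 form: abfss://<container>@<account>.dfs.core.windows.net/<path>
        return f"abfss://{bucket}@{account}.dfs.core.windows.net/{path}"
    if cloud == "gcp":
        return f"gs://{bucket}/{path}"
    if cloud == "oci":
        # OCI HDFS connector form: oci://<bucket>@<namespace>/<path>
        return f"oci://{bucket}@{account}/{path}"
    raise ValueError(f"unknown cloud: {cloud!r}")


# Hypothetical bucket, account, and namespace names.
for cloud, account in [("aws", ""), ("azure", "mylakeacct"), ("gcp", ""), ("oci", "mytenancy")]:
    print(lake_uri(cloud, "analytics", "sales/orders/", account))
```

Keeping the scheme behind one small function like this makes it easier to point the same Spark job at a different cloud without touching the rest of the pipeline.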