Dataproc Guide
Guide to Dataproc covering cluster creation, autoscaling, Spark job submission, Dataproc Serverless for batch processing, component gateway, BigQuery integration, and workflow templates.
Prerequisites
- Basic understanding of Apache Spark or Hadoop
- Python or Java/Scala programming experience
- GCP account with billing enabled
Introduction to Dataproc
Dataproc is Google Cloud's fully managed Apache Spark and Apache Hadoop service that lets you run big data workloads without managing cluster infrastructure. Dataproc clusters start in under 90 seconds, autoscale based on workload demand, and integrate natively with GCP services like Cloud Storage, BigQuery, Cloud Bigtable, and Pub/Sub. When your job is done, you can delete the cluster and stop paying immediately.
Dataproc supports the entire Hadoop ecosystem including Spark, Hive, Pig, Presto, Flink, and dozens of other open-source components. It also offers Dataproc Serverless for Spark, which eliminates cluster management entirely: you submit a Spark job and Dataproc provisions the exact resources needed, runs the job, and cleans up automatically. This makes Dataproc suitable for both persistent analytics clusters and ephemeral batch processing.
This guide covers creating and configuring clusters, submitting Spark jobs, enabling autoscaling, using Dataproc Serverless for batch workloads, accessing the component gateway for web UIs, and integrating with the GCP data ecosystem.
Dataproc Pricing
Dataproc charges a management fee of $0.010 per vCPU-hour on top of the Compute Engine VM costs. For example, four n2-standard-4 workers (16 vCPUs) add $0.16/hr in management fees on top of their VM cost. Using preemptible/Spot VMs for secondary workers reduces worker compute costs by 60-80%. Dataproc Serverless charges $0.06 per DCU-hour (Data Compute Unit) with no idle costs.
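The arithmetic above can be sketched as a quick estimator. The management fee is the published $0.010/vCPU/hr; the VM hourly rate below is an assumed, approximate on-demand n2-standard-4 price, so check current Compute Engine pricing before relying on it:

```python
# Rough Dataproc cluster cost estimator (illustrative only).
DATAPROC_FEE_PER_VCPU_HR = 0.010  # published Dataproc management fee
VM_RATE_PER_HR = 0.194            # assumed on-demand n2-standard-4 rate
VCPUS_PER_VM = 4                  # n2-standard-4 has 4 vCPUs

def cluster_hourly_cost(num_vms: int, spot_discount: float = 0.0) -> float:
    """Total $/hr: VM cost (optionally Spot-discounted) plus management fee.

    Note: the Dataproc management fee applies regardless of Spot discounts.
    """
    vm_cost = num_vms * VM_RATE_PER_HR * (1 - spot_discount)
    mgmt_fee = num_vms * VCPUS_PER_VM * DATAPROC_FEE_PER_VCPU_HR
    return vm_cost + mgmt_fee

# 1 master + 4 workers, all on-demand
print(round(cluster_hourly_cost(5), 2))  # → 1.17
# Same cluster with the 4 workers on Spot VMs at an assumed 70% discount
print(round(cluster_hourly_cost(1) + cluster_hourly_cost(4, spot_discount=0.7), 2))
```

The split call in the last line keeps the master on-demand, which mirrors the common pattern of reserving Spot capacity for workers only.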
Creating Clusters
A Dataproc cluster consists of a master node (or 3 for HA), worker nodes, and optional secondary (preemptible) workers. You configure the machine types, disk sizes, and optional components at creation time. For production clusters, always use 3 master nodes for high availability and enable autoscaling on workers.
# Enable the Dataproc API
gcloud services enable dataproc.googleapis.com
# Create a standard Spark cluster
gcloud dataproc clusters create my-spark-cluster \
--region=us-central1 \
--zone=us-central1-a \
--master-machine-type=n2-standard-4 \
--master-boot-disk-size=100GB \
--num-workers=4 \
--worker-machine-type=n2-standard-4 \
--worker-boot-disk-size=200GB \
--image-version=2.2-debian12 \
--optional-components=JUPYTER,ZEPPELIN \
--enable-component-gateway \
--properties="spark:spark.executor.memory=4g,spark:spark.driver.memory=2g"
# Create a high-availability cluster (3 masters)
gcloud dataproc clusters create my-ha-cluster \
--region=us-central1 \
--num-masters=3 \
--master-machine-type=n2-standard-4 \
--num-workers=4 \
--worker-machine-type=n2-standard-8 \
--num-secondary-workers=4 \
--secondary-worker-type=preemptible \
--image-version=2.2-debian12 \
--enable-component-gateway
# Create a single-node cluster for development
gcloud dataproc clusters create my-dev-cluster \
--region=us-central1 \
--single-node \
--master-machine-type=n2-standard-4 \
--image-version=2.2-debian12 \
--optional-components=JUPYTER \
--enable-component-gateway
# List clusters
gcloud dataproc clusters list --region=us-central1
# Describe a cluster
gcloud dataproc clusters describe my-spark-cluster \
--region=us-central1 \
--format="table(clusterName,status.state,config.masterConfig.numInstances,config.workerConfig.numInstances)"
Autoscaling
Dataproc autoscaling automatically adjusts the number of worker and secondary worker nodes based on YARN metrics (pending memory, available memory) and Spark metrics. This ensures your cluster has enough resources for demand spikes without overpaying during quiet periods. Autoscaling policies define the min/max nodes, cooldown periods, and scale-up/scale-down factors.
workerConfig:
  minInstances: 2
  maxInstances: 10
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 20
  weight: 1
basicAlgorithm:
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 3600s
  cooldownPeriod: 120s
# Create an autoscaling policy
gcloud dataproc autoscaling-policies import my-scaling-policy \
--source=autoscaling-policy.yaml \
--region=us-central1
# Create a cluster with autoscaling
gcloud dataproc clusters create my-autoscale-cluster \
--region=us-central1 \
--autoscaling-policy=my-scaling-policy \
--master-machine-type=n2-standard-4 \
--num-workers=2 \
--worker-machine-type=n2-standard-4 \
--image-version=2.2-debian12 \
--enable-component-gateway
# Attach autoscaling to an existing cluster
gcloud dataproc clusters update my-spark-cluster \
--region=us-central1 \
--autoscaling-policy=my-scaling-policy
# Monitor autoscaling decisions
gcloud logging read \
'resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="my-autoscale-cluster"
jsonPayload.message:"AutoscalerUpdate"' \
--limit=20
Autoscaling Best Practices
Set the graceful decommission timeout to at least the length of your longest Spark stage so in-flight work is not lost during scale-down. Use secondary (preemptible) workers for scale-out capacity and standard workers for the baseline. Set scaleDownMinWorkerFraction to 0.1-0.2 to prevent aggressive scale-down that causes repeated rescaling. Monitor the YARN ResourceManager UI through the component gateway to understand your cluster's resource utilization patterns.
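The guidance above can be encoded as a small sanity check over a policy dict. Field names follow the autoscaling policy YAML shown earlier; the thresholds are this section's rules of thumb, not Dataproc-enforced limits:

```python
def check_autoscaling_policy(policy: dict, longest_stage_secs: int) -> list:
    """Return rule-of-thumb warnings for an autoscaling policy dict."""
    warnings = []
    yarn = policy["basicAlgorithm"]["yarnConfig"]
    # A decommission timeout shorter than the longest stage risks losing work.
    timeout_secs = int(yarn["gracefulDecommissionTimeout"].rstrip("s"))
    if timeout_secs < longest_stage_secs:
        warnings.append("gracefulDecommissionTimeout shorter than longest stage")
    # A near-zero scale-down fraction permits aggressive, thrash-prone rescaling.
    if yarn["scaleDownMinWorkerFraction"] < 0.1:
        warnings.append("scaleDownMinWorkerFraction below 0.1 may cause thrashing")
    wc = policy["workerConfig"]
    if wc["minInstances"] > wc["maxInstances"]:
        warnings.append("workerConfig min exceeds max")
    return warnings

policy = {
    "workerConfig": {"minInstances": 2, "maxInstances": 10},
    "basicAlgorithm": {
        "yarnConfig": {
            "gracefulDecommissionTimeout": "3600s",
            "scaleDownMinWorkerFraction": 0.0,
        }
    },
}
# Longest stage runs 90 minutes, so the 1-hour timeout triggers a warning,
# as does the 0.0 scale-down fraction.
print(check_autoscaling_policy(policy, longest_stage_secs=5400))
```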
Submitting Spark Jobs
You can submit Spark jobs to Dataproc using the gcloud CLI, REST API, or client libraries. Dataproc supports Spark, PySpark, SparkR, Spark SQL, Hive, Pig, and custom Hadoop jobs.
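For the REST API path, the jobs.submit request body mirrors the gcloud flags shown in the examples that follow. A sketch of building that JSON payload; field names follow the Dataproc v1 REST API, and the bucket paths are placeholders:

```python
import json

def pyspark_submit_body(cluster, main_uri, args, properties=None):
    """Build the JSON body for POST .../regions/{region}/jobs:submit."""
    pyspark_job = {
        "mainPythonFileUri": main_uri,
        "args": args,
    }
    if properties:
        pyspark_job["properties"] = properties
    job = {
        "placement": {"clusterName": cluster},
        "pysparkJob": pyspark_job,
    }
    return json.dumps({"job": job}, indent=2)

body = pyspark_submit_body(
    "my-spark-cluster",
    "gs://my-bucket/scripts/word_count.py",
    ["gs://my-bucket/input/", "gs://my-bucket/output/"],
    properties={"spark.executor.memory": "4g"},
)
print(body)
```

The same structure (with snake_case field names) is what the google-cloud-dataproc client libraries accept.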
# Submit a PySpark job
gcloud dataproc jobs submit pyspark \
gs://my-bucket/scripts/word_count.py \
--cluster=my-spark-cluster \
--region=us-central1 \
-- gs://my-bucket/input/ gs://my-bucket/output/
# Submit a Spark job (JAR)
gcloud dataproc jobs submit spark \
--cluster=my-spark-cluster \
--region=us-central1 \
--class=com.example.MySparkApp \
--jars=gs://my-bucket/jars/my-app-1.0.jar \
--properties="spark.executor.instances=8,spark.executor.memory=4g" \
-- --input gs://my-bucket/data --output gs://my-bucket/results
# Submit a Spark SQL job
gcloud dataproc jobs submit spark-sql \
--cluster=my-spark-cluster \
--region=us-central1 \
--execute="SELECT COUNT(*) FROM parquet.\`gs://my-bucket/data/events/*.parquet\`"
# Submit a Hive job
gcloud dataproc jobs submit hive \
--cluster=my-spark-cluster \
--region=us-central1 \
--file=gs://my-bucket/scripts/etl.hql
# Check job status
gcloud dataproc jobs list --region=us-central1 \
--format="table(reference.jobId,status.state,placement.clusterName)"
# View job logs
gcloud dataproc jobs wait JOB_ID --region=us-central1
The word_count.py script referenced above:
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split


def main():
    if len(sys.argv) != 3:
        print("Usage: word_count.py <input_path> <output_path>")
        sys.exit(1)
    input_path = sys.argv[1]
    output_path = sys.argv[2]

    spark = SparkSession.builder \
        .appName("WordCount") \
        .getOrCreate()

    # Read text files from Cloud Storage
    text_df = spark.read.text(input_path)

    # Word count using the DataFrame API
    word_counts = (
        text_df
        .select(explode(split(lower(col("value")), "\\W+")).alias("word"))
        .filter(col("word") != "")
        .groupBy("word")
        .count()
        .orderBy(col("count").desc())
    )

    # Write results to Cloud Storage as Parquet
    word_counts.write.mode("overwrite").parquet(output_path)
    print("Top 10 words:")
    word_counts.show(10)
    spark.stop()


if __name__ == "__main__":
    main()
Dataproc Serverless for Spark
Dataproc Serverless eliminates cluster management entirely. You submit a Spark batch job and Dataproc automatically provisions the exact resources needed, runs the job, and cleans up. There are no clusters to create, scale, or delete. You pay only for the resources consumed during job execution, measured in Data Compute Units (DCUs).
# Submit a serverless PySpark batch job
gcloud dataproc batches submit pyspark \
gs://my-bucket/scripts/etl_pipeline.py \
--region=us-central1 \
--subnet=default \
--deps-bucket=gs://my-bucket/deps \
--py-files=gs://my-bucket/libs/utils.py \
--properties="spark.executor.instances=10,spark.dynamicAllocation.enabled=true" \
-- --date=2026-03-14 --output=gs://my-bucket/output
# Submit a serverless Spark SQL batch (batches take a query file, not an inline query)
# daily_event_counts.sql contains, e.g.:
# SELECT date, COUNT(*) AS events FROM parquet.`gs://my-bucket/data/*.parquet` GROUP BY date
gcloud dataproc batches submit spark-sql \
gs://my-bucket/scripts/daily_event_counts.sql \
--region=us-central1 \
--subnet=default
# Submit with custom container image
gcloud dataproc batches submit pyspark \
gs://my-bucket/scripts/ml_training.py \
--region=us-central1 \
--subnet=default \
--container-image=us-central1-docker.pkg.dev/MY_PROJECT/docker-repo/spark-ml:latest \
--properties="spark.executor.memory=8g,spark.executor.cores=4"
# List batch jobs
gcloud dataproc batches list --region=us-central1
# Describe a batch job
gcloud dataproc batches describe BATCH_ID --region=us-central1
# Cancel a running batch
gcloud dataproc batches cancel BATCH_ID --region=us-central1
Serverless vs Managed Clusters
Use Dataproc Serverless for ad-hoc queries, scheduled ETL jobs, and workloads with variable demand. Use managed clusters for interactive analysis (Jupyter notebooks), long-running streaming jobs, workloads requiring specific Hadoop ecosystem components (Hive Metastore, HBase), or when you need fine-grained control over cluster configuration. Many organizations use both: serverless for batch ETL and managed clusters for interactive exploration.
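The decision rules above can be condensed into a mnemonic helper; this is purely illustrative, not an official heuristic:

```python
def prefer_serverless(interactive: bool, streaming: bool,
                      needs_cluster_components: bool) -> bool:
    """Rule of thumb from this section: pick Dataproc Serverless unless the
    workload is interactive, long-running streaming, or tied to cluster-local
    components such as a Hive Metastore or HBase."""
    return not (interactive or streaming or needs_cluster_components)

print(prefer_serverless(False, False, False))  # scheduled batch ETL → True
print(prefer_serverless(True, False, False))   # Jupyter exploration → False
```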
Component Gateway and Web UIs
The component gateway provides secure, IAM-authenticated access to web UIs for cluster components like the Spark History Server, YARN Resource Manager, Jupyter, and Zeppelin. This eliminates the need to set up SSH tunnels or open firewall ports to access these interfaces.
# Enable component gateway when creating a cluster
gcloud dataproc clusters create my-cluster \
--region=us-central1 \
--enable-component-gateway \
--optional-components=JUPYTER,ZEPPELIN,HIVE_WEBHCAT
# Access web UIs through the gateway
# URLs are available in the cluster description:
gcloud dataproc clusters describe my-cluster \
--region=us-central1 \
--format="yaml(config.endpointConfig.httpPorts)"
# Typical endpoints:
# YARN ResourceManager: https://GATEWAY_URL/yarn/
# Spark History Server: https://GATEWAY_URL/sparkhistory/
# Jupyter: https://GATEWAY_URL/jupyter/
# Zeppelin: https://GATEWAY_URL/zeppelin/
# HDFS NameNode: https://GATEWAY_URL/hdfs/
# Install additional components via initialization actions
gcloud dataproc clusters create my-cluster \
--region=us-central1 \
--initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/conda/bootstrap-conda.sh \
--metadata="CONDA_PACKAGES=pandas scikit-learn matplotlib" \
--enable-component-gateway
Integration with BigQuery and Cloud Storage
Dataproc reads and writes BigQuery tables through the Spark BigQuery connector, and accesses Cloud Storage directly with gs:// paths via the built-in Cloud Storage connector.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("BigQuery Integration") \
.config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
.getOrCreate()
# Read from BigQuery
df = spark.read.format("bigquery") \
.option("table", "bigquery-public-data.samples.shakespeare") \
.load()
# Process with Spark
word_counts = df.groupBy("word").sum("word_count") \
.withColumnRenamed("sum(word_count)", "total")
# Write results back to BigQuery
word_counts.write.format("bigquery") \
.option("table", "MY_PROJECT.my_dataset.word_counts") \
.option("temporaryGcsBucket", "my-temp-bucket") \
.mode("overwrite") \
.save()
# Read Parquet from Cloud Storage (use gs:// paths directly)
events = spark.read.parquet("gs://my-data-lake/events/year=2026/month=03/")
# Write to Cloud Storage
events.write.mode("overwrite") \
.partitionBy("event_type") \
.parquet("gs://my-data-lake/processed/")
Workflow Templates
Workflow templates define a reusable sequence of jobs that run on a cluster. The cluster can be pre-existing or created on demand by the workflow. This is ideal for scheduled ETL pipelines that spin up a cluster, run multiple jobs in sequence, and tear down the cluster when done.
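Step ordering in a template is a directed acyclic graph defined by the start-after edges. A small sketch of how execution order falls out of those dependencies, using the extract/transform/load steps from this section and Python's standard-library graphlib:

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it must start after
# (the --start-after flags in the workflow template).
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Dataproc runs steps in a dependency-respecting order; independent
# steps (none here) could run concurrently.
order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['extract', 'transform', 'load']
```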
# Create a workflow template
gcloud dataproc workflow-templates create my-etl-workflow \
--region=us-central1
# Set managed cluster configuration (ephemeral cluster)
gcloud dataproc workflow-templates set-managed-cluster my-etl-workflow \
--region=us-central1 \
--master-machine-type=n2-standard-4 \
--num-workers=4 \
--worker-machine-type=n2-standard-4 \
--image-version=2.2-debian12
# Add jobs to the template
gcloud dataproc workflow-templates add-job pyspark \
gs://my-bucket/scripts/extract.py \
--workflow-template=my-etl-workflow \
--region=us-central1 \
--step-id=extract \
-- --date=2026-03-14
gcloud dataproc workflow-templates add-job pyspark \
gs://my-bucket/scripts/transform.py \
--workflow-template=my-etl-workflow \
--region=us-central1 \
--step-id=transform \
--start-after=extract \
-- --date=2026-03-14
gcloud dataproc workflow-templates add-job pyspark \
gs://my-bucket/scripts/load.py \
--workflow-template=my-etl-workflow \
--region=us-central1 \
--step-id=load \
--start-after=transform \
-- --date=2026-03-14
# Instantiate (run) the workflow
gcloud dataproc workflow-templates instantiate my-etl-workflow \
--region=us-central1
Cleanup
# Delete clusters
gcloud dataproc clusters delete my-spark-cluster --region=us-central1 --quiet
gcloud dataproc clusters delete my-dev-cluster --region=us-central1 --quiet
# Delete autoscaling policies
gcloud dataproc autoscaling-policies delete my-scaling-policy \
--region=us-central1 --quiet
# Delete workflow templates
gcloud dataproc workflow-templates delete my-etl-workflow \
--region=us-central1 --quiet
# Cancel and delete batch jobs
gcloud dataproc batches list --region=us-central1 \
--filter="state=RUNNING" \
--format="value(name)" | while read batch; do
gcloud dataproc batches cancel "${batch}" --quiet
done
Key Takeaways
- Dataproc clusters start in under 90 seconds and can be deleted immediately after jobs complete.
- Autoscaling adjusts worker count based on YARN metrics, scaling from 2 to hundreds of nodes automatically.
- Dataproc Serverless eliminates cluster management entirely for Spark batch jobs with no idle costs.
- Preemptible/Spot secondary workers reduce compute costs by 60-80% for fault-tolerant workloads.
- Component gateway provides secure, IAM-authenticated access to Spark UI, YARN, Jupyter, and Zeppelin.
- Workflow templates define reusable ETL pipelines that spin up ephemeral clusters and run multi-step jobs.
Written by CloudToolStack Team
Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.
Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.