
Dataproc Guide

Guide to Dataproc covering cluster creation, autoscaling, Spark job submission, Dataproc Serverless for batch processing, component gateway, BigQuery integration, and workflow templates.

CloudToolStack Team · 22 min read · Published Mar 14, 2026

Prerequisites

  • Basic understanding of Apache Spark or Hadoop
  • Python or Java/Scala programming experience
  • GCP account with billing enabled

Introduction to Dataproc

Dataproc is Google Cloud's fully managed Apache Spark and Apache Hadoop service that lets you run big data workloads without managing cluster infrastructure. Dataproc clusters start in under 90 seconds, autoscale based on workload demand, and integrate natively with GCP services like Cloud Storage, BigQuery, Cloud Bigtable, and Pub/Sub. When your job is done, you can delete the cluster and stop paying immediately.

Dataproc supports the entire Hadoop ecosystem including Spark, Hive, Pig, Presto, Flink, and dozens of other open-source components. It also offers Dataproc Serverless for Spark, which eliminates cluster management entirely: you submit a Spark job and Dataproc provisions the exact resources needed, runs the job, and cleans up automatically. This makes Dataproc suitable for both persistent analytics clusters and ephemeral batch processing.

This guide covers creating and configuring clusters, submitting Spark jobs, enabling autoscaling, using Dataproc Serverless for batch workloads, accessing the component gateway for web UIs, and integrating with the GCP data ecosystem.

Dataproc Pricing

Dataproc charges a management fee of $0.010 per vCPU per hour on top of the Compute Engine VM costs, applied to every vCPU in the cluster. A cluster with one master and four n2-standard-4 workers (20 vCPUs total) costs roughly $1.17/hr in us-central1 (about $0.97 for the five VMs + $0.20 Dataproc fee); exact VM rates vary by region and change over time. Using preemptible/Spot VMs for workers reduces their compute cost by 60-80%. Dataproc Serverless charges $0.06 per DCU-hour (Data Compute Unit) with no idle costs.
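The arithmetic is easy to sanity-check yourself. A rough estimator, using the $0.010/vCPU/hr management fee from above; the per-VM rate here is an assumed on-demand figure, so check current Compute Engine pricing for your region:

```python
# Rough hourly cost estimator for a Dataproc cluster of n2-standard-4 nodes.
# The management fee is the documented $0.010/vCPU/hr; the VM rate is an
# assumption for illustration, not an official price.
DATAPROC_FEE_PER_VCPU_HR = 0.010
N2_STANDARD_4_VCPUS = 4
N2_STANDARD_4_PRICE_HR = 0.194  # assumed on-demand rate, us-central1

def cluster_hourly_cost(num_masters: int, num_workers: int) -> dict:
    nodes = num_masters + num_workers
    vcpus = nodes * N2_STANDARD_4_VCPUS
    vm_cost = nodes * N2_STANDARD_4_PRICE_HR
    mgmt_fee = vcpus * DATAPROC_FEE_PER_VCPU_HR  # fee applies to every vCPU
    return {
        "vm": round(vm_cost, 2),
        "management": round(mgmt_fee, 2),
        "total": round(vm_cost + mgmt_fee, 2),
    }

print(cluster_hourly_cost(num_masters=1, num_workers=4))
```

The same function makes it easy to compare machine-type or worker-count choices before creating anything.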

Creating Clusters

A Dataproc cluster consists of a master node (or 3 for HA), worker nodes, and optional secondary (preemptible) workers. You configure the machine types, disk sizes, and optional components at creation time. For production clusters, always use 3 master nodes for high availability and enable autoscaling on workers.

bash
# Enable the Dataproc API
gcloud services enable dataproc.googleapis.com

# Create a standard Spark cluster
gcloud dataproc clusters create my-spark-cluster \
  --region=us-central1 \
  --zone=us-central1-a \
  --master-machine-type=n2-standard-4 \
  --master-boot-disk-size=100GB \
  --num-workers=4 \
  --worker-machine-type=n2-standard-4 \
  --worker-boot-disk-size=200GB \
  --image-version=2.2-debian12 \
  --optional-components=JUPYTER,ZEPPELIN \
  --enable-component-gateway \
  --properties="spark:spark.executor.memory=4g,spark:spark.driver.memory=2g"

# Create a high-availability cluster (3 masters)
gcloud dataproc clusters create my-ha-cluster \
  --region=us-central1 \
  --num-masters=3 \
  --master-machine-type=n2-standard-4 \
  --num-workers=4 \
  --worker-machine-type=n2-standard-8 \
  --num-secondary-workers=4 \
  --secondary-worker-type=preemptible \
  --image-version=2.2-debian12 \
  --enable-component-gateway

# Create a single-node cluster for development
gcloud dataproc clusters create my-dev-cluster \
  --region=us-central1 \
  --single-node \
  --master-machine-type=n2-standard-4 \
  --image-version=2.2-debian12 \
  --optional-components=JUPYTER \
  --enable-component-gateway

# List clusters
gcloud dataproc clusters list --region=us-central1

# Describe a cluster
gcloud dataproc clusters describe my-spark-cluster \
  --region=us-central1 \
  --format="table(clusterName,status.state,config.masterConfig.numInstances,config.workerConfig.numInstances)"

Autoscaling

Dataproc autoscaling automatically adjusts the number of primary and secondary worker nodes based on YARN memory metrics (pending memory and available memory). This ensures your cluster has enough resources for demand spikes without overpaying during quiet periods. Autoscaling policies define the min/max node counts, cooldown period, and scale-up/scale-down factors.

autoscaling-policy.yaml
workerConfig:
  minInstances: 2
  maxInstances: 10
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 20
  weight: 1
basicAlgorithm:
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 3600s
  cooldownPeriod: 120s

bash
# Create an autoscaling policy
gcloud dataproc autoscaling-policies import my-scaling-policy \
  --source=autoscaling-policy.yaml \
  --region=us-central1

# Create a cluster with autoscaling
gcloud dataproc clusters create my-autoscale-cluster \
  --region=us-central1 \
  --autoscaling-policy=my-scaling-policy \
  --master-machine-type=n2-standard-4 \
  --num-workers=2 \
  --worker-machine-type=n2-standard-4 \
  --image-version=2.2-debian12 \
  --enable-component-gateway

# Attach autoscaling to an existing cluster
gcloud dataproc clusters update my-spark-cluster \
  --region=us-central1 \
  --autoscaling-policy=my-scaling-policy

# Monitor autoscaling decisions
gcloud logging read \
  'resource.type="cloud_dataproc_cluster"
   resource.labels.cluster_name="my-autoscale-cluster"
   jsonPayload.message:"AutoscalerUpdate"' \
  --limit=20

Autoscaling Best Practices

Set the graceful decommission timeout to at least the length of your longest Spark stage to avoid losing work during scale-down. Use secondary (preemptible) workers for scale-out capacity and standard workers for the baseline. Set scaleDownMinWorkerFraction to 0.1-0.2 to prevent aggressive scale-down that causes repeated rescaling. Monitor the YARN ResourceManager UI through the component gateway to understand your cluster's resource utilization patterns.
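The effect of scaleDownMinWorkerFraction is easier to see in a simplified model of the basic algorithm: the autoscaler estimates a worker delta from the memory imbalance, scales it by the configured factor, and ignores changes smaller than the fraction threshold. This is an illustrative sketch with made-up numbers, not Dataproc's exact implementation:

```python
# Simplified model of Dataproc's basic autoscaling algorithm, showing why a
# nonzero scaleDownMinWorkerFraction damps small oscillations.
# Illustrative only -- the real autoscaler averages metrics over the
# cooldown period and applies additional bounds.

def recommended_delta(pending_mem_gb: float, available_mem_gb: float,
                      mem_per_worker_gb: float, current_workers: int,
                      scale_factor: float, min_worker_fraction: float) -> int:
    # Exact worker change needed to absorb the memory imbalance.
    exact = (pending_mem_gb - available_mem_gb) / mem_per_worker_gb
    scaled = scale_factor * exact
    # Changes smaller than min_worker_fraction of the cluster are ignored.
    if abs(scaled) < min_worker_fraction * current_workers:
        return 0
    return round(scaled)

# 20 GB of idle memory on a 10-worker cluster (13 GB usable per worker):
# with the fraction at 0.0 the cluster sheds workers; at 0.2 it holds steady,
# avoiding a rescale over a small imbalance.
print(recommended_delta(0, 20, 13, 10, 1.0, 0.0))
print(recommended_delta(0, 20, 13, 10, 1.0, 0.2))
```

Raising the fraction trades a little idle capacity for fewer decommission cycles, which matters when stages checkpoint a lot of shuffle data.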

Submitting Spark Jobs

You can submit Spark jobs to Dataproc using the gcloud CLI, REST API, or client libraries. Dataproc supports Spark, PySpark, SparkR, Spark SQL, Hive, Pig, and custom Hadoop jobs.

bash
# Submit a PySpark job
gcloud dataproc jobs submit pyspark \
  gs://my-bucket/scripts/word_count.py \
  --cluster=my-spark-cluster \
  --region=us-central1 \
  -- gs://my-bucket/input/ gs://my-bucket/output/

# Submit a Spark job (JAR)
gcloud dataproc jobs submit spark \
  --cluster=my-spark-cluster \
  --region=us-central1 \
  --class=com.example.MySparkApp \
  --jars=gs://my-bucket/jars/my-app-1.0.jar \
  --properties="spark.executor.instances=8,spark.executor.memory=4g" \
  -- --input gs://my-bucket/data --output gs://my-bucket/results

# Submit a Spark SQL job
gcloud dataproc jobs submit spark-sql \
  --cluster=my-spark-cluster \
  --region=us-central1 \
  --execute="SELECT COUNT(*) FROM parquet.\`gs://my-bucket/data/events/*.parquet\`"

# Submit a Hive job
gcloud dataproc jobs submit hive \
  --cluster=my-spark-cluster \
  --region=us-central1 \
  --file=gs://my-bucket/scripts/etl.hql

# Check job status
gcloud dataproc jobs list --region=us-central1 \
  --format="table(reference.jobId,status.state,placement.clusterName)"

# View job logs
gcloud dataproc jobs wait JOB_ID --region=us-central1

word_count.py (PySpark example)
from pyspark.sql import SparkSession
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: word_count.py <input_path> <output_path>")
        sys.exit(1)

    input_path = sys.argv[1]
    output_path = sys.argv[2]

    spark = SparkSession.builder \
        .appName("WordCount") \
        .getOrCreate()

    # Read text files from Cloud Storage
    text_df = spark.read.text(input_path)

    # Word count using DataFrame API
    from pyspark.sql.functions import explode, split, lower, col

    word_counts = (
        text_df
        .select(explode(split(lower(col("value")), "\\W+")).alias("word"))
        .filter(col("word") != "")
        .groupBy("word")
        .count()
        .orderBy(col("count").desc())
    )

    # Write results to Cloud Storage as Parquet
    word_counts.write.mode("overwrite").parquet(output_path)

    print("Top 10 words:")
    word_counts.show(10)

    spark.stop()

if __name__ == "__main__":
    main()
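The gcloud submissions above map onto the Dataproc Jobs API, so the same job can be submitted programmatically. A sketch of the PySpark job payload in the python-client field naming (bucket and cluster names are the placeholders used throughout this guide); with the google-cloud-dataproc library you would pass this dict as the `job` field of a submit request:

```python
# Build the Dataproc "Job" payload equivalent to:
#   gcloud dataproc jobs submit pyspark gs://.../word_count.py \
#     --cluster=my-spark-cluster -- <input> <output>

def pyspark_job(cluster: str, main_uri: str, args: list) -> dict:
    return {
        "placement": {"cluster_name": cluster},
        "pyspark_job": {
            "main_python_file_uri": main_uri,
            "args": args,
        },
    }

job = pyspark_job(
    "my-spark-cluster",
    "gs://my-bucket/scripts/word_count.py",
    ["gs://my-bucket/input/", "gs://my-bucket/output/"],
)
print(job["pyspark_job"]["main_python_file_uri"])

# With the client library (not executed here; requires credentials):
# from google.cloud import dataproc_v1
# client = dataproc_v1.JobControllerClient(
#     client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"})
# op = client.submit_job_as_operation(
#     request={"project_id": "MY_PROJECT", "region": "us-central1", "job": job})
# result = op.result()  # blocks until the job finishes
```

This is useful when job submission needs to live inside an application or orchestrator rather than a shell script.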

Dataproc Serverless for Spark

Dataproc Serverless eliminates cluster management entirely. You submit a Spark batch job and Dataproc automatically provisions the exact resources needed, runs the job, and cleans up. There are no clusters to create, scale, or delete. You pay only for the resources consumed during job execution, measured in Data Compute Units (DCUs).

bash
# Submit a serverless PySpark batch job
gcloud dataproc batches submit pyspark \
  gs://my-bucket/scripts/etl_pipeline.py \
  --region=us-central1 \
  --subnet=default \
  --deps-bucket=gs://my-bucket/deps \
  --py-files=gs://my-bucket/libs/utils.py \
  --properties="spark.executor.instances=10,spark.dynamicAllocation.enabled=true" \
  -- --date=2026-03-14 --output=gs://my-bucket/output

# Submit a serverless Spark SQL batch
gcloud dataproc batches submit spark-sql \
  --region=us-central1 \
  --subnet=default \
  --query="SELECT date, COUNT(*) as events FROM parquet.\`gs://my-bucket/data/*.parquet\` GROUP BY date" \
  --output-format=csv \
  --output-uri=gs://my-bucket/sql-output/

# Submit with custom container image
gcloud dataproc batches submit pyspark \
  gs://my-bucket/scripts/ml_training.py \
  --region=us-central1 \
  --subnet=default \
  --container-image=us-central1-docker.pkg.dev/MY_PROJECT/docker-repo/spark-ml:latest \
  --properties="spark.executor.memory=8g,spark.executor.cores=4"

# List batch jobs
gcloud dataproc batches list --region=us-central1

# Describe a batch job
gcloud dataproc batches describe BATCH_ID --region=us-central1

# Cancel a running batch
gcloud dataproc batches cancel BATCH_ID --region=us-central1

Serverless vs Managed Clusters

Use Dataproc Serverless for ad-hoc queries, scheduled ETL jobs, and workloads with variable demand. Use managed clusters for interactive analysis (Jupyter notebooks), long-running streaming jobs, workloads requiring specific Hadoop ecosystem components (Hive Metastore, HBase), or when you need fine-grained control over cluster configuration. Many organizations use both: serverless for batch ETL and managed clusters for interactive exploration.

Component Gateway and Web UIs

The component gateway provides secure, IAM-authenticated access to web UIs for cluster components like the Spark History Server, YARN Resource Manager, Jupyter, and Zeppelin. This eliminates the need to set up SSH tunnels or open firewall ports to access these interfaces.

bash
# Enable component gateway when creating a cluster
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --enable-component-gateway \
  --optional-components=JUPYTER,ZEPPELIN,HIVE_WEBHCAT

# Access web UIs through the gateway
# URLs are available in the cluster description:
gcloud dataproc clusters describe my-cluster \
  --region=us-central1 \
  --format="yaml(config.endpointConfig.httpPorts)"

# Typical endpoints:
# YARN ResourceManager: https://GATEWAY_URL/yarn/
# Spark History Server: https://GATEWAY_URL/sparkhistory/
# Jupyter: https://GATEWAY_URL/jupyter/
# Zeppelin: https://GATEWAY_URL/zeppelin/
# HDFS NameNode: https://GATEWAY_URL/hdfs/

# Install additional components via initialization actions
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/conda/bootstrap-conda.sh \
  --metadata="CONDA_PACKAGES=pandas scikit-learn matplotlib" \
  --enable-component-gateway

Integration with BigQuery and Cloud Storage

BigQuery integration from Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BigQuery Integration") \
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
    .getOrCreate()

# Read from BigQuery
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.samples.shakespeare") \
    .load()

# Process with Spark
word_counts = df.groupBy("word").sum("word_count") \
    .withColumnRenamed("sum(word_count)", "total")

# Write results back to BigQuery
word_counts.write.format("bigquery") \
    .option("table", "MY_PROJECT.my_dataset.word_counts") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("overwrite") \
    .save()

# Read Parquet from Cloud Storage (use gs:// paths directly)
events = spark.read.parquet("gs://my-data-lake/events/year=2026/month=03/")

# Write to Cloud Storage
events.write.mode("overwrite") \
    .partitionBy("event_type") \
    .parquet("gs://my-data-lake/processed/")

Workflow Templates

Workflow templates define a reusable sequence of jobs that run on a cluster. The cluster can be pre-existing or created on demand by the workflow. This is ideal for scheduled ETL pipelines that spin up a cluster, run multiple jobs in sequence, and tear down the cluster when done.

bash
# Create a workflow template
gcloud dataproc workflow-templates create my-etl-workflow \
  --region=us-central1

# Set managed cluster configuration (ephemeral cluster)
gcloud dataproc workflow-templates set-managed-cluster my-etl-workflow \
  --region=us-central1 \
  --master-machine-type=n2-standard-4 \
  --num-workers=4 \
  --worker-machine-type=n2-standard-4 \
  --image-version=2.2-debian12

# Add jobs to the template
gcloud dataproc workflow-templates add-job pyspark \
  gs://my-bucket/scripts/extract.py \
  --workflow-template=my-etl-workflow \
  --region=us-central1 \
  --step-id=extract \
  -- --date=2026-03-14

gcloud dataproc workflow-templates add-job pyspark \
  gs://my-bucket/scripts/transform.py \
  --workflow-template=my-etl-workflow \
  --region=us-central1 \
  --step-id=transform \
  --start-after=extract \
  -- --date=2026-03-14

gcloud dataproc workflow-templates add-job pyspark \
  gs://my-bucket/scripts/load.py \
  --workflow-template=my-etl-workflow \
  --region=us-central1 \
  --step-id=load \
  --start-after=transform \
  -- --date=2026-03-14

# Instantiate (run) the workflow
gcloud dataproc workflow-templates instantiate my-etl-workflow \
  --region=us-central1
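The add-job commands above assemble a WorkflowTemplate whose jobs form a DAG through step IDs and prerequisite lists. A sketch of the equivalent template jobs in python-client field naming (script paths are the same placeholders as above):

```python
# Build the jobs list of a Dataproc WorkflowTemplate: each job has a step_id,
# and prerequisite_step_ids expresses the extract -> transform -> load ordering.

def step(step_id, script, after=None):
    job = {
        "step_id": step_id,
        "pyspark_job": {
            "main_python_file_uri": script,
            "args": ["--date=2026-03-14"],
        },
    }
    if after:
        # This step runs only after the listed steps succeed.
        job["prerequisite_step_ids"] = after
    return job

template_jobs = [
    step("extract", "gs://my-bucket/scripts/extract.py"),
    step("transform", "gs://my-bucket/scripts/transform.py", after=["extract"]),
    step("load", "gs://my-bucket/scripts/load.py", after=["transform"]),
]
print([j["step_id"] for j in template_jobs])
```

Steps with no dependency between them (e.g. two extracts) run in parallel, so the DAG shape directly controls pipeline wall-clock time.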

Cleanup

bash
# Delete clusters
gcloud dataproc clusters delete my-spark-cluster --region=us-central1 --quiet
gcloud dataproc clusters delete my-dev-cluster --region=us-central1 --quiet

# Delete autoscaling policies
gcloud dataproc autoscaling-policies delete my-scaling-policy \
  --region=us-central1 --quiet

# Delete workflow templates
gcloud dataproc workflow-templates delete my-etl-workflow \
  --region=us-central1 --quiet

# Cancel and delete batch jobs
gcloud dataproc batches list --region=us-central1 \
  --filter="state=RUNNING" \
  --format="value(name)" | while read batch; do
  gcloud dataproc batches cancel "${batch}" --quiet
done

Key Takeaways

  1. Dataproc clusters start in under 90 seconds and can be deleted immediately after jobs complete.
  2. Autoscaling adjusts worker count based on YARN metrics, scaling from 2 to hundreds of nodes automatically.
  3. Dataproc Serverless eliminates cluster management entirely for Spark batch jobs with no idle costs.
  4. Preemptible/Spot secondary workers reduce compute costs by 60-80% for fault-tolerant workloads.
  5. Component gateway provides secure, IAM-authenticated access to Spark UI, YARN, Jupyter, and Zeppelin.
  6. Workflow templates define reusable ETL pipelines that spin up ephemeral clusters and run multi-step jobs.

Frequently Asked Questions

Should I use Dataproc or BigQuery for analytics?

Use BigQuery for SQL-based analytics on structured data. Use Dataproc for complex data processing that requires custom code (PySpark, Scala), machine learning pipelines, or Hadoop ecosystem tools (Hive, HBase, Presto). Many organizations use both: Dataproc for ETL and BigQuery for serving analytics.

When should I use Dataproc Serverless vs managed clusters?

Use Serverless for scheduled ETL batch jobs, ad-hoc queries, and variable workloads where you want zero idle cost. Use managed clusters for interactive analysis (Jupyter), streaming jobs, workloads requiring specific Hadoop components, or when you need fine-grained cluster configuration.

How do I optimize Dataproc costs?

Use autoscaling with preemptible secondary workers. Use ephemeral clusters (create, run job, delete) instead of persistent clusters. Use Dataproc Serverless for batch jobs. Right-size machine types based on actual resource usage. Store data in Cloud Storage (not HDFS) so clusters can be deleted without data loss.
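The ephemeral-vs-persistent advice is easy to quantify. A back-of-envelope sketch, assuming an illustrative $1/hr cluster rate (substitute your own figure from the pricing section):

```python
# Compare an always-on cluster against an ephemeral one that exists only
# while jobs run. The hourly rate is an illustrative assumption.
HOURLY_COST = 1.0  # $/hr, placeholder -- use your actual cluster cost

def monthly_cost(hours_per_day: float, days: int = 30) -> float:
    return round(HOURLY_COST * hours_per_day * days, 2)

always_on = monthly_cost(24)  # persistent cluster, 24x7
ephemeral = monthly_cost(3)   # spun up ~3 hrs/day for batch ETL
print(always_on, ephemeral, round(always_on - ephemeral, 2))
```

Because Dataproc data lives in Cloud Storage rather than HDFS, deleting the cluster between runs costs nothing but the next 90-second startup.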

Written by CloudToolStack Team

Cloud engineers and architects with hands-on experience across AWS, Azure, and GCP. We write guides based on real-world production patterns, not just documentation rewrites.

Disclaimer: This guide is for educational purposes. Cloud services change frequently; always refer to official documentation for the latest information. AWS, Azure, and GCP are trademarks of their respective owners.