Generate Dataproc cluster configs with machine types, init actions, and autoscaling.
Last verified: May 2026
Required Fields
clusterName, region, masterConfig, masterConfig.numInstances, workerConfig

The GCP Dataproc Cluster Config Builder helps you create configuration files for Google Cloud Dataproc clusters that run Apache Spark, Hadoop, and other open-source data processing frameworks. Dataproc cluster configuration involves selecting machine types, disk sizes, number of workers, autoscaling policies, initialization actions, and optional components. This tool generates the gcloud commands or Terraform configurations for provisioning optimized clusters for your data processing workloads.
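As a rough illustration of how the required fields listed above might be checked before generating output, here is a minimal Python sketch. It assumes the config is a nested dictionary keyed by the same field names; the function name `missing_fields`, the example values, and the dictionary shape are hypothetical and are not taken from the tool itself.

```python
# Dotted paths of the required fields listed by the tool.
REQUIRED_FIELDS = [
    "clusterName",
    "region",
    "masterConfig",
    "masterConfig.numInstances",
    "workerConfig",
]


def missing_fields(config: dict) -> list[str]:
    """Return the required dotted paths that are absent from config."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = config
        for key in path.split("."):
            # Stop descending as soon as one segment of the path is absent.
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# Hypothetical input: masterConfig is present but lacks numInstances.
example = {
    "clusterName": "nightly-etl",
    "region": "us-central1",
    "masterConfig": {"machineTypeUri": "n1-standard-8"},
    "workerConfig": {"numInstances": 10},
}
print(missing_fields(example))  # ['masterConfig.numInstances']
```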
Use Dataproc when you need to run existing Spark, Hadoop, or Hive workloads, or want full control over the cluster configuration. Use Dataflow (based on Apache Beam) for fully serverless, autoscaling stream and batch processing where you prefer not to manage cluster infrastructure.
Primary workers are standard persistent VMs that store HDFS data and run tasks. Secondary workers (preemptible/Spot) are cheaper VMs that can be reclaimed by Google at any time. They run tasks but do not store HDFS data, making them ideal for scaling compute capacity for fault-tolerant jobs.
Your team runs 30 nightly Spark jobs on a 24/7 Dataproc cluster (10 n1-standard-8 nodes) that costs $5,800/month. The builder shows the alternative: ephemeral Dataproc Serverless batches that autoscale per job. After migration, total cost drops to $1,200/month, because the jobs run for only about 6 hours in total each night instead of paying for a 24-hour cluster. The tradeoff is startup time: each job goes from starting instantly (the cluster is already running) to waiting about 90 seconds for a Serverless cold start. For nightly batch jobs that latency is acceptable, and the cost savings are substantial.
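The arithmetic behind this scenario can be sketched in a few lines of Python. The figures are the ones quoted above; the helper name `monthly_savings` is hypothetical, and the cold-start total assumes the 30 jobs each pay the ~90 second start once per night.

```python
def monthly_savings(before: float, after: float) -> tuple[float, float]:
    """Return (absolute savings, savings as a fraction of the original cost)."""
    saved = before - after
    return saved, saved / before


# Monthly cost of the 24/7 cluster vs. the Serverless batches, as stated above.
saved, fraction = monthly_savings(5800.0, 1200.0)
print(f"${saved:,.0f} saved per month ({fraction:.0%})")  # $4,600 saved per month (79%)

# Added cold-start time per night: 30 jobs x ~90 seconds each, in minutes.
cold_start_minutes = 30 * 90 / 60
print(cold_start_minutes)  # 45.0
```

If the jobs run in parallel, the added wall-clock time is closer to a single 90 second start than to the 45 minute total.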
The builder generates Dataproc cluster configs that cover: master and worker node settings (machine type, boot disk size, accelerators), an autoscaling policy reference, software config (image version, optional components, and properties for Spark, YARN, and HDFS tuning), initialization actions, encryption config, and lifecycle policy (idle delete TTL). Output is provided as gcloud dataproc clusters create commands and Terraform google_dataproc_cluster resources.
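To give a feel for the gcloud output, the following Python sketch assembles a `gcloud dataproc clusters create` command from a flat settings dictionary. This is not the builder's actual implementation: the dictionary keys, the function name `build_create_command`, the bucket path, and the policy name are all hypothetical. The flags themselves are standard gcloud flags, but verify them against your installed gcloud version before use.

```python
import shlex

# Hypothetical setting names mapped to gcloud flags.
OPTIONAL_FLAGS = {
    "masterMachineType": "--master-machine-type",
    "workerMachineType": "--worker-machine-type",
    "numWorkers": "--num-workers",
    "masterBootDiskSize": "--master-boot-disk-size",
    "imageVersion": "--image-version",
    "optionalComponents": "--optional-components",
    "autoscalingPolicy": "--autoscaling-policy",
    "initializationActions": "--initialization-actions",
    "maxIdle": "--max-idle",
}


def build_create_command(cfg: dict) -> str:
    """Render a shell-safe 'gcloud dataproc clusters create' command."""
    args = [
        "gcloud", "dataproc", "clusters", "create",
        cfg["clusterName"],
        f"--region={cfg['region']}",
    ]
    # Emit only the optional flags the caller actually set.
    for key, flag in OPTIONAL_FLAGS.items():
        if key in cfg:
            args.append(f"{flag}={cfg[key]}")
    return shlex.join(args)


cmd = build_create_command({
    "clusterName": "nightly-etl",            # hypothetical cluster name
    "region": "us-central1",
    "masterMachineType": "n1-standard-8",
    "workerMachineType": "n1-standard-8",
    "numWorkers": 10,
    "imageVersion": "2.2-debian12",
    "autoscalingPolicy": "nightly-policy",   # hypothetical policy name
    "initializationActions": "gs://example-bucket/init.sh",  # hypothetical path
    "maxIdle": "30m",
})
print(cmd)
```

Using `shlex.join` keeps the rendered command safe to paste into a shell even if a value contains spaces or quotes.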
Dataproc Serverless (the newer offering) eliminates cluster management entirely for Spark workloads. For ad-hoc Spark jobs, batch processing, or workloads where you don't need persistent HDFS, Serverless is simpler and typically cheaper than managing a cluster: there are no idle costs, and autoscaling is included.
Preemptible secondary workers can deliver 60-91% cost savings on Dataproc compute, but they can be reclaimed at any time. Always run primary workers (your HDFS data nodes) on standard VMs, and use preemptible VMs only for scaling compute. Never set the HDFS replication factor to 1 on a cluster that mixes primary and preemptible workers, because you'll lose data when preemptibles are reclaimed.
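The replication rule above could be enforced with a small pre-flight check, sketched here in Python. The config shape, the key `numSecondaryWorkers`, and the function name `replication_warnings` are hypothetical; `hdfs:dfs.replication` follows the prefix:key form that Dataproc uses for cluster properties. The check only fires when the replication factor is set explicitly, so it makes no assumption about the image default.

```python
def replication_warnings(cfg: dict) -> list[str]:
    """Flag an explicit HDFS replication factor of 1 when secondary workers exist."""
    warnings = []
    secondary = cfg.get("numSecondaryWorkers", 0)
    replication = cfg.get("properties", {}).get("hdfs:dfs.replication")
    # Only warn on an explicit setting; an unset property is left alone.
    if secondary > 0 and replication is not None and int(replication) == 1:
        warnings.append(
            "hdfs:dfs.replication is 1 on a cluster with preemptible "
            "secondary workers; raise it to avoid data loss"
        )
    return warnings


risky = {
    "numSecondaryWorkers": 20,
    "properties": {"hdfs:dfs.replication": "1"},
}
print(replication_warnings(risky))
```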
Initialization actions run on every node when the cluster starts. They're powerful for installing custom software, but they make cluster startup slow and brittle. For repeatable changes, use a custom Dataproc image instead of init actions: startup goes from 5+ minutes back to ~2 minutes, and the customization is versioned.
Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, and GCP are trademarks of their respective owners.