Build Data Flow (Spark) application configurations with driver/executor shapes, parameters, and private endpoints.
Last verified: May 2026
Required Fields
compartmentId, displayName, language, sparkVersion, fileUri, driverShape, executorShape, numExecutors
The builder constructs OCI Data Flow application configurations from these inputs: the application resource (compartment, name, Spark version, and language: SCALA, PYTHON, JAVA, or SQL), the file URI (the application JAR, Python, or SQL file in Object Storage), driver shape and count, executor shape and count, parameter overrides, and the warehouse Object Storage bucket for output. Output is generated as oci data-flow commands and Terraform oci_dataflow_application and oci_dataflow_invoke_run resources.
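As a rough illustration of what the required fields add up to, here is a minimal Python sketch that assembles them into one configuration and rejects incomplete input. The field names come from the Required Fields list; the helper name build_application, the REQUIRED_FIELDS tuple, and every value in the example call (OCID, bucket, namespace, Spark version, language spelling) are hypothetical placeholders, not output captured from the tool.

```python
import json

# Field names taken from the Required Fields list above.
REQUIRED_FIELDS = (
    "compartmentId", "displayName", "language", "sparkVersion",
    "fileUri", "driverShape", "executorShape", "numExecutors",
)


def build_application(**fields):
    """Return an application config dict, rejecting missing required fields."""
    missing = [name for name in REQUIRED_FIELDS if fields.get(name) in (None, "")]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    if int(fields["numExecutors"]) < 1:
        raise ValueError("numExecutors must be at least 1")
    return dict(fields)


# Placeholder values for illustration only; none of these are real resources.
app = build_application(
    compartmentId="ocid1.compartment.oc1..exampleuniqueID",
    displayName="nightly-etl",
    language="PYTHON",  # assumed enum spelling; verify against the API
    sparkVersion="3.5.0",
    fileUri="oci://example-bucket@example-namespace/etl/main.py",
    driverShape="VM.Standard.E5.Flex",
    executorShape="VM.Standard.E5.Flex",
    numExecutors=5,
)
print(json.dumps(app, indent=2))
```

Checking for missing fields before generating anything keeps the failure close to the input that caused it, which is the same reason to run provider validation before applying.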
This tool helps OCI engineers generate valid configurations quickly without consulting documentation, reducing errors and accelerating infrastructure deployment. All processing runs in your browser with no data sent to external servers.
Your data team's nightly Spark ETL runs on an always-on EMR cluster that costs $1,800/month, yet most jobs run for only 2-3 hours. The builder generates a Data Flow application: PySpark code in Object Storage, a VM.Standard.E5.Flex driver with 4 OCPU, and 5 VM.Standard.E5.Flex executors with 8 OCPU each, run on demand and triggered by Object Storage events. New monthly cost: ~$200 (2-3 hours/day instead of 24/7). Annual savings: about $19K ($1,600/month × 12), plus no more EMR cluster maintenance.
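The arithmetic behind that scenario can be checked with a short Python sketch. The dollar figures are the ones quoted above; the function names are hypothetical and exist only for this illustration.

```python
def annual_savings(always_on_monthly, on_demand_monthly):
    """Yearly savings from replacing an always-on cluster with on-demand runs."""
    return (always_on_monthly - on_demand_monthly) * 12


def runtime_fraction(hours_per_day):
    """Share of a 24-hour day during which the job is actually running."""
    return hours_per_day / 24


savings = annual_savings(1800, 200)
print(f"Annual savings: ${savings:,}")  # prints "Annual savings: $19,200"
print(f"Busy share at 3 h/day: {runtime_fraction(3):.1%}")  # prints "Busy share at 3 h/day: 12.5%"
```

A job that is busy for roughly 8-12% of the day lines up with the quoted drop from $1,800 to about $200 per month, which is about 11% of the original bill.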
OCI Data Flow is managed Spark: it removes the operational burden of running Spark clusters. You submit a job, Data Flow provisions the cluster, runs the job, and tears the cluster down. You pay only for actual job runtime, which is far simpler than operating long-running EMR or Dataproc clusters yourself.
Driver and executor shape choice drives cost. Small workloads: VM.Standard.E5.Flex with 4 OCPU. Large analytics: VM.Standard.E5.Flex with 16+ OCPU and 128 GB RAM. GPU workloads (rare for Spark): BM.GPU.A10. Right-size based on the actual job profile instead of over-provisioning 'just in case'.
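One way to keep sizing decisions explicit is a small lookup from workload profile to shape settings, sketched here in Python. The profile names "small" and "large", the function name shape_config, and the key names ocpus and memoryInGBs are assumptions made for this illustration; the sizes themselves are the ones given above.

```python
def shape_config(profile):
    """Map a workload profile to a flex-shape sizing, per the guidance above."""
    sizes = {
        # Small workloads: 4 OCPU; the text gives no memory figure, so none is set.
        "small": {"shape": "VM.Standard.E5.Flex", "ocpus": 4},
        # Large analytics: 16+ OCPU and 128 GB RAM; 16 is the floor used here.
        "large": {"shape": "VM.Standard.E5.Flex", "ocpus": 16, "memoryInGBs": 128},
    }
    if profile not in sizes:
        raise ValueError(f"unknown profile: {profile!r}")
    return dict(sizes[profile])


print(shape_config("small"))
print(shape_config("large"))
```

Rejecting unknown profiles forces a deliberate choice for each job instead of a silent default that may be oversized.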
Use Spot instances for fault-tolerant Spark workloads. Data Flow handles preemption by re-running the affected tasks, so a job survives losing executors at the cost of some repeated work. At the 60% Spot discount, the Spot-priced portion of a large analytics job costs 40% of its on-demand price.
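To see how the discount interacts with re-run work, here is a minimal Python sketch. The 60% discount is the figure quoted above; the function name spot_cost and the rerun_overhead parameter (extra runtime caused by re-running preempted tasks) are hypothetical, and the 10% overhead in the example is an arbitrary illustration, not a measured value.

```python
def spot_cost(on_demand_cost, discount=0.60, rerun_overhead=0.0):
    """Estimated Spot cost of a job, given its on-demand cost.

    rerun_overhead is the fraction of extra runtime spent re-running
    preempted tasks (a hypothetical knob for this sketch).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if rerun_overhead < 0:
        raise ValueError("rerun_overhead must not be negative")
    return on_demand_cost * (1 - discount) * (1 + rerun_overhead)


# A job that costs $1,000 on demand, with and without an assumed 10% of re-run work.
print(round(spot_cost(1000), 2))
print(round(spot_cost(1000, rerun_overhead=0.10), 2))
```

Even with some repeated work, the Spot price stays well below on-demand as long as the overhead is small relative to the discount.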
Most Data Flow Application primitives behave the same in commercial and Government Cloud OCI, but the OCID realm differs, region availability is limited, and a handful of services are unavailable. The output is portable in shape; you must adjust realm and verify service availability before applying in a Government Cloud tenancy.
The builder produces structurally valid output for the OCI schemas it supports. We still recommend running provider validation locally before applying, because schemas evolve and a recently released property may not yet be reflected. When validation does fail, the error points at the exact attribute the schema rejected.
Disclaimer: This tool runs entirely in your browser. No data is sent to our servers. Always verify outputs before using them in production. AWS, Azure, and GCP are trademarks of their respective owners.