Creating Parameterized Spark Jobs on Ephemeral DataProc Clusters
On-demand Spark jobs with DataProc help democratize Spark. For a while, the common, accessible options were running Spark through a notebook environment like Databricks or Palantir, or standing up your own Spark cluster, an exercise too tedious for most teams. Recently, I worked with Google's DataProc service, and it democratizes Spark a bit more: on-demand Spark jobs and clusters are much easier to kick off or stand up. But only at first.
When I had to parameterize my job and put it into production, it set me back a whole extra week due to confusing documentation and an unclear picture of what was and wasn't possible with DataProc.
This guide aims to show you how to easily parameterize a Spark job on GCP DataProc and then run it on an ephemeral cluster. For more information on the basics of DataProc, I recommend this guide by Gary Stafford.
First, some background terminology.
- Ephemeral clusters are Spark clusters/compute that are off by default. They are kicked off by some kind of initialization command, and when the script finishes, the cluster shuts back down. Super useful if, like me, you're forgetful with the off button.
- Workflow templates are essentially pre-configured directions for a cluster, often used for ephemeral cluster jobs. You can think of them as a kind of IaC for a Spark cluster: they tell GCP things like the cluster's name, how much compute to give it, and so on.
I want to start by noting that you can create a DataProc cluster, turn it on, and then submit a job with parameters. That's completely possible without the steps below. We opted to avoid that path because it violates a fundamental design principle for us: it spins up production infrastructure with CLI commands in Airflow rather than through IaC. Workflow templates were a good workaround because they restrict the operation to kicking off a predefined script rather than creating compute.
So, we decided on parameterizing ephemeral clusters. The only issue is that there's a chicken-and-egg situation with workflow templates: you can't add a parameterized job to a non-parameterized workflow template, and you can't parameterize a workflow template without having it point to a parameterized job. It just doesn't work. Trust me, I've tried every combination I can think of, and every error message led to a dead end. The only way to parameterize an ephemeral job is by importing a workflow template from a YAML file.
So, how do you parameterize an ephemeral job? Below are a few code snippets which I hope will save you some time.
Defining the Workflow Template YAML
First, you need to generate a YAML file that describes the cluster and its job. I broke this out into three separate parts and will combine them into the full YAML file at the end; feel free to skip ahead to that.
First, describe the cluster -
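(A minimal sketch; the cluster name, zone, and machine types are placeholder values to swap for your own.)

placement:
  managedCluster:
    clusterName: my-ephemeral-cluster   # placeholder cluster name
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
      masterConfig:
        numInstances: 1
        machineTypeUri: n1-standard-4
      workerConfig:
        numInstances: 2
        machineTypeUri: n1-standard-4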
Next, add information about the job, including the job's properties, which are essentially what allow us to parameterize things -
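(Another sketch; the GCS path and the spark.myapp.* property keys are placeholders I made up, and the empty strings are just defaults that the parameters will overwrite.)

jobs:
- stepId: STEP_NAME_123
  pysparkJob:
    mainPythonFileUri: gs://my-bucket/my_pyspark_script.py   # placeholder path
    properties:
      spark.myapp.var1: ''   # placeholder property key for MY_VAR1
      spark.myapp.var2: ''   # placeholder property key for MY_VAR2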
Finally, we add in a parameters section to the YAML file, which allows the workflow template and its corresponding job to handshake on variables.
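(Same stepId and placeholder property keys as above.)

parameters:
- name: MY_VAR1
  fields:
  - jobs['STEP_NAME_123'].pysparkJob.properties['spark.myapp.var1']
- name: MY_VAR2
  fields:
  - jobs['STEP_NAME_123'].pysparkJob.properties['spark.myapp.var2']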
Our workflow template is now done, and it looks something like this -
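(Again, the cluster settings, GCS path, and spark.myapp.* keys are placeholders; only the overall shape matters.)

placement:
  managedCluster:
    clusterName: my-ephemeral-cluster
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
      masterConfig:
        numInstances: 1
        machineTypeUri: n1-standard-4
      workerConfig:
        numInstances: 2
        machineTypeUri: n1-standard-4
jobs:
- stepId: STEP_NAME_123
  pysparkJob:
    mainPythonFileUri: gs://my-bucket/my_pyspark_script.py
    properties:
      spark.myapp.var1: ''
      spark.myapp.var2: ''
parameters:
- name: MY_VAR1
  fields:
  - jobs['STEP_NAME_123'].pysparkJob.properties['spark.myapp.var1']
- name: MY_VAR2
  fields:
  - jobs['STEP_NAME_123'].pysparkJob.properties['spark.myapp.var2']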
Some important gotchas -
- In the jobs section of the workflow template, variables are called properties, but in the parameters section variables are called parameters. This leads to confusion, but the reasoning is that the variable is a "property of a job" but a "parameter within the cluster".
- You need to make sure that the stepId in the jobs section matches the step name in the parameters section. In our case, the stepId is STEP_NAME_123, but you can name it whatever you want as long as it’s consistent.
- Parameters need to be UPPERCASE.
Using the Workflow Template
With our YAML file finished, we can import it — importing a YAML file creates a workflow template.
gcloud dataproc workflow-templates import $WORKFLOW_TEMPLATE_NAME \
  --region=us-central1 \
  --source=workflow-template-file-path.yaml
The CLI output should tell us whether the workflow import was successful. You can also check in the GCP DataProc console under Workflows -> Workflow templates; your new workflow template should now exist there.
Next, instantiate the workflow template. Instantiation does two things: first, it starts up whatever cluster is defined in the workflow template; then, it kicks off the job(s) described in the template.
Assuming the variables MY_VAR1 and MY_VAR2 described above, the command would look like this -
gcloud dataproc workflow-templates instantiate $WORKFLOW_TEMPLATE_NAME \
  --region=us-central1 \
  --parameters="MY_VAR1=123,MY_VAR2=456"
Once again, the CLI output should let us know whether the job kicked off correctly. We can also go into the DataProc console and see the status under Jobs, where we can view logs to help debug further.
The last tricky part was understanding how to access the variables we're passing in: they'll be in the Spark context. Below is an example script for those same variables MY_VAR1 and MY_VAR2. A lot of the SparkSession builder code is just fluff and may be different in your case.
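(A minimal sketch, assuming the placeholder property keys spark.myapp.var1 and spark.myapp.var2 from the template above; use whatever keys your template actually sets.)

from pyspark.sql import SparkSession

# Mostly boilerplate; your builder options will likely differ.
spark = SparkSession.builder.appName("my-parameterized-job").getOrCreate()

# The workflow template mapped MY_VAR1 and MY_VAR2 onto Spark properties,
# so they show up in the Spark context's configuration.
conf = spark.sparkContext.getConf()
my_var1 = conf.get("spark.myapp.var1")   # value passed as MY_VAR1
my_var2 = conf.get("spark.myapp.var2")   # value passed as MY_VAR2

print(f"MY_VAR1={my_var1}, MY_VAR2={my_var2}")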
I didn't test the limit on how many variables you can pass in, but you can definitely scale beyond two variables; we use five. You can also use a similar strategy to set Spark options directly. We needed to pass in a Spark driver option to use a third-party integration, and that variable handoff looked a little different.
YAML parameterization -
jobs:
- stepId: STEP_NUM_123
  pysparkJob:
    fileUris: []
    mainPythonFileUri: your_pyspark_script.py
    properties:
      spark.driver.extraJavaOptions: ''
parameters:
- name: KEY
  fields:
  - jobs['STEP_NUM_123'].pysparkJob.properties['spark.driver.extraJavaOptions']
And passing it in through the CLI —
gcloud dataproc workflow-templates instantiate $WORKFLOW_TEMPLATE_NAME \
  --region=us-central1 \
  --parameters="KEY=-Djsl.settings.license=MY_KEY_VALUE"
Here's more information on extraJavaOptions to help you find that variable name.
Happy parameterizing!