Run jobs on a Spark Cluster with spark-submit or pyspark
In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run Spark jobs with spark-submit or interactively with pyspark.
The key difference between this and other tutorials is that this workflow gives you the ability to customize all aspects of the Spark configuration when you launch your application. Refer to the CLI help, e.g. spark-submit --help, for details on how to set these options.
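For example, resource settings can be passed as standard spark-submit options, and any other Spark property can be set with --conf. The values below are illustrative only, not recommendations:

$ spark-submit --executor-memory 8G --total-executor-cores 64 --conf spark.sql.shuffle.partitions=200 my-job.py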
1. Install a Spark client, such as pyspark, such that spark-submit and pyspark are available. For example:
$ pip install "sparkctl[pyspark]"
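Assuming the pip install above targeted the Python environment you currently have active, you can confirm that the client is on your PATH by printing its version:

$ spark-submit --version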
2. Allocate compute nodes, such as with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 complete nodes for Spark workers. The colon in the salloc command separates the two heterogeneous job components.
$ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
3. Activate the Python environment that contains sparkctl.
$ module load python
$ source ~/python-envs/sparkctl
4. Configure the Spark cluster. The sparkctl code will detect the compute nodes based on Slurm environment variables.
$ sparkctl configure
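If you want to see the nodes in your allocation yourself, you can print the node list that Slurm exposes. For heterogeneous jobs, per-component variables such as SLURM_JOB_NODELIST_HET_GROUP_0 may also be set:

$ echo $SLURM_JOB_NODELIST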
5. Optional: inspect the Spark configuration in ./conf.
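The exact files depend on what sparkctl generates, but a standard Spark configuration directory typically contains spark-defaults.conf and spark-env.sh, which you can view directly. For example:

$ cat ./conf/spark-defaults.conf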
6. Start the cluster.
$ sparkctl start
7. Set the environment variables SPARK_CONF_DIR and JAVA_HOME. This will ensure that your application uses the Spark settings created by sparkctl configure. Instructions will be printed to the console. For example:
$ export SPARK_CONF_DIR=$(pwd)/conf
$ export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
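If you want to confirm that JAVA_HOME points at a usable JDK, a quick sanity check is:

$ "$JAVA_HOME/bin/java" -version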
8. Run your application. The recommended approach is to launch your application through spark-submit:
$ spark-submit --master spark://$(hostname):7077 my-job.py
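For reference, a minimal my-job.py compatible with this command might look like the sketch below. The DataFrame contents are placeholders; the key point is that the script only needs to obtain a SparkSession, because spark-submit supplies the master URL and cluster configuration.

from pyspark.sql import SparkSession

def main():
    # spark-submit provides the master URL and Spark settings,
    # so the script only has to obtain (or create) the session.
    spark = SparkSession.builder.appName("my-job").getOrCreate()

    # Placeholder work: build a small DataFrame and run an action on it.
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    print(df.count())

    spark.stop()

if __name__ == "__main__":
    main()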
Alternatively, if you want to run your jobs interactively, you can use pyspark:
$ pyspark --master spark://$(hostname):7077
>>> df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a","b"])
>>> df.show()
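Inside the pyspark shell a SparkSession is already defined as spark (used above) and a SparkContext as sc, so no setup code is needed. For example:

>>> sc.parallelize(range(10)).sum()
45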
Optional: create a SparkSession in your own Python script. This is not recommended unless you want to set breakpoints inside your code.
$ python my-job.py
For this to work, you may need to set the environment variable PYSPARK_PYTHON to the path to your Python executable. Otherwise, the Spark workers may try to use the version of Python included in the Spark distribution, which likely won't be compatible.
$ export PYSPARK_PYTHON=$(which python)
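Because nothing passes --master to the script in this mode, the SparkSession has to point at the cluster itself, unless the generated configuration already sets spark.master. A minimal sketch, assuming the standalone master is running on the current node on the default port 7077:

import socket

from pyspark.sql import SparkSession

def main():
    # Connect to the standalone cluster started by sparkctl; adjust the URL
    # if the master is running on a different node.
    master_url = f"spark://{socket.gethostname()}:7077"
    spark = (
        SparkSession.builder
        .master(master_url)
        .appName("my-job-interactive")
        .getOrCreate()
    )

    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show()
    spark.stop()

if __name__ == "__main__":
    main()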
9. Shut down the cluster.
$ sparkctl stop
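When you are finished, release the compute nodes by exiting the shell that salloc created (or by cancelling the job with scancel):

$ exit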