Run Python jobs on a Spark Cluster in a script
In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run
Spark jobs in Python from a script through pyspark-client and the Spark Connect Server.
The key difference from the other tutorials is that this one uses sparkctl as a
Python library, which hides the details of starting the cluster and setting environment variables.
Allocate compute nodes, such as with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 complete nodes for Spark workers.
$ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
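If you prefer batch submission to an interactive allocation, the same heterogeneous request can be written as an sbatch script. The following is a minimal sketch, not taken verbatim from the sparkctl documentation: the script name run_spark_job.py is a placeholder for the Python script created below, and the partition, memory, and account values simply mirror the salloc example above.

#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH -n4
#SBATCH --partition=shared
#SBATCH --mem=30G
#SBATCH --account=<your-account>
#SBATCH hetjob
#SBATCH -N2
#SBATCH --mem=240G
#SBATCH --account=<your-account>

module load python
source ~/python-envs/sparkctl
python run_spark_job.py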
Activate the Python environment that contains sparkctl.
$ module load python
$ source ~/python-envs/sparkctl
Add the code below to a Python script. This code block will configure and start the Spark cluster, run your Spark job, and then stop the cluster.
from sparkctl import ClusterManager, make_default_spark_config

# This loads your global sparkctl configuration file (~/.sparkctl.toml).
config = make_default_spark_config()
# Set runtime options as desired.
# config.runtime.driver_memory_gb = 20
# config.runtime.use_local_storage = True

mgr = ClusterManager(config)
with mgr.managed_cluster() as spark:
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show()
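The same pattern extends to real workloads. The sketch below is illustrative rather than part of the sparkctl documentation: the input path ./data/input.parquet, the output path ./data/output, and the runtime option values are placeholders you would replace for your own job. It uses only the sparkctl calls shown above plus standard PySpark DataFrame operations.

from pyspark.sql import functions as F

from sparkctl import ClusterManager, make_default_spark_config

config = make_default_spark_config()
config.runtime.driver_memory_gb = 20     # illustrative value
config.runtime.use_local_storage = True  # illustrative value

mgr = ClusterManager(config)
with mgr.managed_cluster() as spark:
    # Read a Parquet dataset, aggregate, and write the result.
    df = spark.read.parquet("./data/input.parquet")            # placeholder path
    result = df.groupBy("a").agg(F.sum("b").alias("total_b"))
    result.write.mode("overwrite").parquet("./data/output")    # placeholder path

Run the script from within the allocation, for example with python run_spark_job.py. Because managed_cluster() is a context manager, the Spark cluster is stopped when the with block exits.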