Run Python jobs on a Spark Cluster with Spark Connect
In this tutorial you will learn how to start a Spark cluster on HPC compute nodes with the sparkctl command-line interface and then run Spark jobs in Python through pyspark-client with the Spark Connect Server.
Allocate compute nodes, for example with Slurm. This example requests a heterogeneous job: 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 whole nodes for the Spark workers.
$ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
Activate the Python environment that contains sparkctl.
$ module load python
$ source ~/python-envs/sparkctl/bin/activate
Configure the Spark cluster. The sparkctl code will detect the compute nodes based on Slurm environment variables.
$ sparkctl configure
Optionally, inspect the Spark configuration in ./conf.
Start the cluster.
$ sparkctl start
Set the environment variables SPARK_CONF_DIR and JAVA_HOME. This ensures that your application uses the Spark settings created in the configure step. The exact instructions will be printed to the console. For example:
$ export SPARK_CONF_DIR=$(pwd)/conf
$ export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
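Optionally, before connecting, you can verify from a Python prompt that both variables are visible to your session. This quick check is purely illustrative and not part of sparkctl:
>>> import os
>>> print(os.environ.get("SPARK_CONF_DIR"))
>>> print(os.environ.get("JAVA_HOME"))
Both calls should print the paths you exported above rather than None.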
Connect to the Spark Connect Server and run a job.
$ python
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
>>> df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
>>> df.show()
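The same kind of job can also be run non-interactively. The following is a minimal sketch, assuming the Spark Connect Server is listening on the default URL used above (sc://localhost:15002) and that you save it to a hypothetical file such as spark_connect_job.py:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to the Spark Connect Server started by sparkctl.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Build a small DataFrame and run a simple aggregation on the cluster.
df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
df.select(F.sum("a").alias("sum_a"), F.avg("b").alias("avg_b")).show()

# Release the Spark Connect session.
spark.stop()
Run it with python spark_connect_job.py from the same shell in which SPARK_CONF_DIR and JAVA_HOME are exported.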