Run Python jobs interactively on a Spark Cluster
In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run
Spark jobs in Python through pyspark-client
with the Spark Connect Server.
Allocate compute nodes, for example with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 whole nodes for the Spark workers.
$ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
Activate the Python environment that contains sparkctl.
$ module load python
$ source ~/python-envs/sparkctl/bin/activate
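Optionally, sanity-check the heterogeneous allocation before going further. Here is a minimal sketch, assuming Slurm's usual hetjob environment variables (the exact names can vary with your Slurm version):

import os

# Inspect the heterogeneous allocation: group 0 holds the master/driver
# resources, group 1 holds the worker nodes. Variable names follow Slurm's
# hetjob conventions and may differ on your system.
for group in (0, 1):
    nodes = os.environ.get(f"SLURM_JOB_NODELIST_HET_GROUP_{group}")
    print(f"het group {group}: {nodes}")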
Configure and start the Spark cluster.
Note
This workflow requires that you enable the Spark Connect Server.
from sparkctl import ClusterManager, make_default_spark_config

# This loads your global sparkctl configuration file (~/.sparkctl.toml).
config = make_default_spark_config()
config.runtime.start_connect_server = True
# Set other options as desired.
mgr = ClusterManager(config)
mgr.configure()
2025-07-12 13:00:24.327 | INFO | sparkctl.cluster_manager:_add_spark_settings_to_defaults_file:281 - Set driver memory to 10 GB
2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:_config_executors:352 - Configured Spark to start 2 executors
2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:_config_executors:353 - Set spark.sql.shuffle.partitions=10 and spark.executor.memory=2g
2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:configure:100 - Configured Spark workers to use /scratch/dthom/sparkctl/spark_scratch for shuffle data.
2025-07-12 13:00:24.329 | INFO | sparkctl.cluster_manager:_write_workers:456 - Wrote worker 1 to /scratch/dthom/sparkctl/conf/workers
2025-07-12 13:00:24.329 | INFO | sparkctl.cluster_manager:configure:108 - Wrote sparkctl configuration to /scratch/dthom/repos/sparkctl/config.json
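The last log line shows where the resulting settings were written. If you want to confirm what was configured, you can read the file back; a minimal sketch, assuming the file is JSON and substituting the path from your own log output:

import json
from pathlib import Path

# Path comes from the "Wrote sparkctl configuration to ..." log line above;
# substitute your own.
config_path = Path("config.json")
print(json.dumps(json.loads(config_path.read_text()), indent=2))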
mgr.start()
starting org.apache.spark.deploy.master.Master, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.master.Master-1-dthom-39537s.out
2025-07-12 13:00:32.052 | INFO | sparkctl.cluster_manager:_start:176 - Started Spark master processes on dthom-39537s
starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /scratch/dthom/repos/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.sql.connect.service.SparkConnectServer-1-dthom-39537s.out
2025-07-12 13:00:34.764 | INFO | sparkctl.cluster_manager:_start:181 - Started Spark connect server
starting org.apache.spark.deploy.worker.Worker, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.worker.Worker-1-dthom-39537s.out
2025-07-12 13:00:37.648 | INFO | sparkctl.cluster_manager:_start:200 - Spark worker memory = 4 GB
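The mgr.get_spark_session() call in the next step returns a session bound to this cluster. If you instead want to attach from another process, pyspark's Spark Connect client can connect directly; a minimal sketch, assuming the server listens on localhost at Spark Connect's default port 15002:

from pyspark.sql import SparkSession

# Attach to the Spark Connect server started above. 15002 is Spark's
# default Connect port; adjust the host and port if your cluster differs.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()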
Run a Spark job.
spark = mgr.get_spark_session()
df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
df.show(n=5)
+---+---+
|  a|  b|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
|  3|  4|
|  4|  5|
+---+---+
only showing top 5 rows
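Anything the Spark Connect client supports works the same way. As an illustration using only standard PySpark APIs, here is a small aggregation over the same DataFrame; the work executes on the Spark workers:

from pyspark.sql import functions as F

# Bucket the rows, then count and sum per bucket on the workers.
summary = (
    df.withColumn("bucket", F.col("a") % 100)
      .groupBy("bucket")
      .agg(F.count("*").alias("rows"), F.sum("b").alias("sum_b"))
      .orderBy("bucket")
)
summary.show(n=3)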
Shut down the cluster.
mgr.stop()
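If you script this session instead of running it interactively, it is worth guarding the shutdown so the cluster is stopped even when the job fails. A minimal sketch using only the calls shown above:

mgr.start()
try:
    spark = mgr.get_spark_session()
    spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"]).show(n=5)
finally:
    # Always release the Spark processes, even if the job raises.
    mgr.stop()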