Run Python jobs interactively on a Spark Cluster

In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run Spark jobs in Python through the pyspark-client package and the Spark Connect Server.

  1. Allocate compute nodes, for example with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 whole nodes for the Spark workers. In the command below, the colon separates the two components of a Slurm heterogeneous job: the options before it apply to the master/driver allocation, and those after it apply to the worker nodes.

    $ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
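
    Depending on your site's Slurm configuration, salloc typically starts a shell within the allocation; run the remaining steps from that shell.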
    
  2. Activate the Python environment that contains sparkctl.

    $ module load python
    $ source ~/python-envs/sparkctl/bin/activate
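
    You can confirm that sparkctl is importable before proceeding:

    $ python -c "import sparkctl"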
    
  3. Configure and start the Spark cluster.

    Note

    This workflow requires that you enable the Spark Connect Server.

    from sparkctl import ClusterManager, make_default_spark_config
    
    # This loads your global sparkctl configuration file (~/.sparkctl.toml).
    config = make_default_spark_config()
    config.runtime.start_connect_server = True
    # Set other options as desired.
    mgr = ClusterManager(config)
    mgr.configure()
    
    2025-07-12 13:00:24.327 | INFO     | sparkctl.cluster_manager:_add_spark_settings_to_defaults_file:281 - Set driver memory to 10 GB
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:_config_executors:352 - Configured Spark to start 2 executors
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:_config_executors:353 - Set spark.sql.shuffle.partitions=10 and spark.executor.memory=2g
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:configure:100 - Configured Spark workers to use /scratch/dthom/sparkctl/spark_scratch for shuffle data.
    2025-07-12 13:00:24.329 | INFO     | sparkctl.cluster_manager:_write_workers:456 - Wrote worker 1 to /scratch/dthom/sparkctl/conf/workers
    2025-07-12 13:00:24.329 | INFO     | sparkctl.cluster_manager:configure:108 - Wrote sparkctl configuration to /scratch/dthom/repos/sparkctl/config.json
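
    The configure step only writes the Spark configuration files (noted in the log output above); no cluster processes start until you call start().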
    
    mgr.start()
    
    starting org.apache.spark.deploy.master.Master, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.master.Master-1-dthom-39537s.out
    2025-07-12 13:00:32.052 | INFO     | sparkctl.cluster_manager:_start:176 - Started Spark master processes on dthom-39537s
    starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /scratch/dthom/repos/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.sql.connect.service.SparkConnectServer-1-dthom-39537s.out
    2025-07-12 13:00:34.764 | INFO     | sparkctl.cluster_manager:_start:181 - Started Spark connect server
    starting org.apache.spark.deploy.worker.Worker, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.worker.Worker-1-dthom-39537s.out
    2025-07-12 13:00:37.648 | INFO     | sparkctl.cluster_manager:_start:200 - Spark worker memory = 4 GB
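
    Because the Spark Connect Server is running, any Spark Connect client can also attach to the cluster directly, as an alternative to the mgr.get_spark_session() call shown in the next step. A minimal sketch, assuming the server listens on Spark's default Connect port 15002; replace <head-node> with the node where sparkctl started the server:

    from pyspark.sql import SparkSession

    # Attach to the running Spark Connect server. The URL is an assumption:
    # 15002 is Spark's default Connect port; substitute the actual hostname.
    spark = SparkSession.builder.remote("sc://<head-node>:15002").getOrCreate()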
    
  4. Run a Spark job.

    # Create a Spark session connected to the cluster and run a small job.
    spark = mgr.get_spark_session()
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show(n=5)
    
    +---+---+
    |  a|  b|
    +---+---+
    |  0|  1|
    |  1|  2|
    |  2|  3|
    |  3|  4|
    |  4|  5|
    +---+---+
    only showing top 5 rows 
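
    Any standard PySpark DataFrame operation works against this session. For example, a simple aggregation using the pyspark.sql.functions API (standard PySpark, not specific to sparkctl):

    from pyspark.sql import functions as F

    # Bucket rows by the parity of column "a" and sum column "b" per bucket.
    df.groupBy((F.col("a") % 2).alias("parity")).agg(F.sum("b").alias("total")).show()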
    
  5. Shut down the cluster.

    mgr.stop()
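
    When you are finished, exit the salloc shell (or let the walltime expire) to release the compute nodes back to Slurm.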