# Run Python jobs interactively on a Spark Cluster

In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run Spark jobs in Python through `pyspark-client` with the Spark Connect Server.

1. Allocate compute nodes, such as with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 complete nodes for the Spark workers.

   ```console
   $ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account= --mem=240G
   ```

2. Activate the Python environment that contains sparkctl.

   ```console
   $ module load python
   $ source ~/python-envs/sparkctl/bin/activate
   ```

3. Configure and start the Spark cluster.

   ```{eval-rst}
   .. note:: This workflow requires that you enable the Spark Connect Server.
   ```

   ```python
   from sparkctl import ClusterManager, make_default_spark_config

   # This loads your global sparkctl configuration file (~/.sparkctl.toml).
   config = make_default_spark_config()
   config.runtime.start_connect_server = True
   # Set other options as desired.
   mgr = ClusterManager(config)
   mgr.configure()
   ```

   ```console
   2025-07-12 13:00:24.327 | INFO | sparkctl.cluster_manager:_add_spark_settings_to_defaults_file:281 - Set driver memory to 10 GB
   2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:_config_executors:352 - Configured Spark to start 2 executors
   2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:_config_executors:353 - Set spark.sql.shuffle.partitions=10 and spark.executor.memory=2g
   2025-07-12 13:00:24.328 | INFO | sparkctl.cluster_manager:configure:100 - Configured Spark workers to use /scratch/dthom/sparkctl/spark_scratch for shuffle data.
   2025-07-12 13:00:24.329 | INFO | sparkctl.cluster_manager:_write_workers:456 - Wrote worker 1 to /scratch/dthom/sparkctl/conf/workers
   2025-07-12 13:00:24.329 | INFO | sparkctl.cluster_manager:configure:108 - Wrote sparkctl configuration to /scratch/dthom/repos/sparkctl/config.json
   ```

   ```python
   mgr.start()
   ```

   ```console
   starting org.apache.spark.deploy.master.Master, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.master.Master-1-dthom-39537s.out
   2025-07-12 13:00:32.052 | INFO | sparkctl.cluster_manager:_start:176 - Started Spark master processes on dthom-39537s
   starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /scratch/dthom/repos/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.sql.connect.service.SparkConnectServer-1-dthom-39537s.out
   2025-07-12 13:00:34.764 | INFO | sparkctl.cluster_manager:_start:181 - Started Spark connect server
   starting org.apache.spark.deploy.worker.Worker, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.worker.Worker-1-dthom-39537s.out
   2025-07-12 13:00:37.648 | INFO | sparkctl.cluster_manager:_start:200 - Spark worker memory = 4 GB
   ```

   While the cluster is running, separate Python processes can also attach through the Spark Connect Server (see the client sketch after step 5).

4. Run a Spark job. A variation that forces a shuffle across the workers appears after step 5.

   ```python
   spark = mgr.get_spark_session()
   df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
   df.show(n=5)
   ```

   ```console
   +---+---+
   |  a|  b|
   +---+---+
   |  0|  1|
   |  1|  2|
   |  2|  3|
   |  3|  4|
   |  4|  5|
   +---+---+
   only showing top 5 rows
   ```

5. Shut down the cluster.

   ```python
   mgr.stop()
   ```
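Before shutting down, you can run queries that exercise the whole cluster. The sketch below uses only standard PySpark DataFrame APIs on the `df` from step 4; the modulo bucketing is an illustrative choice, not part of the tutorial, that forces a shuffle across the executors configured in step 3 (`spark.sql.shuffle.partitions=10`).

```python
from pyspark.sql import functions as F

# Reuse `df` from step 4. Bucketing by `a % 10` and aggregating triggers a
# shuffle, so the work is distributed across the configured executors.
(
    df.withColumn("bucket", F.col("a") % 10)
    .groupBy("bucket")
    .agg(F.count("*").alias("rows"), F.sum("b").alias("sum_b"))
    .orderBy("bucket")
    .show()
)
```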
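Because step 3 enabled the Spark Connect Server, a separate Python process (for example, a Jupyter kernel on another node) can attach to the same cluster while it is up. This is a minimal sketch using PySpark's standard Spark Connect client; the endpoint is an assumption: replace `dthom-39537s` with the master node shown in your own `mgr.start()` output, and 15002 is Spark Connect's default port (check the sparkctl logs if your deployment uses a different one).

```python
# A minimal sketch for a separate process, assuming pyspark >= 3.4 with the
# Spark Connect client installed (pip install "pyspark[connect]").
# Hypothetical endpoint: master host from step 3's logs, default port 15002.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://dthom-39537s:15002").getOrCreate()
spark.range(5).show()
```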
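Finally, an optional defensive pattern, using only the calls already shown above: wrapping the job in `try`/`finally` guarantees that step 5 runs even if the job raises, so the Slurm allocation is not left holding an idle cluster.

```python
try:
    spark = mgr.get_spark_session()
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show(n=5)
finally:
    # Always stop the Spark processes, even if the job above fails.
    mgr.stop()
```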