# Run Python jobs on a Spark Cluster with Spark Connect

In this tutorial you will learn how to start a Spark cluster on HPC compute nodes with the sparkctl command-line interface and then run Spark jobs in Python through `pyspark-client` with the Spark Connect server.

1. Allocate compute nodes, such as with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process plus the user application and Spark driver, and two complete nodes for Spark workers.

   ```console
   $ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account= --mem=240G
   ```

2. Activate the Python environment that contains sparkctl.

   ```console
   $ module load python
   $ source ~/python-envs/sparkctl/bin/activate
   ```

3. Configure the Spark cluster. sparkctl detects the compute nodes from Slurm environment variables.

   ```console
   $ sparkctl configure
   ```

4. Optionally, inspect the Spark configuration files in `./conf`.

5. Start the cluster.

   ```console
   $ sparkctl start
   ```

6. Set the environment variables `SPARK_CONF_DIR` and `JAVA_HOME` so that your application uses the Spark settings created in step 3. Instructions will be printed to the console. For example:

   ```console
   $ export SPARK_CONF_DIR=$(pwd)/conf
   $ export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
   ```

7. Connect to the Spark Connect server and run a job.

   ```console
   $ python
   ```

   ```python
   >>> from pyspark.sql import SparkSession
   >>> spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
   >>> df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
   >>> df.show()
   ```
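The same connection can drive a full job from a standalone script instead of the interactive shell. The following is a minimal sketch, assuming the cluster from this tutorial is still running on the default Spark Connect port 15002 and that `SPARK_CONF_DIR` and `JAVA_HOME` are set as in step 6; the script name and output path are illustrative only.

```python
# spark_connect_job.py -- minimal sketch of a standalone Spark Connect job.
# Assumes the tutorial's cluster is running and step 6's environment is set.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to the Spark Connect server started by sparkctl (default port 15002).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Build a small DataFrame and run an aggregation; the work executes on the cluster.
df = spark.createDataFrame([(x % 10, x * 2.5) for x in range(1000)], ["key", "value"])
result = (
    df.groupBy("key")
    .agg(F.avg("value").alias("avg_value"), F.count("*").alias("n"))
    .orderBy("key")
)
result.show()

# Persist the result as Parquet; the path is an example and, because writes run
# on the cluster side, it should point at a filesystem the workers can reach.
result.write.mode("overwrite").parquet("./output/avg_by_key.parquet")

spark.stop()
```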
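When you are finished, shut the cluster down before the Slurm allocation expires. Assuming your sparkctl version provides a stop command (check `sparkctl --help` to confirm; this is an assumption, not something shown earlier in this tutorial), the shutdown would look like:

```console
$ sparkctl stop
```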