# Run Python jobs on a Spark Cluster in a script

In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run Spark jobs in Python through `pyspark-client` and the Spark Connect server, all from a script. The key difference from the other tutorials is that this one uses `sparkctl` as a Python library to hide the details of starting the cluster and setting environment variables.

1. Allocate compute nodes, such as with Slurm. This example acquires 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 complete nodes for Spark workers.

    ```console
    $ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account= --mem=240G
    ```

2. Activate the Python environment that contains sparkctl.

    ```console
    $ module load python
    $ source ~/python-envs/sparkctl/bin/activate
    ```

3. Add the code below to a Python script. It configures and starts the Spark cluster, runs your Spark job, and then stops the cluster.

    ```python
    from sparkctl import ClusterManager, make_default_spark_config

    # This loads your global sparkctl configuration file (~/.sparkctl.toml).
    config = make_default_spark_config()

    # Set runtime options as desired.
    # config.runtime.driver_memory_gb = 20
    # config.runtime.use_local_storage = True

    mgr = ClusterManager(config)
    with mgr.managed_cluster() as spark:
        df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
        df.show()
    ```
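
As a variation on the script in step 3, here is a minimal sketch of a more typical job: it aggregates a small DataFrame and writes the result to Parquet inside the same `managed_cluster()` context. It reuses only the `sparkctl` calls shown above; the runtime setting, column names, and output path are assumptions to adjust for your own allocation and filesystem.

```python
from pyspark.sql import functions as F

from sparkctl import ClusterManager, make_default_spark_config

# Load the global sparkctl configuration (~/.sparkctl.toml), as in step 3.
config = make_default_spark_config()
# Assumed runtime tweak; size it to match the memory requested from Slurm.
config.runtime.driver_memory_gb = 20

mgr = ClusterManager(config)
with mgr.managed_cluster() as spark:
    # Build a small DataFrame, aggregate it, and write the result.
    df = spark.createDataFrame([(x % 10, x) for x in range(1000)], ["key", "value"])
    totals = df.groupBy("key").agg(F.sum("value").alias("total"))
    # Hypothetical output path; use a location visible to the compute nodes.
    totals.write.mode("overwrite").parquet("output/totals.parquet")
```

Run the script from the allocation created in step 1, with the environment from step 2 active, for example `python my_spark_job.py` (the filename is arbitrary). The cluster starts when the `with` block is entered and stops when it exits, so the script leaves no Spark processes running on the compute nodes.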