Start a Spark Cluster

This page assumes that you have allocated compute nodes via Slurm.

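For example, an interactive allocation on two compute nodes might look like the following (the account name, node count, and walltime are placeholders for your own values):

    $ salloc --nodes=2 --account=<your_account> --time=01:00:00
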
  1. Activate the Python environment that contains sparkctl.

    $ module load python
    $ source ~/python-envs/sparkctl/bin/activate
    
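     To confirm that the environment is active, you can check that the sparkctl executable resolves inside it:

    $ which sparkctl
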
  2. Configure the Spark cluster. sparkctl detects the compute nodes from Slurm environment variables.

     $ sparkctl configure
    
  3. Optionally, inspect the Spark configuration files generated in ./conf.

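     For example, you can list the generated files and view the main settings. The file names shown here (spark-defaults.conf, spark-env.sh) are standard Spark configuration files; the exact set depends on your sparkctl settings.

    $ ls ./conf
    $ cat ./conf/spark-defaults.conf
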
  4. Start the cluster.

    $ sparkctl start
    
  5. Set the SPARK_CONF_DIR environment variable so that your application uses the Spark settings created in step 2. Instructions are printed to the console; by default, the setting is

    $ export SPARK_CONF_DIR=$(pwd)/conf
    
  6. Set the JAVA_HOME environment variable to the same Java installation that Spark uses. This path should be listed in your ~/.sparkctl.toml configuration file.

    $ export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
    
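     To confirm that JAVA_HOME points to a working JDK, you can optionally run:

    $ $JAVA_HOME/bin/java -version
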
  7. Run your application.

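     For example, a PySpark script can be submitted with spark-submit. This is a minimal sketch: my_app.py is a placeholder for your own script, and it assumes the master URL and other settings are read from the configuration generated in step 2 (via SPARK_CONF_DIR).

    $ spark-submit my_app.py
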
  8. Shut down the cluster.

    $ sparkctl stop