Installation

  1. Create a virtual environment with Python 3.11 or later. This example uses the standard-library venv module to create one in your home directory.

    You may prefer conda or mamba; a conda equivalent is shown after the venv command below.

    If you are running on an HPC, you may need to run module load python first.

    $ python -m venv ~/python-envs/sparkctl
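    If you use conda instead, a roughly equivalent environment can be created like this (the environment name is arbitrary):

    $ conda create -n sparkctl python=3.11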
    
  2. Activate the virtual environment.

    $ source ~/python-envs/sparkctl/bin/activate
    

    Whenever you are done using sparkctl, you can deactivate the environment by running deactivate.
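    For example:

    $ deactivate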

  3. Install the Python package sparkctl.

    If you will be using Spark Connect to run Spark jobs, the base installation is sufficient.

    Note

    This does not include spark-submit or pyspark.

    $ pip install sparkctl
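    To confirm the base installation, you can check that the command-line interface is available (assuming the package installs a sparkctl entry point on your PATH, as the later steps imply):

    $ sparkctl --help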
    

    If you will be running Spark jobs with spark-submit or pyspark, you will need the full pyspark package. Install it with the pyspark extra:

    $ pip install sparkctl[pyspark]
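    If you installed the pyspark extra, a quick sanity check is to confirm that pyspark imports and that spark-submit is on your PATH:

    $ python -c "import pyspark; print(pyspark.__version__)"
    $ spark-submit --version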
    
  4. Optionally, install from the main branch (or substitute another branch or tag).

    $ pip install git+https://github.com/NREL/sparkctl.git@main
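    For example, to install a specific tag rather than main (the tag name here is illustrative):

    $ pip install "git+https://github.com/NREL/sparkctl.git@v0.1.0"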
    
  5. Create the sparkctl default configuration file (a one-time step). The parameters will vary based on your environment. If the required dependencies have not been deployed in your environment, refer to Deploy sparkctl in an HPC environment.

    $ sparkctl default-config \
        /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
        /datasets/images/apache_spark/jdk-21.0.7 \
        --compute-environment slurm
    
    Wrote sparkctl settings to /Users/dthom/.sparkctl.toml
    

    Refer to sparkctl default-config --help for additional options.

    The paths to the Spark binaries are unlikely to change often. This file also seeds the default values for your sparkctl configure commands, so you may want to edit those settings manually.
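    If you want to adjust the seeded defaults, edit the file at the path reported above (typically .sparkctl.toml in your home directory):

    $ $EDITOR ~/.sparkctl.toml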