Installation
Create a virtual environment with Python 3.11 or later. This example uses the
venv module in the standard library to create a virtual environment in your home
directory. You may prefer conda or mamba. If you are running on an HPC, you may
need to module load python first.

$ python -m venv ~/python-envs/sparkctl
Activate the virtual environment.
$ source ~/python-envs/sparkctl/bin/activate
Whenever you are done using sparkctl, you can deactivate the environment by
running deactivate.

Install the Python package sparkctl. If you will be using Spark Connect to run
Spark jobs, the base installation is sufficient, as illustrated below.
Note
This does not include spark-submit or pyspark.
$ pip install sparkctl
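For example, with a Spark Connect server already running, you can open a session
from Python. This is a minimal sketch, not sparkctl-specific; it assumes the
pyspark Spark Connect client API is available in your environment and that the
server is listening on localhost at port 15002 (Spark Connect's default).
Substitute the address of your own server.

from pyspark.sql import SparkSession

# Connect to a running Spark Connect server. The URL is an assumption;
# replace it with the address your server reports.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Run a trivial query to confirm the connection works.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(df.count())
spark.stop()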
If you will be running Spark jobs with spark-submit or pyspark, you will need to
install the full pyspark package. This command will do that:

$ pip install sparkctl[pyspark]
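To check the full installation, you can start a local pyspark session. This is a
minimal sketch; it runs Spark in-process on your machine, with no cluster
involved, and it assumes a JDK is available on your PATH.

from pyspark.sql import SparkSession

# Start a local, in-process Spark session and print its version.
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print(spark.version)
spark.stop()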
Optionally, install from the main branch (or substitute another branch or tag).
$ pip install git+https://github.com/NREL/sparkctl.git@main
Create a one-time sparkctl default configuration file. The parameters will vary based on your environment. If no one has deployed the required dependencies in your environment, please refer to Deploy sparkctl in an HPC environment.
$ sparkctl default-config \
    /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    --compute-environment slurm
Wrote sparkctl settings to /Users/dthom/.sparkctl.toml

Refer to sparkctl default-config --help for additional options.

The paths to the Spark binaries will likely not change often. This file also
seeds the default values for your sparkctl configure commands, so you may want
to edit those settings manually.