Installation
Create a virtual environment with Python 3.11 or later. This example uses the
venv module in the standard library to create a virtual environment in your
home directory. You may prefer conda or mamba. If you are running on an HPC,
you may need to module load python first.

$ python -m venv ~/python-envs/sparkctl
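If you prefer conda or mamba, a roughly equivalent setup looks like the
following (the environment name is illustrative); in that case, activate the
environment with conda activate sparkctl instead of the source command below.

$ conda create -n sparkctl python=3.11
$ conda activate sparkctl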
Activate the virtual environment.
$ source ~/python-envs/sparkctl/bin/activate
Whenever you are done using sparkctl, you can deactivate the environment by
running deactivate.

Install the Python package sparkctl. If you will be using Spark Connect to run
Spark jobs, the base installation is sufficient.
Note: This does not include spark-submit or pyspark.
$ pip install sparkctl
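As a sketch of how a Spark Connect session might be started (assuming a Spark
Connect server is already running and the client libraries are available; the
URL is illustrative, with 15002 being the default Spark Connect port):

from pyspark.sql import SparkSession

# Connect to a running Spark Connect server (address is illustrative).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Run a trivial query to confirm the connection works.
spark.range(5).show()
spark.stop()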
If you will be running Spark jobs with spark-submit or pyspark, you will need
to install the full pyspark package. This command will do that:

$ pip install sparkctl[pyspark]
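To confirm that the command-line tools are on your path after the full
installation, you can check their versions:

$ spark-submit --version
$ pyspark --version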
Optionally, install from the main branch (or substitute another branch or tag).
$ pip install git+https://github.com/NREL/sparkctl.git@main
Create a one-time sparkctl default configuration file. The parameters will vary based on your environment. If no one has deployed the required dependencies in your environment, please refer to Deploy sparkctl in an HPC environment.
$ sparkctl default-config \
    /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    --compute-environment slurm
Wrote sparkctl settings to /Users/dthom/.sparkctl.toml
Refer to sparkctl default-config --help for additional options.

The paths to the Spark binaries will likely not change often. This file will
also seed the default values for your sparkctl configure commands, so you may
want to manually edit those settings.
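For example, to adjust those defaults you can open the settings file in your
preferred editor (the path matches the default-config output above):

$ $EDITOR ~/.sparkctl.toml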