# Installation

1. Create a virtual environment with Python 3.11 or later. This example uses the `venv` module in the standard library to create a virtual environment in your home directory. You may prefer `conda` or `mamba`. If you are running on an HPC system, you may need to `module load python` first.

   ```console
   $ python -m venv ~/python-envs/sparkctl
   ```

2. Activate the virtual environment.

   ```console
   $ source ~/python-envs/sparkctl/bin/activate
   ```

   Whenever you are done using sparkctl, you can deactivate the environment by running `deactivate`.

3. Install the Python package `sparkctl`. If you will be using Spark Connect to run Spark jobs, the base installation is sufficient.

   ```{eval-rst}
   .. note:: This does not include ``spark-submit`` or ``pyspark``.
   ```

   ```console
   $ pip install sparkctl
   ```

   If you will be running Spark jobs with `spark-submit` or `pyspark`, install `sparkctl` with the `pyspark` extra, which pulls in the full `pyspark` package (the quotes keep your shell from interpreting the brackets):

   ```console
   $ pip install "sparkctl[pyspark]"
   ```

   A quick way to verify either installation is shown after this list.

4. Optionally, install from the main branch (or substitute another branch or tag).

   ```console
   $ pip install git+https://github.com/NREL/sparkctl.git@main
   ```

5. Create a one-time sparkctl default configuration file. The parameters will vary based on your environment. If no one has deployed the required dependencies in your environment, please refer to {ref}`deploy-sparkctl`.

   ```console
   $ sparkctl default-config \
       /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
       /datasets/images/apache_spark/jdk-21.0.7 \
       --compute-environment slurm
   ```

   ```console
   Wrote sparkctl settings to /Users/dthom/.sparkctl.toml
   ```

   Refer to `sparkctl default-config --help` for additional options. The paths to the Spark binaries are unlikely to change often. This file also seeds the default values for your `sparkctl configure` commands, so you may want to edit those settings manually; an example of inspecting the file follows this list.
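To confirm that an installation succeeded, you can check the installed package metadata and try the imports. This is a minimal sanity check that assumes nothing beyond a standard pip install; the `pyspark` import will only succeed if you installed the `pyspark` extra.

```console
$ pip show sparkctl
$ python -c "import sparkctl"  # works with the base install
$ python -c "import pyspark; print(pyspark.__version__)"  # only with the pyspark extra
```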
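Because the settings file seeds the defaults for your later `sparkctl configure` commands, it can be worth reviewing before you run jobs. The exact keys depend on your sparkctl version and the options you passed to `default-config`; on a Linux or HPC system the file is written to your home directory:

```console
$ cat ~/.sparkctl.toml
```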