CLI Reference

sparkctl

sparkctl commands

sparkctl [OPTIONS] COMMAND [ARGS]...

Options

-c, --console-level <console_level>

Console log level

Default: 'INFO'

-f, --file-level <file_level>

File log level

Default: 'DEBUG'

-r, --reraise-exceptions

Reraise unhandled sparkctl exceptions.

Default: False
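
Global options go before the subcommand name. For example, to run configure with debug console logging and re-raised exceptions:

$ sparkctl -c DEBUG -r configure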

configure

Create a Spark cluster configuration.

sparkctl configure [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

-s, --spark-scratch <spark_scratch>

Directory to use for shuffle data.

Default: PosixPath('spark_scratch')

-e, --executor-cores <executor_cores>

Number of cores per executor

Default: 5

-E, --executor-memory-gb <executor_memory_gb>

Memory per executor in GB. By default, this is determined automatically from the memory available on the node. It can also be increased implicitly by raising executor_cores, since fewer, larger executors fit on each node, leaving more memory for each.

-M, --driver-memory-gb <driver_memory_gb>

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

Default: 10

-o, --node-memory-overhead-gb <node_memory_overhead_gb>

Memory to reserve for system processes.

Default: 10

--dynamic-allocation, --no-dynamic-allocation

Enable Spark dynamic resource allocation.

Default: False

-m, --shuffle-partition-multiplier <shuffle_partition_multiplier>

Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs). For example, a multiplier of 4 with 72 worker CPUs yields 288 shuffle partitions.

Default: 1

-t, --spark-defaults-template-file <spark_defaults_template_file>

Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

--local-storage, --no-local-storage

Use compute node local storage for shuffle data.

Default: False

--connect-server, --no-connect-server

Enable the Spark Connect server.

Default: False

--history-server, --no-history-server

Enable the Spark history server.

Default: False

--thrift-server, --no-thrift-server

Enable the Thrift server so that SQL clients can connect.

Default: False

-l, --spark-log-level <spark_log_level>

Set the root log level for all Spark processes. If not set, Spark's own defaults apply.

Options: debug | info | warn | error

--hive-metastore, --no-hive-metastore

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

Default: False

--postgres-hive-metastore, --no-postgres-hive-metastore

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

Default: False

-w, --metastore-dir <metastore_dir>

Set a custom directory for the metastore and warehouse.

Default: PosixPath('.')

-P, --python-path <python_path>

Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

--resource-monitor, --no-resource-monitor

Enable resource monitoring.

Default: False

--start, --no-start

Start the cluster after configuration.

Default: False

--use-current-python, --no-use-current-python

Use the Python executable in the current environment for Spark workers. --python-path takes precedence.

Default: True

Examples:

$ sparkctl configure --start

$ sparkctl configure --shuffle-partition-multiplier 4 --local-storage

$ sparkctl configure --local-storage --thrift-server
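
If the cluster was configured with --connect-server or --thrift-server, standard Spark clients can attach once it is running. The commands below are a sketch that assumes the servers listen on Spark's stock ports (15002 for Spark Connect, 10000 for the Thrift server) on the local host; adjust the host and port to match your deployment.

$ pyspark --remote "sc://localhost:15002"

$ beeline -u jdbc:hive2://localhost:10000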

default-config

Create a sparkctl config file that defines paths to Spark binaries. This is a one-time requirement when installing sparkctl in a new environment.

sparkctl default-config [OPTIONS] SPARK_PATH JAVA_PATH

Options

-d, --directory <directory>

Directory in which to create the sparkctl config file.

Default: the user's home directory

-e, --compute-environment <compute_environment>

Compute environment

Options: native | slurm

-H, --hadoop-path <hadoop_path>

Directory containing Hadoop binaries.

-h, --hive-tarball <hive_tarball>

File containing Hive binaries.

-p, --postgresql-jar-file <postgresql_jar_file>

Path to PostgreSQL jar file.

Arguments

SPARK_PATH

Required argument

JAVA_PATH

Required argument

Examples:

$ sparkctl default-config \
    /datasets/images/apache-spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache-spark/jdk-21.0.7 \
    -e slurm

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e native
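
The optional Hadoop, Hive, and PostgreSQL inputs can be registered in the same call. All paths below are placeholders for illustration:

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e slurm \
    -H ~/hadoop-3.4.0 \
    -h ~/apache-hive-4.0.0-bin.tar.gz \
    -p ~/postgresql-42.7.3.jar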

start

Start a Spark cluster with an existing configuration.

sparkctl start [OPTIONS]

Options

--wait, --no-wait

If True, wait until the user presses Ctrl-C or timeout is reached and then stop the cluster. If False, start the cluster and exit.

Default: False

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

-t, --timeout <timeout>

If --wait is set, timeout in minutes. Defaults to no timeout.

Examples:

$ sparkctl start

$ sparkctl start --directory ./my-spark-config

$ sparkctl start --wait
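
To run the cluster for at most 30 minutes and then shut it down, combine the documented --wait and --timeout options:

$ sparkctl start --wait --timeout 30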

stop

Stop a Spark cluster.

sparkctl stop [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

Examples:

$ sparkctl stop

$ sparkctl stop --directory ./my-spark-config
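
Putting it together, a minimal end-to-end session using the commands documented above (the default-config step is a one-time requirement per environment):

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e native
$ sparkctl configure --start
$ sparkctl stop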