CLI Reference

sparkctl

sparkctl commands

sparkctl [OPTIONS] COMMAND [ARGS]...

Options

-c, --console-level <console_level>

Console log level

Default: 'INFO'

-f, --file-level <file_level>

File log level

Default: 'DEBUG'

-r, --reraise-exceptions

Reraise unhandled sparkctl exceptions.

Default: False
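
Global options go before the subcommand name. For example, to run configure with debug console logging and re-raised exceptions:

$ sparkctl -c DEBUG -r configure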

configure

Create a Spark cluster configuration.

sparkctl configure [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

-s, --spark-scratch <spark_scratch>

Directory to use for shuffle data.

Default: PosixPath('spark_scratch')

-e, --executor-cores <executor_cores>

Number of cores per executor

Default: 5

-E, --executor-memory-gb <executor_memory_gb>

Memory per executor in GB. By default, this is determined automatically from the memory available on the node. It can also be increased implicitly by raising executor_cores, since fewer, larger executors fit on each node, leaving more memory for each.

-M, --driver-memory-gb <driver_memory_gb>

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

Default: 10

-o, --node-memory-overhead-gb <node_memory_overhead_gb>

Memory to reserve for system processes.

Default: 10

--dynamic-allocation, --no-dynamic-allocation

Enable Spark dynamic resource allocation.

Default: False

-m, --shuffle-partition-multiplier <shuffle_partition_multiplier>

Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs). For example, a multiplier of 4 with 72 worker CPUs yields 288 shuffle partitions.

Default: 1

-t, --spark-defaults-template-file <spark_defaults_template_file>

Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

--local-storage, --no-local-storage

Use compute node local storage for shuffle data.

Default: False

--connect-server, --no-connect-server

Enable the Spark Connect server.

Default: False

--history-server, --no-history-server

Enable the Spark history server.

Default: False

--thrift-server, --no-thrift-server

Enable the Thrift server so that SQL clients can connect.

Default: False

-l, --spark-log-level <spark_log_level>

Set the root log level for all Spark processes. If not set, Spark's own defaults apply.

Options: debug | info | warn | error

--hive-metastore, --no-hive-metastore

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

Default: False

--postgres-hive-metastore, --no-postgres-hive-metastore

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

Default: False

-w, --metastore-dir <metastore_dir>

Set a custom directory for the metastore and warehouse.

Default: PosixPath('.')

-P, --python-path <python_path>

Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

--resource-monitor, --no-resource-monitor

Enable resource monitoring.

Default: False

--start, --no-start

Start the cluster after configuration.

Default: False

--use-current-python, --no-use-current-python

Use the Python executable in the current environment for Spark workers. --python-path takes precedence.

Default: True

Examples:

$ sparkctl configure --start

$ sparkctl configure --shuffle-partition-multiplier 4 --local-storage

$ sparkctl configure --local-storage --thrift-server
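
If the cluster was configured with --connect-server or --thrift-server, standard Spark clients can attach once it is running. The commands below are a sketch that assumes the servers listen on Spark's stock ports (15002 for Spark Connect, 10000 for the Thrift server) on the local host; adjust the host and port to match your deployment.

$ pyspark --remote "sc://localhost:15002"

$ beeline -u jdbc:hive2://localhost:10000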

default-config

Create a sparkctl config file that defines paths to Spark binaries. This is a one-time requirement when installing sparkctl in a new environment.

sparkctl default-config [OPTIONS] SPARK_PATH JAVA_PATH

Options

-d, --directory <directory>

Directory in which to create the sparkctl config file.

Default: the user's home directory

-e, --compute-environment <compute_environment>

Compute environment

Options: native | slurm

-H, --hadoop-path <hadoop_path>

Directory containing Hadoop binaries.

-h, --hive-tarball <hive_tarball>

File containing Hive binaries.

-p, --postgresql-jar-file <postgresql_jar_file>

Path to PostgreSQL jar file.

Arguments

SPARK_PATH

Required argument

JAVA_PATH

Required argument

Examples:

$ sparkctl default-config \
    /datasets/images/apache-spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache-spark/jdk-21.0.7 \
    -e slurm

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e native
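
The optional Hadoop, Hive, and PostgreSQL inputs can be registered in the same call. All paths below are placeholders for illustration:

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e slurm \
    -H ~/hadoop-3.4.0 \
    -h ~/apache-hive-4.0.0-bin.tar.gz \
    -p ~/postgresql-42.7.3.jar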

start

Start a Spark cluster with an existing configuration.

sparkctl start [OPTIONS]

Options

--wait, --no-wait

If True, wait until the user presses Ctrl-C or timeout is reached and then stop the cluster. If False, start the cluster and exit.

Default: False

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

-t, --timeout <timeout>

If --wait is set, timeout in minutes. Defaults to no timeout.

Examples:

$ sparkctl start

$ sparkctl start --directory ./my-spark-config

$ sparkctl start --wait
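
To run the cluster for at most 30 minutes and then shut it down, combine the documented --wait and --timeout options:

$ sparkctl start --wait --timeout 30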

stop

Stop a Spark cluster.

sparkctl stop [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default: PosixPath('.')

Examples:

$ sparkctl stop

$ sparkctl stop --directory ./my-spark-config
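
Putting it together, a minimal end-to-end session using the commands documented above (the default-config step is a one-time requirement per environment):

$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e native
$ sparkctl configure --start
$ sparkctl stop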