CLI Reference¶
sparkctl¶
sparkctl commands
sparkctl [OPTIONS] COMMAND [ARGS]...
Options
- -c, --console-level <console_level>¶
Console log level
- Default:
'INFO'
- -f, --file-level <file_level>¶
File log level
- Default:
'DEBUG'
- -r, --reraise-exceptions¶
Reraise unhandled sparkctl exceptions.
- Default:
False
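Per the synopsis above, global options are placed before the subcommand. A minimal sketch using only the options documented on this page:
$ sparkctl -c DEBUG -r configure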
configure¶
Create a Spark cluster configuration.
sparkctl configure [OPTIONS]
Options
- -d, --directory <directory>¶
Base directory for the cluster configuration
- Default:
PosixPath('.')
- -s, --spark-scratch <spark_scratch>¶
Directory to use for shuffle data.
- Default:
PosixPath('spark_scratch')
- -e, --executor-cores <executor_cores>¶
Number of cores per executor
- Default:
5
- -E, --executor-memory-gb <executor_memory_gb>¶
Memory per executor in GB. By default, this is determined automatically from the memory available on the node. It can also be set implicitly by increasing executor_cores.
- -M, --driver-memory-gb <driver_memory_gb>¶
Driver memory in GB. This is the maximum amount of data that can be pulled into the application.
- Default:
10
- -o, --node-memory-overhead-gb <node_memory_overhead_gb>¶
Memory to reserve for system processes.
- Default:
10
- --dynamic-allocation, --no-dynamic-allocation¶
Enable Spark dynamic resource allocation.
- Default:
False
- -m, --shuffle-partition-multiplier <shuffle_partition_multiplier>¶
Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs)
- Default:
1
- -t, --spark-defaults-template-file <spark_defaults_template_file>¶
Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.
- --local-storage, --no-local-storage¶
Use compute node local storage for shuffle data.
- Default:
False
- --connect-server, --no-connect-server¶
Enable the Spark connect server.
- Default:
False
- --history-server, --no-history-server¶
Enable the Spark history server.
- Default:
False
- --thrift-server, --no-thrift-server¶
Enable the Thrift server so that SQL clients can connect.
- Default:
False
- -l, --spark-log-level <spark_log_level>¶
Set the root log level for all Spark processes. If not set, Spark's own defaults are used.
- Options:
debug | info | warn | error
- --hive-metastore, --no-hive-metastore¶
Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.
- Default:
False
- --postgres-hive-metastore, --no-postgres-hive-metastore¶
Create a metastore with PostgreSQL. Supports multiple Spark sessions.
- Default:
False
- -w, --metastore-dir <metastore_dir>¶
Set a custom directory for the metastore and warehouse.
- Default:
PosixPath('.')
- -P, --python-path <python_path>¶
Python path to set for Spark workers. By default, use the Python bundled with the Spark distribution.
- --resource-monitor, --no-resource-monitor¶
Enable resource monitoring.
- Default:
False
- --start, --no-start¶
Start the cluster after configuration.
- Default:
False
- --use-current-python, --no-use-current-python¶
Use the Python executable in the current environment for Spark workers. --python-path takes precedence.
- Default:
True
Examples:
$ sparkctl configure --start
$ sparkctl configure --shuffle-partition-multiplier 4 --local-storage
$ sparkctl configure --local-storage --thrift-server
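A further sketch combining only flags documented above: configure a PostgreSQL-backed metastore together with the Spark Connect server and start the cluster immediately:
$ sparkctl configure --postgres-hive-metastore --connect-server --start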
default-config¶
Create a sparkctl config file that defines paths to Spark binaries. This is a one-time requirement when installing sparkctl in a new environment.
sparkctl default-config [OPTIONS] SPARK_PATH JAVA_PATH
Options
- -d, --directory <directory>¶
Directory in which to create the sparkctl config file.
- Default:
PosixPath('<home directory>')
- -e, --compute-environment <compute_environment>¶
Compute environment
- Options:
native | slurm
- -H, --hadoop-path <hadoop_path>¶
Directory containing Hadoop binaries.
- -h, --hive-tarball <hive_tarball>¶
File containing Hive binaries.
- -p, --postgresql-jar-file <postgresql_jar_file>¶
Path to PostgreSQL jar file.
Arguments
- SPARK_PATH¶
Required argument
- JAVA_PATH¶
Required argument
Examples:
$ sparkctl default-config /datasets/images/apache-spark/spark-4.0.0-bin-hadoop3 /datasets/images/apache-spark/jdk-21.0.7 -e slurm
$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e native
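The optional Hadoop, Hive, and PostgreSQL paths can be set in the same step (these appear to back the metastore options of sparkctl configure); the paths and versions below are placeholders, not shipped defaults:
$ sparkctl default-config ~/apache-spark/spark-4.0.0-bin-hadoop3 ~/jdk-21.0.8 -e slurm -H ~/hadoop-3.4.0 -h ~/apache-hive-4.0.0-bin.tar.gz -p ~/postgresql-42.7.3.jar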
start¶
Start a Spark cluster with an existing configuration.
sparkctl start [OPTIONS]
Options
- --wait, --no-wait¶
If True, wait until the user presses Ctrl-C or the timeout is reached, then stop the cluster. If False, start the cluster and exit.
- Default:
False
- -d, --directory <directory>¶
Base directory for the cluster configuration
- Default:
PosixPath('.')
- -t, --timeout <timeout>¶
If --wait is set, timeout in minutes. Defaults to no timeout.
Examples:
$ sparkctl start
$ sparkctl start --directory ./my-spark-config
$ sparkctl start --wait
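To hold the cluster for a bounded interactive session, --wait can be combined with --timeout (in minutes); the value below is arbitrary:
$ sparkctl start --wait --timeout 120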
stop¶
Stop a Spark cluster.
sparkctl stop [OPTIONS]
Options
- -d, --directory <directory>¶
Base directory for the cluster configuration
- Default:
PosixPath('.')
Examples:
$ sparkctl stop
$ sparkctl stop --directory ./my-spark-config