Explanation¶
This section provides conceptual background on sparkctl’s design, architecture, and the reasoning behind key decisions.
Why sparkctl Exists¶
High-Performance Computing (HPC) environments present unique challenges for running Apache Spark:
No cloud infrastructure: HPCs lack the managed services (EMR, Dataproc, HDInsight) that handle Spark cluster lifecycle in cloud environments.
No administrative access: Users typically cannot install system-wide services or configure YARN/Kubernetes clusters.
Ephemeral allocations: Compute nodes are allocated temporarily through job schedulers like Slurm, not provisioned permanently.
sparkctl bridges this gap by automating Spark cluster configuration and lifecycle management within the constraints of HPC environments.
Architecture Overview¶
sparkctl manages a complete Spark standalone cluster within your Slurm allocation:
flowchart TB
subgraph allocation["Your Slurm Allocation"]
subgraph master["Slurm Head Node"]
spark_master["Spark Master"]
driver["Your Driver Application"]
end
subgraph workers["Worker Nodes"]
subgraph worker1["Worker Node 1"]
sw1["Spark Worker"]
exec1["Executors"]
end
subgraph worker2["Worker Node 2"]
sw2["Spark Worker"]
exec2["Executors"]
end
end
driver --> spark_master
spark_master --> sw1
spark_master --> sw2
sw1 --> exec1
sw2 --> exec2
end
Key components:
Spark Master: Coordinates the cluster, tracks available workers
Spark Workers: Run on each compute node, manage local resources
Executors: JVM processes that run your tasks, spawned by workers
Driver: Your application, which submits tasks to the cluster
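As a rough sketch of how the driver relates to the rest of the cluster, a PySpark application points its session at the standalone master's URL. The hostname below is a placeholder, 7077 is Spark's default standalone port, and sparkctl's generated configuration may already set the master URL for you:

from pyspark.sql import SparkSession

# "head-node" is a placeholder for the Slurm head node running the Spark Master;
# 7077 is Spark's default standalone master port.
spark = (
    SparkSession.builder
    .master("spark://head-node:7077")
    .appName("sparkctl-example")
    .getOrCreate()
)

# The driver submits work to the master, which schedules it onto executors.
print(spark.range(1_000).count())
spark.stop()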
Why Standalone Mode?¶
Spark supports three cluster managers: Standalone, YARN, and Kubernetes. sparkctl uses Standalone mode exclusively because:
| Factor | Standalone | YARN/Kubernetes |
|---|---|---|
| Setup complexity | Minimal | Requires cluster-wide services |
| Admin access needed | No | Yes |
| Ephemeral clusters | Natural fit | Designed for persistent clusters |
| Resource isolation | Per-allocation | Cluster-wide scheduling |
For ephemeral, user-controlled clusters on HPCs, Standalone mode is the pragmatic choice. YARN and Kubernetes add complexity without benefit when your cluster exists only for the duration of a single Slurm job.
Resource Calculation¶
sparkctl automatically calculates optimal Spark settings based on your allocation:
Executors per node: Based on available CPU cores and memory, leaving headroom for the worker process
Memory per executor: Based on available memory
Shuffle partitions: Based on total executor cores across the cluster
These calculations follow Spark best practices for memory management and parallelism. You can
override any setting via spark-defaults.conf if needed.
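The exact formulas are internal to sparkctl, but a simplified sketch of the kind of arithmetic involved looks roughly like this. The per-executor core count and headroom values are illustrative assumptions, not sparkctl's actual defaults:

# Illustrative sketch only; sparkctl's real calculations may differ.
def plan_resources(cores_per_node, mem_gb_per_node, num_nodes,
                   cores_per_executor=5, worker_headroom_gb=4):
    # Leave headroom on each node for the Spark Worker process.
    usable_mem_gb = mem_gb_per_node - worker_headroom_gb

    # Executors per node: constrained by available CPU cores.
    executors_per_node = max(1, cores_per_node // cores_per_executor)

    # Memory per executor: split the usable memory evenly.
    memory_per_executor_gb = usable_mem_gb // executors_per_node

    # Shuffle partitions: tied to total executor cores across the cluster.
    total_cores = executors_per_node * cores_per_executor * num_nodes
    shuffle_partitions = total_cores * 2

    return executors_per_node, memory_per_executor_gb, shuffle_partitions

# Example: 2 nodes, each with 36 cores and 90 GB of memory.
print(plan_resources(cores_per_node=36, mem_gb_per_node=90, num_nodes=2))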
The Cluster Lifecycle¶
sparkctl manages four phases:
Configure (sparkctl configure): Analyzes your Slurm allocation and generates Spark configuration files in ./conf/
Start (sparkctl start): Launches the Spark Master on the head node and Spark Workers on all nodes
Run: Your applications connect to the cluster and execute
Stop (sparkctl stop): Gracefully shuts down all Spark processes
This explicit lifecycle gives you full control and visibility, unlike managed services that hide these details.
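Inside an interactive Slurm allocation, the four phases might look like the following. The commands come straight from the lifecycle above; how your application finds the cluster during the run phase depends on the generated configuration and the client you use:

$ sparkctl configure    # analyze the Slurm allocation and write ./conf/
$ sparkctl start        # launch the Spark Master and Workers
$ # run phase: connect a client or submit an application (see the sections below)
$ sparkctl stop         # gracefully shut down all Spark processes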
Configuration Files¶
sparkctl uses two configuration mechanisms:
User Configuration (~/.sparkctl.toml)¶
This TOML file stores environment-specific settings that rarely change:
[binaries]
spark_home = "/path/to/spark"
java_home = "/path/to/java"
These settings tell sparkctl where to find Spark and Java. You can also set global settings that
apply every time you run sparkctl configure. For example, if you always want to use Spark Connect,
you can set spark_connect_server = true and avoid having to set it each time you configure.
Runtime Configuration (./conf/)¶
Each time you run sparkctl configure, it generates a ./conf/ directory containing:
spark-defaults.conf: Spark settings (executors, memory, shuffle partitions)
spark-env.sh: Environment variables for Spark processes
workers: List of worker hostnames
log4j2.properties: Logging configuration
These files are specific to your current allocation and working directory.
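As an illustration, the generated spark-defaults.conf typically carries standard Spark properties such as the following. The property names are real Spark settings, but the values shown here are placeholders rather than what sparkctl would compute for your allocation:

spark.executor.cores            5
spark.executor.memory           20g
spark.sql.shuffle.partitions    200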
Custom Spark Settings¶
To customize Spark settings beyond what sparkctl generates:
Create a template file (e.g., spark-defaults.conf.template)
Add your custom settings
Run sparkctl configure --spark-defaults-template-file spark-defaults.conf.template
sparkctl will include your settings and append its calculated values.
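For example, a spark-defaults.conf.template might hold settings that sparkctl does not calculate for you. The specific properties below are only illustrations; use whatever your application needs:

spark.eventLog.enabled          true
spark.eventLog.dir              /path/to/spark-events
spark.sql.session.timeZone      UTC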
Spark Cluster Overview¶
For general Spark architecture concepts, refer to the official Spark Cluster Overview.
Submitting Applications¶
Please refer to Spark’s Submitting Applications
documentation for details on spark-submit options.
To get all submission tools in a Python environment:
$ pip install pyspark
Clients for other languages are available at the main Spark downloads page.
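As a minimal sketch, an application can be submitted to the standalone cluster like this. The hostname and script name are placeholders, 7077 is Spark's default standalone port, and the --master flag may be unnecessary if the generated configuration already sets the master URL:

$ spark-submit --master spark://<head-node>:7077 my_app.py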
Spark Connect¶
Spark Connect is a relatively new feature that simplifies client installation and configuration. It provides a thin client that communicates with the cluster via gRPC, avoiding the need to install the full Spark distribution.
Benefits:
Smaller client installation (pyspark-client instead of full pyspark)
Better isolation between client and cluster Python environments
Simpler remote connectivity
Enable the Spark Connect server:
$ sparkctl configure --connect-server
Install only the client:
$ pip install pyspark-client
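Once the Connect server is running, a thin client attaches through a remote URL instead of a local Spark installation. The hostname below is a placeholder; 15002 is Spark Connect's default port:

from pyspark.sql import SparkSession

# "head-node" is a placeholder for the node running the Spark Connect server;
# 15002 is Spark Connect's default gRPC port.
spark = SparkSession.builder.remote("sc://head-node:15002").getOrCreate()

print(spark.range(10).count())
spark.stop()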
Note the caveats around Spark Connect: some APIs behave differently than they do with a classic (non-Connect) Spark session.