Frequently Asked Questions

Installation and Setup

Why do I need both sparkctl and pyspark?

sparkctl manages the Spark cluster (starting/stopping processes, generating configuration). pyspark provides the Spark client libraries you use to submit jobs and interact with the cluster. They serve different purposes:

  • sparkctl: Cluster lifecycle management

  • pyspark: Spark application development and submission

You can install both together with pip install "sparkctl[pyspark]".

What’s the difference between ~/.sparkctl.toml and ./conf/?

  • ~/.sparkctl.toml: Your personal configuration that persists across sessions. Contains paths to Spark/Java and HPC-specific settings. Set this once per user.

  • ./conf/: Generated fresh each time you run sparkctl configure. Contains Spark configuration specific to your current HPC allocation and working directory.

See Configuration Files for details.

Running Jobs

My job runs in Spark local mode instead of attaching to the cluster master.

Ensure SPARK_CONF_DIR points at the directory generated by sparkctl configure (typically ./conf in the directory where you ran it) before running your application:

$ export SPARK_CONF_DIR=/path/to/spark/conf

My job fails with “Java not found” or “JAVA_HOME not set”

Ensure JAVA_HOME is set correctly before running your application:

$ export JAVA_HOME=/path/to/java

The Java path should match what’s in your ~/.sparkctl.toml. sparkctl uses this for Spark processes, but your application also needs it set.

I’m getting an error about mismatched Python versions.

The cluster may be running a different Python version than your application. Set this environment variable before running your application:

$ export PYSPARK_PYTHON=$(which python)
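
If you need to confirm which interpreter each side is actually using, a quick check like the following can help (a minimal sketch; it assumes a classic, non-Connect SparkSession attached to the cluster):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Python used by the driver (your application)
print("driver:  ", sys.version)

# Python used by the executors: run a trivial task that reports its interpreter version
executor_version = spark.sparkContext.parallelize([0], 1).map(lambda _: __import__("sys").version).first()
print("executor:", executor_version)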

I’m getting an error about mismatched Spark versions.

If you are running pyspark/spark-submit after installing via pip install "sparkctl[pyspark]", your pyspark version must match the cluster's Spark version exactly. For example, client version 4.0.1 is incompatible with cluster version 4.0.0.
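
To see the client-side version, check the installed pyspark package (a quick sketch; the cluster's version is printed in the master logs and shown in the master web UI):

import pyspark
print(pyspark.__version__)   # must match the cluster's Spark version exactly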

Why can’t my workers connect to the master?

Common causes:

  1. High-bandwidth nodes: Some NREL Kestrel compute nodes have two network cards, which Spark cannot handle. Set --constraint lbw when allocating nodes.

Check the Spark master logs in ./spark_scratch/logs/ for connection errors.

How do I connect to the Spark Web UI?

The Spark master serves a web UI on port 8080, and the driver (your application) serves one on port 4040. Since HPC compute nodes aren’t directly accessible, use SSH tunneling from your local machine, replacing <compute-node> with the node running the Spark master:

$ ssh -L 8080:<compute-node>:8080 user@hpc-login-node

Then open http://localhost:8080 in your browser. Forward port 4040 the same way to reach the driver UI at http://localhost:4040.

See Spark web UI for more details.

Why is my job slow?

Common causes:

  1. Shuffle spilling: Executors are writing shuffle data to disk. Increase executor memory or use faster local storage.

  2. Data skew: Some partitions have much more data than others. Consider salting keys.

  3. Too few partitions: Increase spark.sql.shuffle.partitions.

  4. Too many partitions: Decrease partitions if you have many small tasks.

  5. Slow storage: Ensure shuffle storage uses fast local SSDs, not shared filesystem.

  6. Query too complex: If you are trying to run a very complex query where subtasks have different data sizes and partitioning needs, consider breaking the query into smaller parts with different settings. Persist intermediate results to the filesystem so that you can checkpoint and make incremental progress.

  7. Non-ideal partitioning: If you partition by column in the same query as your main work, especially when you have significantly increased the shuffle partitions, persist the main work first and repartition in a second step (see the sketch after this list).
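
Below is a minimal sketch of the persist-then-repartition pattern from items 6 and 7. The paths, column names, and partition counts are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stage 1: the heavy aggregation, tuned with a high shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", 1000)
df = spark.read.parquet("/scratch/input")
result = df.groupBy("key").agg({"value": "sum"})
result.write.mode("overwrite").parquet("/scratch/intermediate")  # checkpoint to disk

# Stage 2: reload the checkpointed result and partition it by column,
# with settings suited to that (much smaller) job.
spark.conf.set("spark.sql.shuffle.partitions", 200)
final = spark.read.parquet("/scratch/intermediate").repartition("key")
final.write.partitionBy("key").mode("overwrite").parquet("/scratch/output")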

See Debugging for performance troubleshooting.

Configuration

How do I increase executor memory?

sparkctl calculates executor memory automatically based on available RAM. To override:

Edit ./conf/spark-defaults.conf after running sparkctl configure but before starting the cluster.
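
For example, to request 80 GB per executor (the value is illustrative; choose something that fits your nodes), add or edit this line in ./conf/spark-defaults.conf:

spark.executor.memory 80g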

Can I use YARN or Kubernetes instead of Standalone mode?

No. sparkctl is designed specifically for Standalone mode because:

  • It requires no cluster-wide services

  • It works without admin access

  • It’s designed for ephemeral, user-controlled clusters

If you have YARN or Kubernetes available, you likely don’t need sparkctl.

Slurm Integration

What Slurm allocation should I use?

A typical allocation for sparkctl:

$ salloc -t 01:00:00 -n4 --mem=30G : -N2 --mem=240G

This creates:

  • A “master” allocation (4 cores, 30GB) for the Spark master and your driver

  • Worker nodes (2 full nodes) for Spark executors

See Heterogeneous Slurm jobs for details.

Can I use sparkctl in a batch script (sbatch)?

Yes. Example sbatch script:

#!/bin/bash
#SBATCH --account=my_account
#SBATCH --nodes=3
#SBATCH --time=02:00:00
#SBATCH --mem=240G

# Activate the Python environment that provides sparkctl.
source ~/python-envs/sparkctl/bin/activate

# Generate Spark configuration for this allocation and start the cluster.
sparkctl configure
sparkctl start

# Point the application at the generated configuration and at Java.
export SPARK_CONF_DIR=$(pwd)/conf
export JAVA_HOME=/path/to/java

# Submit the job to the master (running on this node), then shut the cluster down.
spark-submit --master spark://$(hostname):7077 my-job.py
sparkctl stop

What happens if a compute node fails?

By default, Slurm kills the entire job if any node fails. For long-running jobs, see How to handle compute node failures.

Spark Connect

When should I use Spark Connect?

Use Spark Connect when:

  • You want a minimal client installation (pyspark-client is much smaller than pyspark)

  • You’re connecting from a remote machine

  • You want better isolation between client and cluster Python environments

What are the limitations of Spark Connect?

Some Spark APIs behave differently with Spark Connect. See the official documentation for details.

How do I enable Spark Connect?

$ sparkctl configure --connect-server
$ sparkctl start

Then connect with:

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://hostname:15002").getOrCreate()
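
A quick way to confirm the session works (purely illustrative):

spark.range(5).show()   # runs a trivial query through the Connect server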

Troubleshooting

Where are the log files?

  • Spark master/worker logs: ./spark_scratch/logs/

  • sparkctl log: ./sparkctl.log

  • Executor logs: Available via Spark Web UI or in worker directories

  $ find spark_scratch/workers/ -name stderr -exec ls -lh {} \;

How do I clean up after a failed job?

$ sparkctl stop  # Stop any running Spark processes
$ rm -rf ./conf ./spark_scratch/logs  # Remove generated files

sparkctl hangs during startup

Check if Spark processes from a previous job are still running:

$ ps aux | grep $USER
$ sparkctl stop

Also verify your Slurm allocation is still active:

$ squeue --me