Debugging

This page describes how to debug certain Spark errors when using sparkctl.

Spark web UI

The web UI is a good first place to look for problems. Connect to ports 8080 (the Spark master UI) and 4040 (the application UI) on the nodes running Spark. You may need to create an SSH tunnel. Here is an example that creates a tunnel between a user’s Mac laptop and NREL’s Kestrel HPC (adjust as necessary for Windows):

$ export COMPUTE_NODE=<your-compute-node-name>
$ ssh -L 4040:$COMPUTE_NODE:4040 -L 8080:$COMPUTE_NODE:8080 $USER@kestrel.hpc.nrel.gov

After configuring the tunnel, open your browser to http://localhost:4040 to access the application UI (or http://localhost:8080 for the master UI).

Log files

sparkctl configures Spark to record log files under the base directory. Logs for the Spark master, workers, connect server, etc., will be in ./spark_scratch/logs. Executor logs will be in ./spark_scratch/workers/app-*/*/stderr.

For example,

$ tree spark_scratch
spark_scratch
├── local
├── logs
│   ├── spark-dthom-org.apache.spark.deploy.master.Master-1-dthom-39537s.out
│   ├── spark-dthom-org.apache.spark.deploy.worker.Worker-1-dthom-39537s.out
│   └── spark-dthom-org.apache.spark.sql.connect.service.SparkConnectServer-1-dthom-39537s.out
└── workers
    └── app-20250712164208-0000
        ├── 0
        │   ├── stderr
        │   └── stdout
        └── 1
            ├── stderr
            └── stdout

Connection errors are often logged in the master/worker/connect server log files. Problems with queries are often visible in the executor stderr files. For example, if a job seems stalled, you can tail the stderr files to see what is happening:

$ tail -f spark_scratch/workers/*/*/stderr

If you have many executors, you may want to tail only the most recent ones. Identify them with:

$ find spark_scratch -type f -name stderr -exec stat -f '%m %Sm %N' {} + 2>/dev/null | sort -n

Spark shuffle partitions

A common cause of poor performance when running complex queries is a non-ideal setting of spark.sql.shuffle.partitions. The default Spark value is 200. Some online sources recommend setting it to 1-4x the total number of CPUs in your cluster.

sparkctl tries to help by automatically detecting the number of CPUs in your cluster and setting this value to that number. You can give sparkctl a multiplier to increase this. For example, if your cluster has 800 CPUs and you want 4x that, set this option when you configure the cluster:

$ sparkctl configure --shuffle-partition-multiplier 4 

Now you will get 3200 shuffle partitions. To set a specific value instead, run sparkctl configure, modify conf/spark-defaults.conf with your value, and then run sparkctl start.
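For example, adding a line like this to conf/spark-defaults.conf pins the value explicitly (3200 is only an illustration; substitute the value appropriate for your cluster):

spark.sql.shuffle.partitions 3200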

This video by a Spark developer offers a recommendation that has worked out very well for us.

Use this formula:

  num_partitions = max_shuffle_write_size / target_partition_size

You will have to run your job once to determine max_shuffle_write_size. You can find it on the Spark UI Stages tab in the Shuffle Write column. Your target_partition_size should be between 128 and 200 MB.

The minimum value for num_partitions should be the total number of cores in the cluster, unless you want to leave some cores available for other jobs that may be running simultaneously.
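For example, with illustrative numbers: if the largest Shuffle Write reported on the Stages tab is 2 TB and you target 200 MB partitions, then

  num_partitions = 2,000,000 MB / 200 MB = 10,000

If the result is smaller than the total core count of your cluster, round it up to the core count.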

Slow local storage

Spark will write lots of temporary data to local storage during shuffle operations. If your joins cause lots of shuffling, it is very important that your local storage be fast. If you direct Spark to use the shared filesystem (e.g., Lustre) for local storage, your mileage will vary. If the system is idle, it might work. If it is saturated, your job may fail. A lot depends on the backend storage of your shared filesystem.

If you suspect that jobs are timing out because of slow shuffle writes and reads, use local storage instead. You can do this by acquiring compute nodes with local storage and then setting this option when you configure your Spark cluster:

$ sparkctl configure --local-storage

Executors are spilling to disk

If you observe that your Spark job is running very slowly, you may have a problem with executors spilling to disk. This occurs when the executor memory is too small to hold all of the data that it is processing. When this happens, Spark will spill data to disk, which is much slower than processing in memory.

One way to detect this condition is to look at the executor stderr log files. You will see this log message over and over:

25/07/15 14:24:06 INFO UnsafeExternalSorter: Thread 60 spilling sort data of 4.6 GiB to disk (201  times so far)

If this has occurred 201 times, then you might be better off canceling the job and re-running with a different configuration.

First, ensure that you have set spark.sql.shuffle.partitions to a reasonable value, as discussed above.

Second, your executor memory may be too small; for example, your compute node’s CPU-to-memory ratio may be too high. Suppose that, in the example above where the executor spilled 4.6 GiB of data to disk, the executor was configured with 7 GB of memory. You could reconfigure the cluster to double the executor memory. This can be done with the sparkctl configure command, either indirectly by doubling the executor cores (--executor-cores=10) or directly by setting --executor-memory-gb. You can also set the spark.executor.memory value in the conf/spark-defaults.conf file in between running sparkctl configure and sparkctl start.
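For example, either of the following would double a 7 GB executor (the numbers are illustrative):

$ sparkctl configure --executor-memory-gb=14

or, in conf/spark-defaults.conf:

spark.executor.memory 14g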

Third, you may be experiencing data skew. This has happened frequently for us when we perform disaggregation or duplication mappings that explode data sizes. Refer to the next section.

Data skew

This topic is addressed by many online sources. Here is a technique that has worked for us.

If your query will produce high data skew, you can use a salting technique to balance the data. For example (num_partitions is the number of salt buckets, and value is a placeholder for the numeric column being summed):

from pyspark.sql import functions as F

df.withColumn("salt_column", (F.rand() * num_partitions).cast("int")) \
    .groupBy("county", "salt_column") \
    .agg(F.sum("value").alias("partial_sum")) \
    .groupBy("county") \
    .agg(F.sum("partial_sum"))  # re-aggregate to combine the per-salt partial sums

Here is a resource from Databricks with more information.

Performance monitoring

If your job is running slowly, check CPU utilization on your compute nodes.

Here is an example of how to run htop on multiple nodes simultaneously with tmux.

Download this script
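If the download is unavailable, the sketch below conveys the general idea; it is an approximation of such a script, not the actual ssh-multi:

#!/bin/bash
# Open one tmux pane per node, ssh into each, and synchronize keyboard input.
session=ssh-multi
tmux new-session -d -s "$session" "ssh $1"
shift
for node in "$@"; do
    tmux split-window -t "$session" "ssh $node"
    tmux select-layout -t "$session" tiled
done
tmux set-window-option -t "$session" synchronize-panes on
tmux attach-session -t "$session"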

Run it like this:

$ ./ssh-multi node1 node2 ... nodeN

It will start tmux with one pane for each node and synchronized panes enabled. Typing in one pane types into every pane, so you can start htop on all nodes at once:

$ htop

Note

The file ./conf/workers contains the node names in your cluster.

Automated resource monitoring

sparkctl integrates with rmon to collect resource utilization stats from all compute nodes in your cluster. You can enable this monitoring when you configure the cluster:

$ sparkctl configure --resource-monitor

The stats will be summarized and plotted in HTML files in ./stats-output when you stop the cluster.

Spark tuning resources

  • Spark tuning guide from Apache: https://spark.apache.org/docs/latest/tuning.html

  • Spark tuning guide from Amazon: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

  • Performance recommendations: https://www.youtube.com/watch?v=daXEp4HmS-E&t=4251s