NREL Kestrel

This package was originally developed on NREL's Kestrel HPC system, which uses Slurm to manage compute resources.

This page contains information specific to Kestrel that users must consider.

Spark software

Spark is installed on Kestrel. Use these locations when you configure your environment.

/datasets/images/apache_spark
├── apache-hive-4.0.1-bin.tar.gz
├── hadoop-3.4.1/
├── jdk-21.0.7/
├── postgresql-42.7.4.jar
└── spark-4.0.0-bin-hadoop3/

For example, pass these locations to sparkctl default-config:

$ sparkctl default-config \
    /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    --hadoop-path /datasets/images/apache_spark/hadoop-3.4.1 \
    --hive-tarball /datasets/images/apache_spark/apache-hive-4.0.1-bin.tar.gz \
    --postgresql-jar-file /datasets/images/apache_spark/postgresql-42.7.4.jar \
    --compute-environment slurm
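
If you want to sanity-check these paths before generating the configuration, a small shell loop works (the loop is illustrative and not part of sparkctl):

$ for p in \
    /datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    /datasets/images/apache_spark/hadoop-3.4.1 \
    /datasets/images/apache_spark/apache-hive-4.0.1-bin.tar.gz \
    /datasets/images/apache_spark/postgresql-42.7.4.jar
  do [ -e "$p" ] || echo "missing: $p"
  done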

Compute nodes

Low-bandwidth compute nodes

Users must use only low-bandwidth compute nodes. The “high-bandwidth” nodes are problematic for Spark: they contain two network interface cards (NICs) with different IP addresses, so an application addressing a node by hostname may be routed to either NIC, while Spark listens on only one address. As a result, many intra-cluster network messages are dropped and the cluster fails.
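
You can see the issue directly on an allocated node: standard Linux tooling lists the interfaces, and on a high-bandwidth node two NICs with distinct IP addresses will appear (output varies by node):

$ ip -brief addr show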

Solution: Specify the constraint lbw (for low bandwidth) whenever you request nodes with salloc/srun/sbatch.

Example:

$ salloc --account <my-account> -t 01:00:00 --constraint lbw
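
The same constraint applies to batch scripts; a minimal sbatch sketch (the account and the job steps are placeholders):

#!/bin/bash
#SBATCH --account=<my-account>
#SBATCH --time=01:00:00
#SBATCH --constraint=lbw

# Start the Spark cluster and run your application here.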

Heterogeneous Slurm jobs

Kestrel offers a shared partition where you can request a subset of a compute node’s CPUs and memory. Prefer this partition for your Spark master process and user application, as discussed on the Heterogeneous Slurm jobs page; a sketch follows.
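
A minimal sketch of such a request, using Slurm’s heterogeneous-job syntax (the colon separates the components; the sizes are illustrative, and some options may need to be repeated per component on your system). The first component, on the shared partition, hosts the Spark master and your application; the second provides whole low-bandwidth nodes for Spark workers.

$ salloc --account <my-account> -t 01:00:00 \
    --partition shared --nodes 1 --mem 16G : \
    --nodes 2 --constraint lbw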

Local storage for Spark shuffle writes

A small portion of Kestrel compute nodes (roughly 256 of them) have local storage; the drives are ~1.6 TB. Prefer these nodes for Spark workers if your jobs will shuffle lots of data: the local drives are much faster than Lustre and give significantly better shuffle performance. Request them with the nvme partition:

$ salloc --account <my-account> -t 01:00:00 --partition nvme
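
Once the allocation starts, you can confirm the local drive is present. The mount point below is an assumption (local disk is commonly mounted at /tmp/scratch on NREL systems); check the Kestrel documentation if it differs:

$ df -h /tmp/scratch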

Enable local storage for shuffle writes with:

$ sparkctl configure --local-storage

Lustre will likely be sufficient for most jobs. If you will shuffle a large amount of data (terabytes) and are unsure how many local-storage nodes you would need to have enough room, start with Lustre (the default).

Lustre file striping

If you use Lustre for shuffle writes, consider applying custom stripe settings to optimize performance. Run this command on your runtime directory before writing any data to it.

$ lfs setstripe -E 256K -L mdt -E 8M -c -1 -S 1M -E -1 -c -1 -p flash /scratch/$USER/work-dir
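
To confirm the layout was applied, inspect the directory’s default stripe settings:

$ lfs getstripe -d /scratch/$USER/work-dir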