Deploy sparkctl in an HPC environment¶
This is a one-time operation to be performed by an administrator or a user with write access to a common location on the shared filesystem.
If you are a sparkctl user, you should not need to perform this step unless you want to use a custom version of Spark.
Sparkctl requires that Apache Spark and all dependent software be installed on a shared filesystem accessible by all compute nodes.
At minimum, this includes Spark and Java. If your users want to use a PostgreSQL-based Hive metastore, you must also download Apache Hive, Hadoop, and the PostgreSQL JDBC driver jar file.
Here is an example filesystem layout:
/datasets/images/apache_spark
├── apache-hive-4.0.1-bin.tar.gz
├── hadoop-3.4.1/
├── jdk-21.0.7/
├── postgresql-42.7.4.jar
└── spark-4.0.0-bin-hadoop3/
URLs¶
Download locations will vary over time. Here are URLs for the software versions tested with Apache Spark v4.0.0:
https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
https://download.oracle.com/java/21/latest/jdk-21_linux-x64_bin.tar.gz
https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
https://downloads.apache.org/hive/hive-4.0.1/apache-hive-4.0.1-bin.tar.gz
https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
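As a minimal sketch, the software can be fetched and unpacked into the shared location with standard tools. The commands below assume wget and tar are available on a node with internet access and use the example directory /datasets/images/apache_spark; adjust both for your site.

$ cd /datasets/images/apache_spark
$ wget https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
$ wget https://download.oracle.com/java/21/latest/jdk-21_linux-x64_bin.tar.gz
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
$ wget https://downloads.apache.org/hive/hive-4.0.1/apache-hive-4.0.1-bin.tar.gz
$ wget https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
$ tar -xzf spark-4.0.0-bin-hadoop3.tgz
$ tar -xzf jdk-21_linux-x64_bin.tar.gz
$ tar -xzf hadoop-3.4.1.tar.gz
# The Hive tarball and the PostgreSQL jar are left as-is, matching the example layout above.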
sparkctl configuration file¶
Given this filesystem layout, the following command creates a default sparkctl configuration file:
$ sparkctl default-config \
/datasets/images/apache_spark/spark-4.0.0-bin-hadoop3 \
/datasets/images/apache_spark/jdk-21.0.7 \
--hadoop-path /datasets/images/apache_spark/hadoop-3.4.1 \
--hive-tarball /datasets/images/apache_spark/apache-hive-4.0.1-bin.tar.gz \
--postgresql-jar-file /datasets/images/apache_spark/postgresql-42.7.4.jar \
--compute-environment slurm