Run jobs interactively with sparkctl

In this tutorial you will learn how to start a Spark cluster on HPC compute nodes and then run Spark jobs interactively against the Spark Connect Server through pyspark-client.

Unlike the tutorials that start Spark from the command line, this workflow uses sparkctl, which sets the required environment variables for you.

  1. Allocate compute nodes, such as with Slurm. This example requests 4 CPUs and 30 GB of memory for the Spark master process, the user application, and the Spark driver, plus 2 complete nodes for the Spark workers.

    $ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N2 --account=<your-account> --mem=240G
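
    Once the allocation is granted, you can confirm which nodes you received with standard Slurm commands; the exact output depends on your site's configuration.

    $ squeue --me
    $ scontrol show job "$SLURM_JOB_ID"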
    
  2. Activate the Python environment that contains sparkctl.

    $ module load python
    $ source ~/python-envs/sparkctl/bin/activate
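
    To verify that the environment is active and the package is importable, a quick check (assuming the environment was created as a standard Python venv):

    $ python -c "import sparkctl; print(sparkctl.__file__)"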
    
  3. Configure and start the Spark cluster.

    Note

    This workflow requires that you enable the Spark Connect Server.

    from sparkctl import ClusterManager, make_default_spark_config
    
    # This loads your global sparkctl configuration file (~/.sparkctl.toml).
    config = make_default_spark_config()
    config.runtime.start_connect_server = True
    # Set other options as desired.
    mgr = ClusterManager(config)
    mgr.configure()
    
    2025-07-12 13:00:24.327 | INFO     | sparkctl.cluster_manager:_add_spark_settings_to_defaults_file:281 - Set driver memory to 10 GB
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:_config_executors:352 - Configured Spark to start 2 executors
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:_config_executors:353 - Set spark.sql.shuffle.partitions=10 and spark.executor.memory=2g
    2025-07-12 13:00:24.328 | INFO     | sparkctl.cluster_manager:configure:100 - Configured Spark workers to use /scratch/dthom/sparkctl/spark_scratch for shuffle data.
    2025-07-12 13:00:24.329 | INFO     | sparkctl.cluster_manager:_write_workers:456 - Wrote worker 1 to /scratch/dthom/sparkctl/conf/workers
    2025-07-12 13:00:24.329 | INFO     | sparkctl.cluster_manager:configure:108 - Wrote sparkctl configuration to /scratch/dthom/repos/sparkctl/config.json
    
    mgr.start()
    
    starting org.apache.spark.deploy.master.Master, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.master.Master-1-dthom-39537s.out
    2025-07-12 13:00:32.052 | INFO     | sparkctl.cluster_manager:_start:176 - Started Spark master processes on dthom-39537s
    starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /scratch/dthom/repos/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.sql.connect.service.SparkConnectServer-1-dthom-39537s.out
    2025-07-12 13:00:34.764 | INFO     | sparkctl.cluster_manager:_start:181 - Started Spark connect server
    starting org.apache.spark.deploy.worker.Worker, logging to /scratch/dthom/sparkctl/spark_scratch/logs/spark-dthom-org.apache.spark.deploy.worker.Worker-1-dthom-39537s.out
    2025-07-12 13:00:37.648 | INFO     | sparkctl.cluster_manager:_start:200 - Spark worker memory = 4 GB
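
    Because the Spark Connect Server is running, you can also attach to the cluster from a separate Python process with PySpark's Spark Connect client. This is a hedged sketch: the <master-node> placeholder and port 15002 (Spark Connect's default) are assumptions; use the endpoint reported by your deployment.

    from pyspark.sql import SparkSession

    # Connect to the Spark Connect Server started by sparkctl.
    # Replace <master-node> with the node running the master/connect server;
    # 15002 is the default Spark Connect port and may differ at your site.
    spark = SparkSession.builder.remote("sc://<master-node>:15002").getOrCreate()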
    
  4. Run a Spark job.

    spark = mgr.get_spark_session()
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a","b"])
    df.show(n=5)
    
    +---+---+
    |  a|  b|
    +---+---+
    |  0|  1|
    |  1|  2|
    |  2|  3|
    |  3|  4|
    |  4|  5|
    +---+---+
    only showing top 5 rows 
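
    The session returned by mgr.get_spark_session() supports the full DataFrame API. For example, a small aggregation on the DataFrame created above (standard PySpark; output omitted):

    from pyspark.sql import functions as F

    # Bucket rows by the parity of column "a" and average column "b" per bucket.
    df.withColumn("parity", F.col("a") % 2) \
        .groupBy("parity") \
        .agg(F.avg("b").alias("avg_b")) \
        .show()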
    
  5. Shut down the cluster.

    mgr.stop()
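
Putting the steps together, here is a minimal end-to-end sketch that uses only the calls shown in this tutorial; the try/finally block ensures the Spark processes are stopped even if the job itself fails.

from sparkctl import ClusterManager, make_default_spark_config

config = make_default_spark_config()
config.runtime.start_connect_server = True

mgr = ClusterManager(config)
mgr.configure()
mgr.start()
try:
    spark = mgr.get_spark_session()
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show(n=5)
finally:
    # Always shut down the cluster, even if the job above raises an exception.
    mgr.stop()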