How to run Spark jobs in Python
These instructions assume that you have allocated compute nodes and started a Spark cluster.
For all steps below, set the environment variable SPARK_CONF_DIR, pointing to your configuration directory.
$ export SPARK_CONF_DIR=$(pwd)/conf
Interactive session with pyspark
Start a Spark process.
If using pyspark-client (via Spark Connect Server):
$ python
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
If using pyspark:
$ pyspark --master spark://$(hostname):7077
Optional: check your environment to ensure that all configuration settings are correct. Most importantly, ensure that you are connected to the Spark cluster master and are not running in local mode. pyspark prints the connection information during startup. For example:
Spark context available as 'sc' (master = spark://r2i7n35:7077, app id = app-20221202224041-0000).
You can dump all configuration settings with this command:
>>> spark.sparkContext.getConf().getAll()
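For example, to confirm that the session is attached to the cluster master rather than running in local mode, you can inspect the master URL directly. This is a minimal sketch; the hostname shown is only an illustration and will match your own compute node:
>>> spark.sparkContext.master
'spark://r2i7n35:7077'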
Create or load dataframes with the global SparkSession variable named spark:
>>> df = spark.read.parquet("my_data.parquet")
>>> df.show()
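If you do not have a parquet file at hand, a small in-memory DataFrame is enough to verify that the cluster executes work. This is a minimal sketch; the column names and values are arbitrary:
>>> df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
>>> df.groupBy("label").count().show()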
Jupyter notebook
The steps below allow you to connect to the Spark cluster from a Jupyter notebook.
Start the notebook server on the compute node.
$ jupyter notebook --no-browser --port=8889 --ip=0.0.0.0
The Jupyter process will print a URL to the terminal. You can access it from your laptop after you forward the port through an ssh tunnel.
$ ssh -L 8889:<your-compute-node-name>:8889 $USER@kestrel.hpc.nrel.gov
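Inside the notebook, connect to the cluster the same way as in an interactive session. The first cell might look like the sketch below, assuming pyspark is installed in the notebook's kernel; the application name is arbitrary, and <your-compute-node-name> is the node running the Spark master:
from pyspark.sql import SparkSession

# Connect to the cluster master rather than running in local mode.
spark = (
    SparkSession.builder
    .master("spark://<your-compute-node-name>:7077")
    .appName("notebook_session")
    .getOrCreate()
)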
Script execution with spark-submit
Warning
spark-submit is included with the pyspark package but not with pyspark-client.
Set the environment variable SPARK_CONF_DIR, pointing to your configuration directory.
$ export SPARK_CONF_DIR=$(pwd)/conf
Start a Spark process.
$ spark-submit --master spark://$(hostname):7077 my_job.py
where my_job.py contains this code to connect to Spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my_app").getOrCreate()
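For reference, a slightly fuller my_job.py might look like the sketch below; the parquet path and the row-count workload are placeholders rather than part of the original example:
from pyspark.sql import SparkSession

# spark-submit supplies the master URL given on the command line.
spark = SparkSession.builder.appName("my_app").getOrCreate()

# Hypothetical workload: read a parquet file and print a row count.
df = spark.read.parquet("my_data.parquet")
print(df.count())

# Shut the application down cleanly.
spark.stop()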