Explanation

Spark Cluster Overview

This Spark documentation page gives an overview of how Spark operates.

Cluster Mode

sparkctl always configures Spark clusters in standalone mode. Because sparkctl expects clusters to be ephemeral, the greater sophistication of the YARN and Kubernetes cluster managers is not required.
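A standalone cluster exposes a spark://HOST:PORT master URL that applications connect to. The following is a minimal sketch, not sparkctl output; "cluster-head-node" is a placeholder hostname and 7077 is Spark's default standalone master port:

from pyspark.sql import SparkSession

# Point the application at the standalone master; the hostname is a
# placeholder for whichever node is running the master process.
spark = (
    SparkSession.builder
    .master("spark://cluster-head-node:7077")
    .appName("standalone-example")
    .getOrCreate()
)

# Run a trivial job to confirm the connection, then shut down cleanly.
print(spark.range(10).count())
spark.stop()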

Submitting Applications

Please refer to this documentation page for Spark’s guidance on submitting applications.

To get all submission tools in a Python environment, install pyspark as follows:

$ pip install pyspark
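Installing pyspark also puts the spark-submit script on your PATH. As a hedged sketch, assuming a standalone master reachable at a placeholder hostname on Spark's default port 7077, submitting a Python application looks like this (my_app.py is a placeholder script name):

$ spark-submit --master spark://cluster-head-node:7077 my_app.py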

Clients for other languages are available at the main Spark downloads page.

Spark Connect

Spark Connect is a relatively new feature that simplifies client installation and configuration. Please refer to Spark’s documentation for details. If you want to configure and start a Spark cluster and then connect to it, all within one Python session, Spark Connect is the recommended workflow.

Note that there are some caveats listed here.

You enable the Spark Connect server with this sparkctl command:

$ sparkctl configure --connect-server

To install only the Python client:

$ pip install pyspark-client
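With the Connect server enabled and the client installed, a Python session connects through the sc:// URL scheme. This is a minimal sketch; it assumes the server is reachable on the local machine at Spark Connect's default port, 15002:

from pyspark.sql import SparkSession

# "sc://localhost:15002" assumes a locally running Connect server on the
# default port; replace the host with the node where the server was started.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Run a trivial query to verify the connection, then close it.
spark.range(5).show()
spark.stop()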