sparkctl API

sparkctl.config.make_default_spark_config() → SparkConfig

Return a SparkConfig created from the user’s config file.
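
Examples

A minimal usage sketch; the value shown assumes the stock default of 5 executor cores and may differ if your config file overrides it.

>>> from sparkctl import make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.executor_cores
5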

class sparkctl.cluster_manager.ClusterManager(config: SparkConfig, status: StatusTracker | None = None)

Manages operation of the Spark cluster.

classmethod from_config(config: SparkConfig) → Self

Create a ClusterManager from a config instance.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.start_connect_server = True
>>> mgr = ClusterManager.from_config(config)

See also

from_config_file

classmethod from_config_file(config_file: Path | str | None = None) → Self

Create a ClusterManager from a config file. If config_file is None, use the default config file (e.g., ~/.sparkctl.toml).

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file(config_file="config.json")

See also

from_config

classmethod load(directory: Path | str | None = None) → Self

Load an active cluster manager from a directory containing a previously-created sparkctl config.

Parameters:

directory – Directory containing the sparkctl configuration files. Defaults to the current directory.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.load()
>>> mgr = ClusterManager.load(directory="path/to/sparkctl/config")

See also

from_config

clean() → None

Delete all Spark runtime files in the directory.
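
Examples

An illustrative sequence; it assumes a cluster was previously configured in the current directory and has already been stopped.

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.load()
>>> mgr.clean()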

configure() → None

Configure a Spark cluster based on the input parameters.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()

get_spark_session() → SparkSession

Return a SparkSession for the current cluster.

Examples

>>> spark = mgr.get_spark_session()
>>> spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).show()

set_workers(workers: list[str]) → None

Set the workers for the cluster. Must be called after configure() and before start().

Parameters:

workers – Worker node names or IP addresses; these will be used as SSH targets.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> mgr = ClusterManager.from_config(make_default_spark_config())
>>> mgr.configure()
>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.start()

get_workers() → list[str]

Return the current worker node names.
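
Examples

Illustrative; the worker names are placeholders.

>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.get_workers()
['worker1', 'worker2']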

start(print_env_paths: bool = True) → None

Start the Spark cluster. The caller must have called configure() beforehand.

The environment variables SPARK_CONF_DIR and JAVA_HOME are set to correct values for the current process.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()

managed_cluster() → Generator[SparkSession, None, None]

Configure and start the Spark cluster, yield a SparkSession inside a context manager, and stop the cluster on exit.

The environment variables SPARK_CONF_DIR and JAVA_HOME are set to correct values for the current process while the context is active and cleared when complete.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> with mgr.managed_cluster() as spark:
...     df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
...     df.show()

stop() → None

Stop all Spark processes.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()
>>> mgr.stop()
pydantic model sparkctl.models.SparkConfig

Contains all Spark configuration parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SparkConfig",
   "description": "Contains all Spark configuration parameters.",
   "type": "object",
   "properties": {
      "binaries": {
         "$ref": "#/$defs/BinaryLocations"
      },
      "runtime": {
         "$ref": "#/$defs/SparkRuntimeParams",
         "default": {
            "executor_cores": 5,
            "executor_memory_gb": null,
            "driver_memory_gb": 10,
            "node_memory_overhead_gb": 10,
            "use_local_storage": false,
            "start_connect_server": false,
            "start_history_server": false,
            "start_thrift_server": false,
            "spark_log_level": null,
            "enable_dynamic_allocation": false,
            "shuffle_partition_multiplier": 1,
            "enable_hive_metastore": false,
            "enable_postgres_hive_metastore": false,
            "postgres_password": "b3812f74-0d88-4465-a581-be14d5354753",
            "python_path": null,
            "spark_defaults_template_file": null
         }
      },
      "directories": {
         "$ref": "#/$defs/RuntimeDirectories",
         "default": {
            "base": "/home/runner/work/sparkctl/sparkctl/docs",
            "spark_scratch": "/home/runner/work/sparkctl/sparkctl/docs/spark_scratch",
            "metastore_dir": "/home/runner/work/sparkctl/sparkctl/docs"
         }
      },
      "compute": {
         "$ref": "#/$defs/ComputeParams",
         "default": {
            "environment": "slurm",
            "postgres": {
               "setup_metastore": "postgres/setup_metastore.sh",
               "start_container": "postgres/start_container.sh",
               "stop_container": "postgres/stop_container.sh"
            }
         }
      },
      "resource_monitor": {
         "$ref": "#/$defs/ResourceMonitorConfig",
         "default": {
            "cpu": true,
            "disk": true,
            "memory": true,
            "network": true,
            "interval": 5,
            "enabled": false
         }
      },
      "app": {
         "$ref": "#/$defs/AppParams",
         "default": {
            "console_level": "INFO",
            "file_level": "DEBUG",
            "reraise_exceptions": false
         }
      }
   },
   "$defs": {
      "AppParams": {
         "additionalProperties": false,
         "properties": {
            "console_level": {
               "default": "INFO",
               "description": "Console log level",
               "title": "Console Level",
               "type": "string"
            },
            "file_level": {
               "default": "DEBUG",
               "description": "File log level",
               "title": "File Level",
               "type": "string"
            },
            "reraise_exceptions": {
               "default": false,
               "description": "Reraise sparkctl exceptions in the CLI handler. Not recommended for users. Useful for developers when debugging issues.",
               "title": "Reraise Exceptions",
               "type": "boolean"
            }
         },
         "title": "AppParams",
         "type": "object"
      },
      "BinaryLocations": {
         "additionalProperties": false,
         "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
         "properties": {
            "spark_path": {
               "description": "Path to the Spark binaries.",
               "format": "path",
               "title": "Spark Path",
               "type": "string"
            },
            "java_path": {
               "description": "Path to the Java binaries.",
               "format": "path",
               "title": "Java Path",
               "type": "string"
            },
            "hadoop_path": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hadoop binaries.",
               "title": "Hadoop Path"
            },
            "hive_tarball": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hive binaries.",
               "title": "Hive Tarball"
            },
            "postgresql_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the PostgreSQL jar file.",
               "title": "Postgresql Jar File"
            }
         },
         "required": [
            "spark_path",
            "java_path"
         ],
         "title": "BinaryLocations",
         "type": "object"
      },
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "ComputeParams": {
         "additionalProperties": false,
         "properties": {
            "environment": {
               "$ref": "#/$defs/ComputeEnvironment",
               "default": "slurm"
            },
            "postgres": {
               "$ref": "#/$defs/PostgresScripts",
               "default": {
                  "start_container": "postgres/start_container.sh",
                  "stop_container": "postgres/stop_container.sh",
                  "setup_metastore": "postgres/setup_metastore.sh"
               }
            }
         },
         "title": "ComputeParams",
         "type": "object"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      },
      "ResourceMonitorConfig": {
         "additionalProperties": false,
         "description": "Defines the resource stats to monitor.",
         "properties": {
            "cpu": {
               "default": true,
               "description": "Monitor CPU utilization",
               "title": "Cpu",
               "type": "boolean"
            },
            "disk": {
               "default": true,
               "description": "Monitor disk/storage utilization",
               "title": "Disk",
               "type": "boolean"
            },
            "memory": {
               "default": true,
               "description": "Monitor memory utilization",
               "title": "Memory",
               "type": "boolean"
            },
            "network": {
               "default": true,
               "description": "Monitor network utilization",
               "title": "Network",
               "type": "boolean"
            },
            "interval": {
               "default": 5,
               "description": "Interval in seconds on which to collect stats",
               "title": "Interval",
               "type": "integer"
            },
            "enabled": {
               "default": false,
               "description": "Enable resource monitoring.",
               "title": "Enabled",
               "type": "boolean"
            }
         },
         "title": "ResourceMonitorConfig",
         "type": "object"
      },
      "RuntimeDirectories": {
         "additionalProperties": false,
         "description": "Defines the directories to be used by a Spark cluster.",
         "properties": {
            "base": {
               "default": ".",
               "description": "Base directory for the cluster configuration",
               "format": "path",
               "title": "Base",
               "type": "string"
            },
            "spark_scratch": {
               "default": "spark_scratch",
               "description": "Directory to use for shuffle data.",
               "format": "path",
               "title": "Spark Scratch",
               "type": "string"
            },
            "metastore_dir": {
               "default": ".",
               "description": "Set a custom directory for the metastore and warehouse.",
               "format": "path",
               "title": "Metastore Dir",
               "type": "string"
            }
         },
         "title": "RuntimeDirectories",
         "type": "object"
      },
      "SparkRuntimeParams": {
         "additionalProperties": false,
         "description": "Controls Spark runtime parameters.",
         "properties": {
            "executor_cores": {
               "default": 5,
               "description": "Number of cores per executor",
               "title": "Executor Cores",
               "type": "integer"
            },
            "executor_memory_gb": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
               "title": "Executor Memory Gb"
            },
            "driver_memory_gb": {
               "default": 10,
               "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
               "title": "Driver Memory Gb",
               "type": "integer"
            },
            "node_memory_overhead_gb": {
               "default": 10,
               "description": "Memory to reserve for system processes.",
               "title": "Node Memory Overhead Gb",
               "type": "integer"
            },
            "use_local_storage": {
               "default": false,
               "description": "Use compute node local storage for shuffle data.",
               "title": "Use Local Storage",
               "type": "boolean"
            },
            "start_connect_server": {
               "default": false,
               "description": "Enable the Spark connect server.",
               "title": "Start Connect Server",
               "type": "boolean"
            },
            "start_history_server": {
               "default": false,
               "description": "Enable the Spark history server.",
               "title": "Start History Server",
               "type": "boolean"
            },
            "start_thrift_server": {
               "default": false,
               "description": "Enable the Thrift server to connect a SQL client.",
               "title": "Start Thrift Server",
               "type": "boolean"
            },
            "spark_log_level": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
               "title": "Spark Log Level"
            },
            "enable_dynamic_allocation": {
               "default": false,
               "description": "Enable Spark dynamic resource allocation.",
               "title": "Enable Dynamic Allocation",
               "type": "boolean"
            },
            "shuffle_partition_multiplier": {
               "default": 1,
               "description": "Spark SQL shuffle partition multiplier (multipy by the number of worker CPUs)",
               "title": "Shuffle Partition Multiplier",
               "type": "integer"
            },
            "enable_hive_metastore": {
               "default": false,
               "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
               "title": "Enable Hive Metastore",
               "type": "boolean"
            },
            "enable_postgres_hive_metastore": {
               "default": false,
               "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
               "title": "Enable Postgres Hive Metastore",
               "type": "boolean"
            },
            "postgres_password": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Password for PostgreSQL.",
               "title": "Postgres Password"
            },
            "python_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
               "title": "Python Path"
            },
            "spark_defaults_template_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
               "title": "Spark Defaults Template File"
            }
         },
         "title": "SparkRuntimeParams",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "binaries"
   ]
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field app: AppParams = AppParams(console_level='INFO', file_level='DEBUG', reraise_exceptions=False)
field binaries: BinaryLocations [Required]
field compute: ComputeParams = ComputeParams(environment=<ComputeEnvironment.SLURM: 'slurm'>, postgres=PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh'))
field directories: RuntimeDirectories = RuntimeDirectories(base=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'), spark_scratch=PosixPath('/home/runner/work/sparkctl/sparkctl/docs/spark_scratch'), metastore_dir=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'))
field resource_monitor: ResourceMonitorConfig = ResourceMonitorConfig(cpu=True, disk=True, memory=True, network=True, interval=5, enabled=False)
field runtime: SparkRuntimeParams = SparkRuntimeParams(executor_cores=5, executor_memory_gb=None, driver_memory_gb=10, node_memory_overhead_gb=10, use_local_storage=False, start_connect_server=False, start_history_server=False, start_thrift_server=False, spark_log_level=None, enable_dynamic_allocation=False, shuffle_partition_multiplier=1, enable_hive_metastore=False, enable_postgres_hive_metastore=False, postgres_password='b3812f74-0d88-4465-a581-be14d5354753', python_path=None, spark_defaults_template_file=None)
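
Examples

A minimal construction sketch; only binaries is required, and the paths are placeholders for a real installation.

>>> from sparkctl.models import SparkConfig, BinaryLocations
>>> config = SparkConfig(
...     binaries=BinaryLocations(
...         spark_path="/opt/spark",
...         java_path="/usr/lib/jvm/java-17",
...     )
... )
>>> config.runtime.start_history_server = True
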
pydantic model sparkctl.models.BinaryLocations

Locations of the Spark binaries and dependent software. Hadoop, Hive, and the PostgreSQL jar file are only required if the user wants to enable a Postgres-based Hive metastore.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "BinaryLocations",
   "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
   "type": "object",
   "properties": {
      "spark_path": {
         "description": "Path to the Spark binaries.",
         "format": "path",
         "title": "Spark Path",
         "type": "string"
      },
      "java_path": {
         "description": "Path to the Java binaries.",
         "format": "path",
         "title": "Java Path",
         "type": "string"
      },
      "hadoop_path": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hadoop binaries.",
         "title": "Hadoop Path"
      },
      "hive_tarball": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hive binaries.",
         "title": "Hive Tarball"
      },
      "postgresql_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the PostgreSQL jar file.",
         "title": "Postgresql Jar File"
      }
   },
   "additionalProperties": false,
   "required": [
      "spark_path",
      "java_path"
   ]
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field hadoop_path: Path | None = None

Path to the Hadoop binaries.

Validated by: make_absolute
field hive_tarball: Path | None = None

Path to the Hive binaries.

Validated by: make_absolute
field java_path: Path [Required]

Path to the Java binaries.

Validated by: make_absolute
field postgresql_jar_file: Path | None = None

Path to the PostgreSQL jar file.

Validated by: make_absolute
field spark_path: Path [Required]

Path to the Spark binaries.

Validated by: make_absolute
validator make_absolute  »  java_path, hadoop_path, postgresql_jar_file, spark_path, hive_tarball
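
Examples

A construction sketch; the paths are placeholders. The make_absolute validator above suggests that relative paths are resolved to absolute paths.

>>> from sparkctl.models import BinaryLocations
>>> binaries = BinaryLocations(
...     spark_path="/opt/spark",
...     java_path="/usr/lib/jvm/java-17",
...     postgresql_jar_file="/opt/jars/postgresql.jar",
... )
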
pydantic model sparkctl.models.ComputeParams

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "ComputeParams",
   "type": "object",
   "properties": {
      "environment": {
         "$ref": "#/$defs/ComputeEnvironment",
         "default": "slurm"
      },
      "postgres": {
         "$ref": "#/$defs/PostgresScripts",
         "default": {
            "start_container": "postgres/start_container.sh",
            "stop_container": "postgres/stop_container.sh",
            "setup_metastore": "postgres/setup_metastore.sh"
         }
      }
   },
   "$defs": {
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field environment: ComputeEnvironment = ComputeEnvironment.SLURM
field postgres: PostgresScripts = PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh')
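
Examples

A construction sketch; the string "native" and the nested dict are coerced by pydantic into ComputeEnvironment and PostgresScripts values.

>>> from sparkctl.models import ComputeParams
>>> compute = ComputeParams(
...     environment="native",
...     postgres={"setup_metastore": "postgres/setup_metastore.sh"},
... )
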
pydantic model sparkctl.models.SparkRuntimeParams

Controls Spark runtime parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SparkRuntimeParams",
   "description": "Controls Spark runtime parameters.",
   "type": "object",
   "properties": {
      "executor_cores": {
         "default": 5,
         "description": "Number of cores per executor",
         "title": "Executor Cores",
         "type": "integer"
      },
      "executor_memory_gb": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
         "title": "Executor Memory Gb"
      },
      "driver_memory_gb": {
         "default": 10,
         "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
         "title": "Driver Memory Gb",
         "type": "integer"
      },
      "node_memory_overhead_gb": {
         "default": 10,
         "description": "Memory to reserve for system processes.",
         "title": "Node Memory Overhead Gb",
         "type": "integer"
      },
      "use_local_storage": {
         "default": false,
         "description": "Use compute node local storage for shuffle data.",
         "title": "Use Local Storage",
         "type": "boolean"
      },
      "start_connect_server": {
         "default": false,
         "description": "Enable the Spark connect server.",
         "title": "Start Connect Server",
         "type": "boolean"
      },
      "start_history_server": {
         "default": false,
         "description": "Enable the Spark history server.",
         "title": "Start History Server",
         "type": "boolean"
      },
      "start_thrift_server": {
         "default": false,
         "description": "Enable the Thrift server to connect a SQL client.",
         "title": "Start Thrift Server",
         "type": "boolean"
      },
      "spark_log_level": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
         "title": "Spark Log Level"
      },
      "enable_dynamic_allocation": {
         "default": false,
         "description": "Enable Spark dynamic resource allocation.",
         "title": "Enable Dynamic Allocation",
         "type": "boolean"
      },
      "shuffle_partition_multiplier": {
         "default": 1,
         "description": "Spark SQL shuffle partition multiplier (multipy by the number of worker CPUs)",
         "title": "Shuffle Partition Multiplier",
         "type": "integer"
      },
      "enable_hive_metastore": {
         "default": false,
         "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
         "title": "Enable Hive Metastore",
         "type": "boolean"
      },
      "enable_postgres_hive_metastore": {
         "default": false,
         "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
         "title": "Enable Postgres Hive Metastore",
         "type": "boolean"
      },
      "postgres_password": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Password for PostgreSQL.",
         "title": "Postgres Password"
      },
      "python_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
         "title": "Python Path"
      },
      "spark_defaults_template_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
         "title": "Spark Defaults Template File"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field driver_memory_gb: int = 10

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

field enable_dynamic_allocation: bool = False

Enable Spark dynamic resource allocation.

field enable_hive_metastore: bool = False

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

field enable_postgres_hive_metastore: bool = False

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

field executor_cores: int = 5

Number of cores per executor

field executor_memory_gb: int | None = None

Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.

field node_memory_overhead_gb: int = 10

Memory to reserve for system processes.

field postgres_password: str | None = None

Password for PostgreSQL.

Validated by: set_postgres_password
field python_path: str | None = None

Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

field shuffle_partition_multiplier: int = 1

Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs)

field spark_defaults_template_file: Path | None = None

Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

field spark_log_level: str | None = None

Set the root log level for all Spark processes. Defaults to Spark’s defaults.

field start_connect_server: bool = False

Enable the Spark connect server.

field start_history_server: bool = False

Enable the Spark history server.

field start_thrift_server: bool = False

Enable the Thrift server to connect a SQL client.

field use_local_storage: bool = False

Use compute node local storage for shuffle data.

validator set_postgres_password  »  postgres_password
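
Examples

A construction sketch with a few commonly tuned fields; the values are illustrative only.

>>> from sparkctl.models import SparkRuntimeParams
>>> runtime = SparkRuntimeParams(
...     executor_cores=4,
...     driver_memory_gb=20,
...     use_local_storage=True,
...     start_history_server=True,
... )
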
pydantic model sparkctl.models.RuntimeDirectories

Defines the directories to be used by a Spark cluster.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "RuntimeDirectories",
   "description": "Defines the directories to be used by a Spark cluster.",
   "type": "object",
   "properties": {
      "base": {
         "default": ".",
         "description": "Base directory for the cluster configuration",
         "format": "path",
         "title": "Base",
         "type": "string"
      },
      "spark_scratch": {
         "default": "spark_scratch",
         "description": "Directory to use for shuffle data.",
         "format": "path",
         "title": "Spark Scratch",
         "type": "string"
      },
      "metastore_dir": {
         "default": ".",
         "description": "Set a custom directory for the metastore and warehouse.",
         "format": "path",
         "title": "Metastore Dir",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field base: Path = PosixPath('.')

Base directory for the cluster configuration

Validated by: make_absolute
field metastore_dir: Path = PosixPath('.')

Set a custom directory for the metastore and warehouse.

Validated by: make_absolute
field spark_scratch: Path = PosixPath('spark_scratch')

Directory to use for shuffle data.

Validated by: make_absolute
validator make_absolute  »  spark_scratch, metastore_dir, base
clean_spark_conf_dir() → Path

Ensure that the Spark conf dir exists and is clean.

get_events_dir() → Path

Return the Spark events directory.

get_hive_site_file() → Path

Return the file path to hive-site.xml.

get_spark_conf_dir() → Path

Return the Spark conf directory.

get_spark_defaults_file() → Path

Return the file path to spark-defaults.conf.

get_spark_env_file() → Path

Return the file path to spark-env.sh.

get_spark_log_file() → Path

Return the file path to the log properties file.

get_workers_file() → Path

Return the file path to the workers file.
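
Examples

A construction sketch; the paths are placeholders, and the call assumes the path helpers documented above are methods of this model.

>>> from sparkctl.models import RuntimeDirectories
>>> dirs = RuntimeDirectories(
...     base="/projects/my_spark_run",
...     spark_scratch="/scratch/spark",
... )
>>> conf_dir = dirs.get_spark_conf_dir()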

class sparkctl.models.ComputeEnvironment(*values)

Defines the supported compute environments.
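
Examples

An illustrative lookup; the member values match the enum listed in the JSON schema above.

>>> from sparkctl.models import ComputeEnvironment
>>> ComputeEnvironment("slurm")
<ComputeEnvironment.SLURM: 'slurm'>
>>> [e.value for e in ComputeEnvironment]
['native', 'slurm']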