sparkctl API

sparkctl.config.make_default_spark_config() → SparkConfig

Return a SparkConfig created from the user’s config file.
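
Examples

A minimal usage sketch; the value shown assumes the stock default of 5 executor cores and may differ if your config file overrides it.

>>> from sparkctl import make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.executor_cores
5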

class sparkctl.cluster_manager.ClusterManager(config: SparkConfig, status: StatusTracker | None = None)

Manages operation of the Spark cluster.

classmethod from_config(config: SparkConfig) → Self

Create a ClusterManager from a config instance.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.start_connect_server = True
>>> mgr = ClusterManager.from_config(config)

See also

from_config_file

classmethod from_config_file(config_file: Path | str | None = None) → Self

Create a ClusterManager from a config file. If config_file is None, use the default config file (e.g., ~/.sparkctl.toml).

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file(config_file="config.json")

See also

from_config

classmethod load(directory: Path | str | None = None) → Self

Load an active cluster manager from a directory containing a previously-created sparkctl config.

Parameters:

directory – Directory containing the sparkctl configuration files. Defaults to the current directory.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.load()
>>> mgr = ClusterManager.load(directory="path/to/sparkctl/config")

See also

from_config

clean() → None

Delete all Spark runtime files in the directory.
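
Examples

An illustrative sequence; it assumes a cluster was previously configured in the current directory and has already been stopped.

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.load()
>>> mgr.clean()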

configure() → None

Configure a Spark cluster based on the input parameters.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()

get_spark_session() → SparkSession

Return a SparkSession for the current cluster.

Examples

>>> spark = mgr.get_spark_session()
>>> spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).show()

set_workers(workers: list[str]) → None

Set the workers for the cluster. Must be called after configure() and before start().

Parameters:

workers – Worker node names or IP addresses; these will be used as SSH targets.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> mgr = ClusterManager.from_config(make_default_spark_config())
>>> mgr.configure()
>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.start()

get_workers() → list[str]

Return the current worker node names.
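
Examples

Illustrative; the worker names are placeholders.

>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.get_workers()
['worker1', 'worker2']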

start(print_env_paths: bool = True) → None

Start the Spark cluster. The caller must have called configure() beforehand.

The environment variables SPARK_CONF_DIR and JAVA_HOME are set to correct values for the current process.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()

managed_cluster() → Generator[SparkSession, None, None]

Configure and start the Spark cluster, yield a SparkSession inside a context manager, and stop the cluster on exit.

The environment variables SPARK_CONF_DIR and JAVA_HOME are set to correct values for the current process while the context is active and cleared when complete.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> with mgr.managed_cluster() as spark:
...     df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
...     df.show()

stop() → None

Stop all Spark processes.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()
>>> mgr.stop()
pydantic model sparkctl.models.SparkConfig

Contains all Spark configuration parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SparkConfig",
   "description": "Contains all Spark configuration parameters.",
   "type": "object",
   "properties": {
      "binaries": {
         "$ref": "#/$defs/BinaryLocations"
      },
      "runtime": {
         "$ref": "#/$defs/SparkRuntimeParams",
         "default": {
            "executor_cores": 5,
            "executor_memory_gb": null,
            "driver_memory_gb": 10,
            "node_memory_overhead_gb": 10,
            "use_local_storage": false,
            "start_connect_server": false,
            "start_history_server": false,
            "start_thrift_server": false,
            "spark_log_level": null,
            "enable_dynamic_allocation": false,
            "shuffle_partition_multiplier": 1,
            "enable_hive_metastore": false,
            "enable_postgres_hive_metastore": false,
            "postgres_password": "b3812f74-0d88-4465-a581-be14d5354753",
            "python_path": null,
            "spark_defaults_template_file": null
         }
      },
      "directories": {
         "$ref": "#/$defs/RuntimeDirectories",
         "default": {
            "base": "/home/runner/work/sparkctl/sparkctl/docs",
            "spark_scratch": "/home/runner/work/sparkctl/sparkctl/docs/spark_scratch",
            "metastore_dir": "/home/runner/work/sparkctl/sparkctl/docs"
         }
      },
      "compute": {
         "$ref": "#/$defs/ComputeParams",
         "default": {
            "environment": "slurm",
            "postgres": {
               "setup_metastore": "postgres/setup_metastore.sh",
               "start_container": "postgres/start_container.sh",
               "stop_container": "postgres/stop_container.sh"
            }
         }
      },
      "resource_monitor": {
         "$ref": "#/$defs/ResourceMonitorConfig",
         "default": {
            "cpu": true,
            "disk": true,
            "memory": true,
            "network": true,
            "interval": 5,
            "enabled": false
         }
      },
      "app": {
         "$ref": "#/$defs/AppParams",
         "default": {
            "console_level": "INFO",
            "file_level": "DEBUG",
            "reraise_exceptions": false
         }
      }
   },
   "$defs": {
      "AppParams": {
         "additionalProperties": false,
         "properties": {
            "console_level": {
               "default": "INFO",
               "description": "Console log level",
               "title": "Console Level",
               "type": "string"
            },
            "file_level": {
               "default": "DEBUG",
               "description": "File log level",
               "title": "File Level",
               "type": "string"
            },
            "reraise_exceptions": {
               "default": false,
               "description": "Reraise sparkctl exceptions in the CLI handler. Not recommended for users. Useful for developers when debugging issues.",
               "title": "Reraise Exceptions",
               "type": "boolean"
            }
         },
         "title": "AppParams",
         "type": "object"
      },
      "BinaryLocations": {
         "additionalProperties": false,
         "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
         "properties": {
            "spark_path": {
               "description": "Path to the Spark binaries.",
               "format": "path",
               "title": "Spark Path",
               "type": "string"
            },
            "java_path": {
               "description": "Path to the Java binaries.",
               "format": "path",
               "title": "Java Path",
               "type": "string"
            },
            "hadoop_path": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hadoop binaries.",
               "title": "Hadoop Path"
            },
            "hive_tarball": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hive binaries.",
               "title": "Hive Tarball"
            },
            "postgresql_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the PostgreSQL jar file.",
               "title": "Postgresql Jar File"
            }
         },
         "required": [
            "spark_path",
            "java_path"
         ],
         "title": "BinaryLocations",
         "type": "object"
      },
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "ComputeParams": {
         "additionalProperties": false,
         "properties": {
            "environment": {
               "$ref": "#/$defs/ComputeEnvironment",
               "default": "slurm"
            },
            "postgres": {
               "$ref": "#/$defs/PostgresScripts",
               "default": {
                  "start_container": "postgres/start_container.sh",
                  "stop_container": "postgres/stop_container.sh",
                  "setup_metastore": "postgres/setup_metastore.sh"
               }
            }
         },
         "title": "ComputeParams",
         "type": "object"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      },
      "ResourceMonitorConfig": {
         "additionalProperties": false,
         "description": "Defines the resource stats to monitor.",
         "properties": {
            "cpu": {
               "default": true,
               "description": "Monitor CPU utilization",
               "title": "Cpu",
               "type": "boolean"
            },
            "disk": {
               "default": true,
               "description": "Monitor disk/storage utilization",
               "title": "Disk",
               "type": "boolean"
            },
            "memory": {
               "default": true,
               "description": "Monitor memory utilization",
               "title": "Memory",
               "type": "boolean"
            },
            "network": {
               "default": true,
               "description": "Monitor network utilization",
               "title": "Network",
               "type": "boolean"
            },
            "interval": {
               "default": 5,
               "description": "Interval in seconds on which to collect stats",
               "title": "Interval",
               "type": "integer"
            },
            "enabled": {
               "default": false,
               "description": "Enable resource monitoring.",
               "title": "Enabled",
               "type": "boolean"
            }
         },
         "title": "ResourceMonitorConfig",
         "type": "object"
      },
      "RuntimeDirectories": {
         "additionalProperties": false,
         "description": "Defines the directories to be used by a Spark cluster.",
         "properties": {
            "base": {
               "default": ".",
               "description": "Base directory for the cluster configuration",
               "format": "path",
               "title": "Base",
               "type": "string"
            },
            "spark_scratch": {
               "default": "spark_scratch",
               "description": "Directory to use for shuffle data.",
               "format": "path",
               "title": "Spark Scratch",
               "type": "string"
            },
            "metastore_dir": {
               "default": ".",
               "description": "Set a custom directory for the metastore and warehouse.",
               "format": "path",
               "title": "Metastore Dir",
               "type": "string"
            }
         },
         "title": "RuntimeDirectories",
         "type": "object"
      },
      "SparkRuntimeParams": {
         "additionalProperties": false,
         "description": "Controls Spark runtime parameters.",
         "properties": {
            "executor_cores": {
               "default": 5,
               "description": "Number of cores per executor",
               "title": "Executor Cores",
               "type": "integer"
            },
            "executor_memory_gb": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
               "title": "Executor Memory Gb"
            },
            "driver_memory_gb": {
               "default": 10,
               "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
               "title": "Driver Memory Gb",
               "type": "integer"
            },
            "node_memory_overhead_gb": {
               "default": 10,
               "description": "Memory to reserve for system processes.",
               "title": "Node Memory Overhead Gb",
               "type": "integer"
            },
            "use_local_storage": {
               "default": false,
               "description": "Use compute node local storage for shuffle data.",
               "title": "Use Local Storage",
               "type": "boolean"
            },
            "start_connect_server": {
               "default": false,
               "description": "Enable the Spark connect server.",
               "title": "Start Connect Server",
               "type": "boolean"
            },
            "start_history_server": {
               "default": false,
               "description": "Enable the Spark history server.",
               "title": "Start History Server",
               "type": "boolean"
            },
            "start_thrift_server": {
               "default": false,
               "description": "Enable the Thrift server to connect a SQL client.",
               "title": "Start Thrift Server",
               "type": "boolean"
            },
            "spark_log_level": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
               "title": "Spark Log Level"
            },
            "enable_dynamic_allocation": {
               "default": false,
               "description": "Enable Spark dynamic resource allocation.",
               "title": "Enable Dynamic Allocation",
               "type": "boolean"
            },
            "shuffle_partition_multiplier": {
               "default": 1,
               "description": "Spark SQL shuffle partition multiplier (multipy by the number of worker CPUs)",
               "title": "Shuffle Partition Multiplier",
               "type": "integer"
            },
            "enable_hive_metastore": {
               "default": false,
               "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
               "title": "Enable Hive Metastore",
               "type": "boolean"
            },
            "enable_postgres_hive_metastore": {
               "default": false,
               "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
               "title": "Enable Postgres Hive Metastore",
               "type": "boolean"
            },
            "postgres_password": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Password for PostgreSQL.",
               "title": "Postgres Password"
            },
            "python_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
               "title": "Python Path"
            },
            "spark_defaults_template_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
               "title": "Spark Defaults Template File"
            }
         },
         "title": "SparkRuntimeParams",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "binaries"
   ]
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field app: AppParams = AppParams(console_level='INFO', file_level='DEBUG', reraise_exceptions=False)
field binaries: BinaryLocations [Required]
field compute: ComputeParams = ComputeParams(environment=<ComputeEnvironment.SLURM: 'slurm'>, postgres=PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh'))
field directories: RuntimeDirectories = RuntimeDirectories(base=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'), spark_scratch=PosixPath('/home/runner/work/sparkctl/sparkctl/docs/spark_scratch'), metastore_dir=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'))
field resource_monitor: ResourceMonitorConfig = ResourceMonitorConfig(cpu=True, disk=True, memory=True, network=True, interval=5, enabled=False)
field runtime: SparkRuntimeParams = SparkRuntimeParams(executor_cores=5, executor_memory_gb=None, driver_memory_gb=10, node_memory_overhead_gb=10, use_local_storage=False, start_connect_server=False, start_history_server=False, start_thrift_server=False, spark_log_level=None, enable_dynamic_allocation=False, shuffle_partition_multiplier=1, enable_hive_metastore=False, enable_postgres_hive_metastore=False, postgres_password='b3812f74-0d88-4465-a581-be14d5354753', python_path=None, spark_defaults_template_file=None)
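
Examples

A minimal construction sketch; only binaries is required, and the paths are placeholders for a real installation.

>>> from sparkctl.models import SparkConfig, BinaryLocations
>>> config = SparkConfig(
...     binaries=BinaryLocations(
...         spark_path="/opt/spark",
...         java_path="/usr/lib/jvm/java-17",
...     )
... )
>>> config.runtime.start_history_server = True
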
pydantic model sparkctl.models.BinaryLocations

Locations of the Spark binaries and dependent software. Hadoop, Hive, and the PostgreSQL jar file are only required if the user wants to enable a Postgres-based Hive metastore.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "BinaryLocations",
   "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
   "type": "object",
   "properties": {
      "spark_path": {
         "description": "Path to the Spark binaries.",
         "format": "path",
         "title": "Spark Path",
         "type": "string"
      },
      "java_path": {
         "description": "Path to the Java binaries.",
         "format": "path",
         "title": "Java Path",
         "type": "string"
      },
      "hadoop_path": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hadoop binaries.",
         "title": "Hadoop Path"
      },
      "hive_tarball": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hive binaries.",
         "title": "Hive Tarball"
      },
      "postgresql_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the PostgreSQL jar file.",
         "title": "Postgresql Jar File"
      }
   },
   "additionalProperties": false,
   "required": [
      "spark_path",
      "java_path"
   ]
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field hadoop_path: Path | None = None

Path to the Hadoop binaries.

Validated by: make_absolute
field hive_tarball: Path | None = None

Path to the Hive binaries.

Validated by: make_absolute
field java_path: Path [Required]

Path to the Java binaries.

Validated by: make_absolute
field postgresql_jar_file: Path | None = None

Path to the PostgreSQL jar file.

Validated by: make_absolute
field spark_path: Path [Required]

Path to the Spark binaries.

Validated by: make_absolute
validator make_absolute  »  java_path, hadoop_path, postgresql_jar_file, spark_path, hive_tarball
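
Examples

A construction sketch; the paths are placeholders. The make_absolute validator above suggests that relative paths are resolved to absolute paths.

>>> from sparkctl.models import BinaryLocations
>>> binaries = BinaryLocations(
...     spark_path="/opt/spark",
...     java_path="/usr/lib/jvm/java-17",
...     postgresql_jar_file="/opt/jars/postgresql.jar",
... )
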
pydantic model sparkctl.models.ComputeParams

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "ComputeParams",
   "type": "object",
   "properties": {
      "environment": {
         "$ref": "#/$defs/ComputeEnvironment",
         "default": "slurm"
      },
      "postgres": {
         "$ref": "#/$defs/PostgresScripts",
         "default": {
            "start_container": "postgres/start_container.sh",
            "stop_container": "postgres/stop_container.sh",
            "setup_metastore": "postgres/setup_metastore.sh"
         }
      }
   },
   "$defs": {
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field environment: ComputeEnvironment = ComputeEnvironment.SLURM
field postgres: PostgresScripts = PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh')
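
Examples

A construction sketch; the string "native" and the nested dict are coerced by pydantic into ComputeEnvironment and PostgresScripts values.

>>> from sparkctl.models import ComputeParams
>>> compute = ComputeParams(
...     environment="native",
...     postgres={"setup_metastore": "postgres/setup_metastore.sh"},
... )
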
pydantic model sparkctl.models.SparkRuntimeParams

Controls Spark runtime parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SparkRuntimeParams",
   "description": "Controls Spark runtime parameters.",
   "type": "object",
   "properties": {
      "executor_cores": {
         "default": 5,
         "description": "Number of cores per executor",
         "title": "Executor Cores",
         "type": "integer"
      },
      "executor_memory_gb": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
         "title": "Executor Memory Gb"
      },
      "driver_memory_gb": {
         "default": 10,
         "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
         "title": "Driver Memory Gb",
         "type": "integer"
      },
      "node_memory_overhead_gb": {
         "default": 10,
         "description": "Memory to reserve for system processes.",
         "title": "Node Memory Overhead Gb",
         "type": "integer"
      },
      "use_local_storage": {
         "default": false,
         "description": "Use compute node local storage for shuffle data.",
         "title": "Use Local Storage",
         "type": "boolean"
      },
      "start_connect_server": {
         "default": false,
         "description": "Enable the Spark connect server.",
         "title": "Start Connect Server",
         "type": "boolean"
      },
      "start_history_server": {
         "default": false,
         "description": "Enable the Spark history server.",
         "title": "Start History Server",
         "type": "boolean"
      },
      "start_thrift_server": {
         "default": false,
         "description": "Enable the Thrift server to connect a SQL client.",
         "title": "Start Thrift Server",
         "type": "boolean"
      },
      "spark_log_level": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
         "title": "Spark Log Level"
      },
      "enable_dynamic_allocation": {
         "default": false,
         "description": "Enable Spark dynamic resource allocation.",
         "title": "Enable Dynamic Allocation",
         "type": "boolean"
      },
      "shuffle_partition_multiplier": {
         "default": 1,
         "description": "Spark SQL shuffle partition multiplier (multipy by the number of worker CPUs)",
         "title": "Shuffle Partition Multiplier",
         "type": "integer"
      },
      "enable_hive_metastore": {
         "default": false,
         "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
         "title": "Enable Hive Metastore",
         "type": "boolean"
      },
      "enable_postgres_hive_metastore": {
         "default": false,
         "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
         "title": "Enable Postgres Hive Metastore",
         "type": "boolean"
      },
      "postgres_password": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Password for PostgreSQL.",
         "title": "Postgres Password"
      },
      "python_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
         "title": "Python Path"
      },
      "spark_defaults_template_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
         "title": "Spark Defaults Template File"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field driver_memory_gb: int = 10

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

field enable_dynamic_allocation: bool = False

Enable Spark dynamic resource allocation.

field enable_hive_metastore: bool = False

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

field enable_postgres_hive_metastore: bool = False

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

field executor_cores: int = 5

Number of cores per executor

field executor_memory_gb: int | None = None

Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.

field node_memory_overhead_gb: int = 10

Memory to reserve for system processes.

field postgres_password: str | None = None

Password for PostgreSQL.

Validated by: set_postgres_password
field python_path: str | None = None

Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

field shuffle_partition_multiplier: int = 1

Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs)

field spark_defaults_template_file: Path | None = None

Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

field spark_log_level: str | None = None

Set the root log level for all Spark processes. Defaults to Spark’s defaults.

field start_connect_server: bool = False

Enable the Spark connect server.

field start_history_server: bool = False

Enable the Spark history server.

field start_thrift_server: bool = False

Enable the Thrift server to connect a SQL client.

field use_local_storage: bool = False

Use compute node local storage for shuffle data.

validator set_postgres_password  »  postgres_password
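
Examples

A construction sketch with a few commonly tuned fields; the values are illustrative only.

>>> from sparkctl.models import SparkRuntimeParams
>>> runtime = SparkRuntimeParams(
...     executor_cores=4,
...     driver_memory_gb=20,
...     use_local_storage=True,
...     start_history_server=True,
... )
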
pydantic model sparkctl.models.RuntimeDirectories

Defines the directories to be used by a Spark cluster.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "RuntimeDirectories",
   "description": "Defines the directories to be used by a Spark cluster.",
   "type": "object",
   "properties": {
      "base": {
         "default": ".",
         "description": "Base directory for the cluster configuration",
         "format": "path",
         "title": "Base",
         "type": "string"
      },
      "spark_scratch": {
         "default": "spark_scratch",
         "description": "Directory to use for shuffle data.",
         "format": "path",
         "title": "Spark Scratch",
         "type": "string"
      },
      "metastore_dir": {
         "default": ".",
         "description": "Set a custom directory for the metastore and warehouse.",
         "format": "path",
         "title": "Metastore Dir",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field base: Path = PosixPath('.')

Base directory for the cluster configuration

Validated by: make_absolute
field metastore_dir: Path = PosixPath('.')

Set a custom directory for the metastore and warehouse.

Validated by: make_absolute
field spark_scratch: Path = PosixPath('spark_scratch')

Directory to use for shuffle data.

Validated by: make_absolute
validator make_absolute  »  spark_scratch, metastore_dir, base
clean_spark_conf_dir() → Path

Ensure that the Spark conf dir exists and is clean.

get_events_dir() → Path

Return the Spark events directory.

get_hive_site_file() → Path

Return the file path to hive-site.xml.

get_spark_conf_dir() → Path

Return the Spark conf directory.

get_spark_defaults_file() → Path

Return the file path to spark-defaults.conf.

get_spark_env_file() → Path

Return the file path to spark-env.sh.

get_spark_log_file() → Path

Return the file path to the log properties file.

get_workers_file() → Path

Return the file path to the workers file.
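
Examples

A construction sketch; the paths are placeholders, and the call assumes the path helpers documented above are methods of this model.

>>> from sparkctl.models import RuntimeDirectories
>>> dirs = RuntimeDirectories(
...     base="/projects/my_spark_run",
...     spark_scratch="/scratch/spark",
... )
>>> conf_dir = dirs.get_spark_conf_dir()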

class sparkctl.models.ComputeEnvironment(*values)

Defines the supported compute environments.
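
Examples

An illustrative lookup; the member values match the enum listed in the JSON schema above.

>>> from sparkctl.models import ComputeEnvironment
>>> ComputeEnvironment("slurm")
<ComputeEnvironment.SLURM: 'slurm'>
>>> [e.value for e in ComputeEnvironment]
['native', 'slurm']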