How to configure a Hive metastore
Spark supports reading and writing data from Apache Hive, as described in the Spark documentation on Hive tables. This is useful if you want to access data in SQL tables through a JDBC/ODBC client rather than reading Parquet files through Python/R or spark-sql. It also provides access to Spark from SQLAlchemy through PyHive.
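For example, once the Thrift server is running, you can query the metastore from Python with PyHive. The following is a minimal sketch; the hostname localhost, port 10000 (the conventional Thrift/HiveServer2 default), and the default database name are assumptions that may differ in your setup:

from pyhive import hive

# Connect to the Spark Thrift server over the Hive protocol.
# Host and port are assumptions; adjust them to match your allocation.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())

# The same connection can be made through SQLAlchemy, which PyHive
# exposes via the "hive://" dialect.
from sqlalchemy import create_engine
engine = create_engine("hive://localhost:10000/default")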
You can configure Spark with a Hive Metastore by running this command:
$ sparkctl configure --thrift-server --hive-metastore --metastore-dir /path/to/my/metastore
By default, Spark uses Apache Derby as the database for the metastore. Derby has a limitation: only one client can connect to the metastore at a time.
If you need multiple simultaneous connections to the metastore, you can use PostgreSQL as the backend instead by running the following command:
$ sparkctl configure --thrift-server --hive-metastore --postgres-hive-metastore --metastore-dir /path/to/my/metastore
The first start takes a few extra minutes because the PostgreSQL container has to be downloaded and the server started. Apptainer caches the container image, and the database data can be reused across Slurm allocations.
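Once the metastore is configured, a PySpark session can use it by enabling Hive support. The sketch below assumes the configuration written by sparkctl already points Spark at the metastore; the application name and table name are illustrative:

from pyspark.sql import SparkSession

# enableHiveSupport() makes the session register tables in the Hive
# metastore instead of keeping them in a session-local catalog.
spark = (
    SparkSession.builder
    .appName("hive-metastore-example")  # illustrative name
    .enableHiveSupport()
    .getOrCreate()
)

# Create a managed table in the metastore and list what is registered.
spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING) USING parquet")
spark.sql("SHOW TABLES").show()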
Note: The metadata about your tables will be stored in Derby or Postgres. The tables themselves will be stored on the filesystem (Parquet files by default) in a directory called spark_warehouse, which is created in the directory passed to --metastore-dir (the current directory by default). Postgres data, if enabled, will be in the same directory (pg_data).
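To illustrate the layout, saving a DataFrame as a managed table writes its data beneath spark_warehouse while the table definition goes into the metastore. This assumes a session created with Hive support as sketched above; the table name people is made up:

# Assumes `spark` was created with enableHiveSupport() as shown earlier.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# The Parquet files for this managed table end up under
# <metastore-dir>/spark_warehouse/people/.
df.write.mode("overwrite").saveAsTable("people")

spark.sql("SELECT * FROM people").show()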