# Apache Spark Backend

Download Spark from https://spark.apache.org/downloads.html and install it. Spark provides
startup scripts for UNIX operating systems (not Windows).

## Install chronify with Spark support

```
$ pip install chronify[spark]
```

## Installation on a development computer

Installation can be as simple as

```
$ tar -xzf spark-3.5.4-bin-hadoop3.tgz
$ export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3
```

Start a Thrift server. This allows JDBC clients to send SQL queries to an in-process Spark
cluster running in local mode.

```
$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077
```

The URL to connect to this server is `hive://localhost:10000/default`.

## Installation on an HPC

The chronify development team uses these
[scripts](https://github.com/NREL/HPC/tree/master/applications/spark) to run Spark on NREL's HPC.

## Chronify usage

This example creates a chronify Store with Spark as the backend and then adds a view to a
Parquet file. Chronify will run its normal time checks.

First, create the Parquet file and chronify schema.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

from chronify import DatetimeRange, Store, TableSchema

initial_time = datetime(2020, 1, 1)
end_time = datetime(2020, 12, 31, 23)
resolution = timedelta(hours=1)
timestamps = pd.date_range(initial_time, end_time, freq=resolution, unit="us")
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(
        {
            "timestamp": timestamps,
            "id": i,
            "value": np.random.random(len(timestamps)),
        }
    )
    dfs.append(df)

df = pd.concat(dfs)
df.to_parquet("data.parquet", index=False)

schema = TableSchema(
    name="devices",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)
```

Then create the store and add the view to the Parquet file.

```python
from chronify import Store

store = Store.create_new_hive_store("hive://localhost:10000/default")
store.create_view_from_parquet("data.parquet")
```

Verify the data:

```python
store.read_table(schema.name).head()
```

```
             timestamp  id     value
0  2020-01-01 00:00:00   1  0.785399
1  2020-01-01 01:00:00   1  0.102756
2  2020-01-01 02:00:00   1  0.178587
3  2020-01-01 03:00:00   1  0.326194
4  2020-01-01 04:00:00   1  0.994851
```

## Time configuration mapping

The primary use case for Spark is to map datasets that are larger than can be processed by
DuckDB on one computer. In such a workflow, a user would call

```python
store.map_table_time_config(src_table_name, dst_schema, output_file="mapped_data.parquet")
```
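
The sketch below shows how such a call might look end to end, continuing from the example
above. It is a minimal illustration under stated assumptions, not chronify's documented
workflow: the `devices_mapped` schema name, the time-zone-aware destination `DatetimeRange`,
and the assumption that this particular datetime-to-datetime mapping is supported are all
illustrative choices. Only `Store.create_new_hive_store`, `TableSchema`, `DatetimeRange`, and
the `map_table_time_config` call shown above come from this document.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

from chronify import DatetimeRange, Store, TableSchema

# Reconnect to the Thrift server started earlier; "devices" is the view
# registered from data.parquet in the example above.
store = Store.create_new_hive_store("hive://localhost:10000/default")

# Hypothetical destination schema: same id/value layout as "devices", but with
# time-zone-aware timestamps. Whether this exact mapping is supported depends
# on the time configurations chronify implements.
dst_schema = TableSchema(
    name="devices_mapped",  # illustrative name, not part of the original example
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=datetime(2020, 1, 1, tzinfo=ZoneInfo("America/Denver")),  # assumed
        length=8784,  # hourly timestamps for leap year 2020, as in the example above
        resolution=timedelta(hours=1),
    ),
    time_array_id_columns=["id"],
)

# Spark performs the mapping and the result is written to a Parquet file
# rather than materialized in the warehouse.
store.map_table_time_config("devices", dst_schema, output_file="mapped_data.parquet")
```

The mapped result can then be inspected with any Parquet reader, for example
`pd.read_parquet("mapped_data.parquet")`.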