Apache Spark Backend¶
Download Spark from https://spark.apache.org/downloads.html and install it. Spark provides startup scripts for UNIX operating systems (not Windows).
Install chronify with Spark support¶
$ pip install chronify[spark]
Installation on a development computer¶
Installation can be as simple as extracting the downloaded archive and setting SPARK_HOME:
$ tar -xzf spark-3.5.4-bin-hadoop3.tgz
$ export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3
Start a Thrift server. This allows JDBC clients to send SQL queries to Spark. The command below attaches the server to a standalone cluster master running on the current host; omit the --master argument to run against an in-process Spark cluster in local mode instead.
$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077
The URL to connect to this server is hive://localhost:10000/default
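To confirm that the server is reachable, one option is the beeline client that ships with Spark (a quick check against the default port of 10000):
$ $SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default -e "SHOW TABLES;"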
Installation on an HPC¶
The chronify development team uses these scripts to run Spark on NREL’s HPC.
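Those scripts are tailored to NREL's systems. As a rough sketch of the same pattern, a standalone cluster inside a single-node job allocation can be stood up with the scripts that ship with Spark (workers on additional nodes would point at the master node's hostname):
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/sbin/start-worker.sh spark://$(hostname):7077
$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077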
Chronify Usage¶
This example creates a chronify Store with Spark as the backend and then registers a view backed by a Parquet file. Chronify runs its normal time checks on the view.
First, create the Parquet file and chronify schema.
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

from chronify import DatetimeRange, Store, TableSchema, CsvTableSchema

# Build one year of hourly timestamps (2020 is a leap year, so 8784 values).
initial_time = datetime(2020, 1, 1)
end_time = datetime(2020, 12, 31, 23)
resolution = timedelta(hours=1)
timestamps = pd.date_range(initial_time, end_time, freq=resolution, unit="us")

# Create three time arrays, one per device id, filled with random values.
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(
        {
            "timestamp": timestamps,
            "id": i,
            "value": np.random.random(len(timestamps)),
        }
    )
    dfs.append(df)

df = pd.concat(dfs)
df.to_parquet("data.parquet", index=False)

# Describe the table so that chronify can validate its time dimension.
schema = TableSchema(
    name="devices",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)
from chronify import Store

# Connect to the Thrift server started above and register the Parquet file
# as a view described by the schema. Chronify runs its time checks here.
store = Store.create_new_hive_store("hive://localhost:10000/default")
store.create_view_from_parquet(schema, "data.parquet")
Verify the data:
store.read_table(schema.name).head()
timestamp id value
0 2020-01-01 00:00:00 1 0.785399
1 2020-01-01 01:00:00 1 0.102756
2 2020-01-01 02:00:00 1 0.178587
3 2020-01-01 03:00:00 1 0.326194
4 2020-01-01 04:00:00 1 0.994851
Time configuration mapping¶
The primary use case for Spark is mapping datasets that are too large for DuckDB to process on a single computer. In such a workflow the user would call
store.map_table_time_config(src_table_name, dst_schema, output_file="mapped_data.parquet")
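Continuing from the example above, the call might look like the following sketch. The destination schema here is a placeholder that mirrors the source; in a real workflow its time configuration would differ from the source's, which is what makes the mapping useful.
# Hypothetical destination schema; in practice its time_config would differ
# from the source table's (for example, a different time representation).
dst_schema = TableSchema(
    name="devices_mapped",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)

# Write the mapped result to a Parquet file instead of materializing it in Spark.
store.map_table_time_config("devices", dst_schema, output_file="mapped_data.parquet")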