Apache Spark Backend

Download Spark from https://spark.apache.org/downloads.html and install it. Spark provides startup scripts for UNIX operating systems (not Windows).

Install chronify with Spark support

$ pip install chronify[spark]

Installation on a development computer

Installation can be as simple as

$ tar -xzf spark-3.5.4-bin-hadoop3.tgz
$ export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3

Start a Thrift server. This allows JDBC clients to send SQL queries to Spark. The command below attaches the server to a standalone cluster master on the current host (started separately with $SPARK_HOME/sbin/start-master.sh and start-worker.sh); on a development computer you can instead omit the --master option to run Spark in local mode inside the Thrift server process.

$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077

The URL to connect to this server is hive://localhost:10000/default (the same URL passed to Store.create_new_hive_store in the example below).
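
To confirm that the server is reachable, you can open a connection to that URL directly. This is a minimal sketch, assuming SQLAlchemy and the PyHive hive dialect (which the hive:// URL format relies on) are available in your environment; it is not part of the chronify API.

from sqlalchemy import create_engine, text

# Connect to the Thrift server started above and list the tables it exposes.
engine = create_engine("hive://localhost:10000/default")
with engine.connect() as conn:
    print(conn.execute(text("SHOW TABLES")).fetchall())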

Installation on an HPC

The chronify development team uses these scripts to run Spark on NREL’s HPC.

Chronify Usage

This example creates a chronify Store with Spark as the backend and then adds a view of a Parquet file to the store. Chronify will run its normal time checks.

First, create the Parquet file and chronify schema.

from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from chronify import DatetimeRange, TableSchema

initial_time = datetime(2020, 1, 1)
end_time = datetime(2020, 12, 31, 23)
resolution = timedelta(hours=1)
timestamps = pd.date_range(initial_time, end_time, freq=resolution, unit="us")
# Build one year of hourly values for three device ids.
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(
        {
            "timestamp": timestamps,
            "id": i,
            "value": np.random.random(len(timestamps)),
        }
    )
    dfs.append(df)
df = pd.concat(dfs)
df.to_parquet("data.parquet", index=False)
# Describe the table for chronify: the value column, the time configuration,
# and the columns that identify each time array.
schema = TableSchema(
    name="devices",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)
from chronify import Store

store = Store.create_new_hive_store("hive://localhost:10000/default")
# Create a view of the Parquet file using the schema defined above ("devices").
store.create_view_from_parquet(schema, "data.parquet")

Verify the data:

store.read_table(schema.name).head()
            timestamp  id     value
0 2020-01-01 00:00:00   1  0.785399
1 2020-01-01 01:00:00   1  0.102756
2 2020-01-01 02:00:00   1  0.178587
3 2020-01-01 03:00:00   1  0.326194
4 2020-01-01 04:00:00   1  0.994851
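
As an optional cross-check (a sketch, not part of the documented workflow), compare the round-tripped table with the Parquet file written earlier. This assumes read_table returns a pandas-compatible DataFrame, as the output above suggests.

import pandas as pd

# Read the table back through Spark and the original file through pandas, then
# confirm that the row counts match. Row ordering is not guaranteed, so sort
# both frames before attempting any element-wise comparison.
round_trip = store.read_table(schema.name)
expected = pd.read_parquet("data.parquet")
assert len(round_trip) == len(expected)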

Time configuration mapping

The primary use case for Spark is to map datasets that are too large to be processed by DuckDB on one computer. In such a workflow a user would call map_table_time_config with the name of the source table and a schema describing the destination time configuration, for example

store.map_table_time_config(src_table_name, dst_schema, output_file="mapped_data.parquet")
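
Assuming output_file writes the mapped result to the given Parquet path, as the example implies, you can inspect the output afterward with pandas. This is a short sketch using the file name from the call above; depending on how Spark writes the output, the path may be a directory of part files, which pandas can still read as a single dataset.

import pandas as pd

# "mapped_data.parquet" is the output_file passed to map_table_time_config above.
mapped = pd.read_parquet("mapped_data.parquet")
print(mapped.head())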