Apache Spark Backend

Download Spark from https://spark.apache.org/downloads.html and install it. Spark provides startup scripts for UNIX operating systems (not Windows).

Install chronify with Spark support

$ pip install chronify[spark]

Installation on a development computer

Installation can be as simple as

$ tar -xzf spark-3.5.4-bin-hadoop3.tgz
$ export SPARK_HOME=$(pwd)/spark-3.5.4-bin-hadoop3

Start a Thrift server. This allows JDBC clients to send SQL queries to Spark. The command below attaches the server to a standalone cluster master on the current host (started separately with $SPARK_HOME/sbin/start-master.sh and start-worker.sh); on a development computer you can instead omit the --master option to run Spark in local mode inside the Thrift server process.

$ $SPARK_HOME/sbin/start-thriftserver.sh --master=spark://$(hostname):7077

The URL to connect to this server is hive://localhost:10000/default (the same URL passed to Store.create_new_hive_store in the example below).
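
To confirm that the server is reachable, you can open a connection to that URL directly. This is a minimal sketch, assuming SQLAlchemy and the PyHive hive dialect (which the hive:// URL format relies on) are available in your environment; it is not part of the chronify API.

from sqlalchemy import create_engine, text

# Connect to the Thrift server started above and list the tables it exposes.
engine = create_engine("hive://localhost:10000/default")
with engine.connect() as conn:
    print(conn.execute(text("SHOW TABLES")).fetchall())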

Installation on an HPC

The chronify development team uses these scripts to run Spark on NREL’s HPC.

Chronify Usage

This example creates a chronify Store with Spark as the backend and then adds a view of a Parquet file to the store. Chronify will run its normal time checks.

First, create the Parquet file and chronify schema.

from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from chronify import DatetimeRange, TableSchema

initial_time = datetime(2020, 1, 1)
end_time = datetime(2020, 12, 31, 23)
resolution = timedelta(hours=1)
timestamps = pd.date_range(initial_time, end_time, freq=resolution, unit="us")
# Build one year of hourly values for three device ids.
dfs = []
for i in range(1, 4):
    df = pd.DataFrame(
        {
            "timestamp": timestamps,
            "id": i,
            "value": np.random.random(len(timestamps)),
        }
    )
    dfs.append(df)
df = pd.concat(dfs)
df.to_parquet("data.parquet", index=False)
# Describe the table for chronify: the value column, the time configuration,
# and the columns that identify each time array.
schema = TableSchema(
    name="devices",
    value_column="value",
    time_config=DatetimeRange(
        time_column="timestamp",
        start=initial_time,
        length=len(timestamps),
        resolution=resolution,
    ),
    time_array_id_columns=["id"],
)
from chronify import Store

store = Store.create_new_hive_store("hive://localhost:10000/default")
# Create a view of the Parquet file using the schema defined above ("devices").
store.create_view_from_parquet(schema, "data.parquet")

Verify the data:

store.read_table(schema.name).head()
            timestamp  id     value
0 2020-01-01 00:00:00   1  0.785399
1 2020-01-01 01:00:00   1  0.102756
2 2020-01-01 02:00:00   1  0.178587
3 2020-01-01 03:00:00   1  0.326194
4 2020-01-01 04:00:00   1  0.994851
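
As an optional cross-check (a sketch, not part of the documented workflow), compare the round-tripped table with the Parquet file written earlier. This assumes read_table returns a pandas-compatible DataFrame, as the output above suggests.

import pandas as pd

# Read the table back through Spark and the original file through pandas, then
# confirm that the row counts match. Row ordering is not guaranteed, so sort
# both frames before attempting any element-wise comparison.
round_trip = store.read_table(schema.name)
expected = pd.read_parquet("data.parquet")
assert len(round_trip) == len(expected)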

Time configuration mapping

The primary use case for Spark is to map datasets that are too large to be processed by DuckDB on one computer. In such a workflow a user would call map_table_time_config with the name of the source table and a schema describing the destination time configuration, for example

store.map_table_time_config(src_table_name, dst_schema, output_file="mapped_data.parquet")
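
Assuming output_file writes the mapped result to the given Parquet path, as the example implies, you can inspect the output afterward with pandas. This is a short sketch using the file name from the call above; depending on how Spark writes the output, the path may be a directory of part files, which pandas can still read as a single dataset.

import pandas as pd

# "mapped_data.parquet" is the output_file passed to map_table_time_config above.
mapped = pd.read_parquet("mapped_data.parquet")
print(mapped.head())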