Data Processors

Transform Functions

r2x_core.processors.apply_transformation(data_file, data)[source]

Apply appropriate transformation based on data type.

Parameters:
  • data_file (DataFile) – Configuration with transformation instructions.

  • data (Any) – Raw data to transform.

Returns:

Transformed data.

Return type:

Any
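
Because dispatch is based on the data type, a Polars LazyFrame is expected to be routed to transform_tabular_data and a dict to transform_json_data. A minimal sketch, assuming data_file is an already-configured DataFile as in the usage examples below:

import polars as pl
from r2x_core.processors import apply_transformation

lf = pl.scan_csv("data/generators.csv")
table = apply_transformation(data_file, lf)         # expected to route to transform_tabular_data
record = apply_transformation(data_file, {"a": 1})  # expected to route to transform_json_data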

r2x_core.processors.register_transformation(data_types, func)[source]

Register a custom transformation function.

Parameters:
  • data_types (type or tuple of types) – Data type(s) this function can handle.

  • func (TransformFunction) – Function that takes (data, data_file) and returns transformed data.

Return type:

None

Examples

>>> def transform_my_data(data: MyType, data_file: DataFile) -> MyType:
...     # Custom transformation logic
...     return data
>>> register_transformation(MyType, transform_my_data)

r2x_core.processors.transform_tabular_data(data_file, data)[source]

Transform tabular data to LazyFrame with applied transformations.

Applies transformations in order: lowercase → drop → rename → pivot → schema → filter → select.

Parameters:
  • data_file (DataFile) – Configuration with transformation instructions.

  • data (pl.LazyFrame) – Input tabular data.

Returns:

Transformed lazy frame.

Return type:

pl.LazyFrame

Notes

Always returns a LazyFrame for consistent lazy evaluation.

r2x_core.processors.transform_json_data(data_file, data)[source]

Transform JSON/dict data using functional pipeline.

Applies transformations in order: rename → filter → select.

Parameters:
  • data_file (DataFile) – Configuration with transformation instructions.

  • data (dict) – Input JSON/dict data.

Returns:

Transformed dictionary.

Return type:

dict
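
A minimal sketch of calling the JSON pipeline directly. It assumes the same DataFile fields shown in the tabular usage example (column_mapping for renaming, filters for filtering) also drive the JSON steps; the file path and expected output are illustrative:

from r2x_core import DataFile
from r2x_core.processors import transform_json_data

data_file = DataFile(
    name="settings",
    filepath="data/settings.json",
    column_mapping={"gen_id": "id", "gen_name": "name"},
)

raw = {"gen_id": "g1", "gen_name": "CT-1", "capacity": 100.0}
renamed = transform_json_data(data_file, raw)
# Expected under the assumptions above: {"id": "g1", "name": "CT-1", "capacity": 100.0}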

Tabular Processors

r2x_core.processors.pl_lowercase(data_file, df)[source]

Convert all string columns to lowercase.

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_drop_columns(data_file, df)[source]

Drop specified columns if they exist.

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_rename_columns(data_file, df)[source]

Rename columns based on mapping.

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_cast_schema(data_file, df)[source]

Cast columns to specified data types.

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_apply_filters(data_file, df)[source]

Apply row filters.

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_select_columns(data_file, df)[source]

Select specific columns (index + value columns).

Parameters:
  • data_file (DataFile)

  • df (LazyFrame)

Return type:

LazyFrame

r2x_core.processors.pl_build_filter_expr(column, value)[source]

Build polars filter expression.

Parameters:
  • column (str)

  • value (Any)

Return type:

Expr
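
The tabular helpers can also be called individually. A short sketch, assuming each helper reads its options (drop_columns, column_mapping, filters) from the DataFile, as in the usage examples below:

import polars as pl
from r2x_core import DataFile
from r2x_core.processors import pl_drop_columns, pl_rename_columns, pl_apply_filters

data_file = DataFile(
    name="generators",
    filepath="data/generators.csv",
    drop_columns=["old_col"],
    column_mapping={"gen_id": "id"},
    filters={"year": 2030},
)

lf = pl.scan_csv("data/generators.csv")
lf = pl_drop_columns(data_file, lf)
lf = pl_rename_columns(data_file, lf)
lf = pl_apply_filters(data_file, lf)
result = lf.collect()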

JSON Processors

r2x_core.processors.json_rename_keys(data_file, data)[source]

Rename keys based on column mapping.

Parameters:
  • data_file (DataFile)

  • data (dict[str, Any])

Return type:

dict[str, Any]

r2x_core.processors.json_apply_filters(data_file, data)[source]

Filter JSON data by key-value pairs.

Parameters:
  • data_file (DataFile)

  • data (dict[str, Any])

Return type:

dict[str, Any]

r2x_core.processors.json_select_keys(data_file, data)[source]

Select specific keys from JSON data.

Parameters:
  • data_file (DataFile)

  • data (dict[str, Any])

Return type:

dict[str, Any]
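
These helpers follow the same (data_file, data) convention and can be chained by hand; a brief sketch under the same assumptions as the JSON example above:

from r2x_core.processors import json_rename_keys, json_apply_filters, json_select_keys

data = {"gen_id": "g1", "year": 2030}
data = json_rename_keys(data_file, data)
data = json_apply_filters(data_file, data)
data = json_select_keys(data_file, data)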

Usage Examples

Automatic Transformation

Transformations are applied automatically by DataReader:

from r2x_core import DataReader, DataFile

# Define data file with transformations
data_file = DataFile(
    name="generators",
    filepath="data/generators.csv",
    lowercase=True,
    drop_columns=["old_col"],
    column_mapping={"gen_id": "id", "gen_name": "name"},
    schema={"capacity": "Float64", "year": "Int64"},
    filters={"year": 2030},
    value_columns=["capacity", "name"],
)

# Transformations applied automatically
reader = DataReader()
data = reader.read_data_file(folder=".", data_file=data_file)
# Returns transformed LazyFrame with lowercase, dropped columns, renamed, cast, filtered, and selected

Manual Transformation

from r2x_core.processors import transform_tabular_data
import polars as pl

# Load raw data
df = pl.scan_csv("data/generators.csv")

# Apply transformations manually, reusing the data_file defined above
transformed = transform_tabular_data(data_file, df)

# Collect results
result = transformed.collect()

Custom Transformation

Register a custom transformation for a new data type:

from r2x_core.processors import register_transformation
from r2x_core import DataFile

class MyDataType:
    def __init__(self, data):
        self.data = data

def transform_my_data(data: MyDataType, data_file: DataFile) -> MyDataType:
    """Custom transformation for MyDataType."""
    # Custom logic: uppercase the wrapped string
    transformed_data = data.data.upper()
    return MyDataType(transformed_data)

# Register the transformation
register_transformation(MyDataType, transform_my_data)

# Now apply_transformation will use it automatically
from r2x_core.processors import apply_transformation

my_data = MyDataType("hello")
data_file = DataFile(name="my_data", filepath="data/my_data.json")  # minimal config for this example
transformed = apply_transformation(data_file, my_data)

Polars Filter Expressions

Build custom filter expressions:

from r2x_core.processors import pl_build_filter_expr
import polars as pl

# Simple value filter
expr1 = pl_build_filter_expr("year", 2030)
# Returns: pl.col("year") == 2030

# List filter (IN)
expr2 = pl_build_filter_expr("status", ["active", "planned"])
# Returns: pl.col("status").is_in(["active", "planned"])

# Datetime year filter
expr3 = pl_build_filter_expr("datetime", 2030)
# Returns: pl.col("datetime").dt.year() == 2030

# Datetime year list filter
expr4 = pl_build_filter_expr("datetime", [2030, 2035, 2040])
# Returns: pl.col("datetime").dt.year().is_in([2030, 2035, 2040])

# Apply filters to dataframe
df = pl.scan_csv("data.csv")
filtered = df.filter(expr1 & expr2)

Type System

The processors use Polars type strings for schema casting:

# Schema mapping in DataFile
schema = {
    "capacity": "Float64",      # Float
    "year": "Int64",            # Integer
    "name": "Utf8",             # String
    "active": "Boolean",        # Boolean
    "date": "Date",             # Date
    "datetime": "Datetime",     # Datetime
}

Supported Polars types include:

  • Numeric: Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64

  • String: Utf8, Categorical

  • Boolean: Boolean

  • Temporal: Date, Datetime, Duration, Time

  • Complex: List, Struct
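
These strings correspond to Polars dtypes. The equivalent plain-Polars cast looks roughly like the snippet below (a sketch for orientation, not the library's internal implementation):

import polars as pl

lf = pl.scan_csv("data/generators.csv")
lf = lf.cast({"capacity": pl.Float64, "year": pl.Int64, "name": pl.Utf8})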

Functional Design

The processors module uses functional programming patterns:

  • Pure functions: All transformations are side-effect free

  • Partial application: Bind DataFile to create reusable transforms

  • Function composition: Pipeline multiple transformations (see the composition sketch below)

  • Single dispatch: Automatic selection based on type

from functools import partial
from r2x_core.processors import pl_lowercase, pl_drop_columns

# Bind the DataFile configuration to create a reusable transformation
lowercase_transform = partial(pl_lowercase, data_file)

# Apply to multiple dataframes
df1_transformed = lowercase_transform(df1)
df2_transformed = lowercase_transform(df2)
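
Composition works the same way: chain several bound processors into one pipeline. The make_pipeline helper below is illustrative, not part of the library, and data_file and df1 are assumed to be defined as in the snippet above:

from functools import reduce
from r2x_core.processors import pl_lowercase, pl_drop_columns, pl_rename_columns

def make_pipeline(data_file, *steps):
    """Compose (data_file, df) processors into a single df -> df function."""
    def run(df):
        return reduce(lambda acc, step: step(data_file, acc), steps, df)
    return run

pipeline = make_pipeline(data_file, pl_lowercase, pl_drop_columns, pl_rename_columns)
df1_transformed = pipeline(df1)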

See Also