Configuration Overview

YAML Structure

Recipe configurations consist of two main sections (with optional RAG rules):

Collections Section

The collections section defines grouped metric specifications to run:

collections:
  analysis_group:
    dataset: dataset_id
    metrics:
      - metric_type: metric_name
        data_format: record

Datasets Section

The datasets section defines data sources:

datasets:
  dataset_id:
    type: "csv"
    source: "path/to/file.csv"
    # Additional dataset configuration (options/schema)

Fan-out Expansion

Fan-out expansion allows you to create multiple metrics from a single configuration by providing lists for certain fields.

Rules for Fan-out

  1. Field Marking: Only fields marked as "fan-out" support list expansion
  2. Length Matching: All fan-out lists must have the same length
  3. Automatic Expansion: Each position in the lists creates one metric instance

Fan-out Fields

The following fields support fan-out expansion:

  • name: List of metric names
  • segment: List of segmentation configurations

Example: Basic Fan-out

collections:
  accuracy_metrics:
    metrics:
    - name:
      - overall_accuracy
      - product_accuracy
      segment:
      - null
      - - product_type
      data_format: record
      prob_def: probability
      default: default_flag
      metric_type: default_accuracy
    dataset: loan_data

This creates two metrics:

  1. overall_accuracy with no segmentation
  2. product_accuracy segmented by product_type
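
Conceptually, the expansion pairs entries positionally across the fan-out lists while the remaining fields are shared. The minimal Python sketch below illustrates the rule; it is not the library's implementation, and the field values are simply taken from the example above:

# Illustrative only: positional pairing of fan-out lists (not the library's code)
names = ["overall_accuracy", "product_accuracy"]
segments = [None, ["product_type"]]

shared = {
    "data_format": "record",
    "prob_def": "probability",
    "default": "default_flag",
    "metric_type": "default_accuracy",
}

# One metric instance per list position; all fan-out lists must share the same length
metric_instances = [{"name": n, "segment": s, **shared} for n, s in zip(names, segments)]

for metric in metric_instances:
    print(metric["name"], "->", metric["segment"])
# overall_accuracy -> None
# product_accuracy -> ['product_type']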

Example: Complex Fan-out

collections:
  comprehensive_analysis:
    metrics:
    - name:
      - total_auc
      - region_auc
      - product_region_auc
      segment:
      - null
      - - region
      - - product_type
        - region
      data_format: record
      prob_def: score
      default: default
      metric_type: auc
    dataset: model_data

This creates three AUC metrics with different segmentation strategies.

Segment Configuration

Segments define how to group data for analysis:

Segment Types

  • No Segmentation: null or omit the field
  • Single Column: ["column_name"]
  • Multiple Columns: ["col1", "col2", "col3"]

Example Segment Configurations

# No segmentation - analyze entire dataset
segment: null

# Single segmentation - group by product type
segment: ["product_type"]

# Multi-level segmentation - group by product and region
segment: ["product_type", "region"]

# Fan-out with different segmentation levels
segment: [null, ["region"], ["product_type", "region"]]
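
Segmentation conceptually corresponds to grouping the data by the listed columns before computing the metric within each group. The generic Polars sketch below illustrates the idea for segment: ["product_type", "region"]; it is not the library's implementation, and the file path, column names, and mean-based statistic are illustrative only:

import polars as pl

# Hypothetical input with the columns used in the examples above
lf = pl.scan_csv("data/loans.csv")

# segment: ["product_type", "region"] ~ group by those columns, then compute a statistic per group
segmented = (
    lf.group_by(["product_type", "region"])
    .agg(pl.col("default_flag").mean().alias("observed_default_rate"))  # illustrative statistic
)
print(segmented.collect())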

Dataset Configuration

Datasets define data sources that metrics can reference. The system uses Polars for efficient data processing and supports various file formats and custom data loaders.

Basic Dataset Structure

datasets:
  dataset_id:
    type: "csv"
    source: "path/to/file.csv"
    options: {}  # Optional: loader-specific kwargs
    schema: {}   # Optional: column schema overrides

source can also be a list of paths or a glob string (e.g., "data/*.parquet"). In Python, the field name is schema_ (with schema as the YAML alias).
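
For example, assuming the Dataset class import shown in the Custom Data Loaders section below, a glob or a list of paths can be passed as the source (the paths here are hypothetical):

from tnp_statistic_library._internal.datasets.datasets import Dataset

# Single glob source covering multiple Parquet files
scores = Dataset(type="parquet", source="data/scores/*.parquet")

# An explicit list of paths works the same way
history = Dataset(type="csv", source=["data/loans_2023.csv", "data/loans_2024.csv"])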

Built-in Loader Types

The system ships built-in loaders backed by Polars scan functions:

Extension   Format                   Polars Scanner
.csv        Comma-separated values   pl.scan_csv
.parquet    Apache Parquet           pl.scan_parquet
.ndjson     Newline-delimited JSON   pl.scan_ndjson
.ipc        Apache Arrow IPC         pl.scan_ipc
.feather    Feather format           pl.scan_ipc
.delta      Delta Lake               pl.scan_delta
.iceberg    Apache Iceberg           pl.scan_iceberg

Example Dataset Configurations

datasets:
  # CSV file
  loan_data:
    type: "csv"
    source: "data/loans.csv"

  # Parquet file
  model_scores:
    type: "parquet"
    source: "data/model_outputs.parquet"

  # Remote Parquet file (S3)
  external_data:
    type: "parquet"
    source: "s3://bucket/data/scores.parquet"

  # JSON lines format
  event_data:
    type: "ndjson"
    source: "logs/events.ndjson"

  # Delta Lake table
  warehouse_data:
    type: "delta"
    source: "warehouse/customer_data"

Custom Data Loaders

For unsupported formats or data sources, you can create custom data loader plugins. The library provides two approaches: a simplified registration system for interactive use, and the traditional hook-based system for distributed plugins.

The easiest way to create custom loaders is to use the registration system:

import polars as pl
from tnp_statistic_library.plugins import DatasetSpec, register_dataset_loader, reset_plugin_manager

# Reset for clean state
reset_plugin_manager()

def excel_loader(spec: DatasetSpec) -> pl.LazyFrame:
    """Custom loader for Excel files."""
    source = str(spec.source)
    if source.endswith(".xlsx") or source.endswith(".xls"):
        return pl.read_excel(source).lazy()
    raise ValueError(f"Excel loader can only handle .xlsx/.xls files, got: {source}")

# Register the loader
register_dataset_loader("excel", excel_loader)

# Now use it in datasets
from tnp_statistic_library._internal.datasets.datasets import Dataset
dataset = Dataset(type="excel", source="data/sales.xlsx")

For expensive operations like API calls, use pl.defer():

from tnp_statistic_library.plugins import DatasetSpec, register_dataset_loader

def api_loader(spec: DatasetSpec) -> pl.LazyFrame:
    """Load data from REST APIs."""
    source = str(spec.source)
    if not source.startswith("api://"):
        raise ValueError("API loader only supports api:// sources")

    def fetch_data():
        import requests

        url = source.replace("api://", "https://")
        response = requests.get(url)
        data = response.json()
        return pl.from_records(data)

    return pl.defer(fetch_data, schema={"id": pl.Int64, "name": pl.String})

# Register the API loader 
register_dataset_loader("api", api_loader)

The registration system supports:

  • register_dataset_loader(name, loader_func, overwrite=False): Register a new loader function
  • list_dataset_loaders(): List all registered loaders
  • unregister_dataset_loader(name): Remove a loader
  • clear_dataset_loaders(): Remove all user-registered loaders
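
For example, after registering the excel and api loaders above, you can inspect or remove them. This is a short sketch using the functions listed above; the exact return value of list_dataset_loaders() is not specified here:

from tnp_statistic_library.plugins import (
    clear_dataset_loaders,
    list_dataset_loaders,
    unregister_dataset_loader,
)

print(list_dataset_loaders())     # should include the "excel" and "api" registrations

unregister_dataset_loader("api")  # remove a single loader
clear_dataset_loaders()           # or remove every user-registered loader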

Hook-based Plugins (For Distributed Packages)

For packages that distribute plugins, use the traditional hook system:

import polars as pl
from pluggy import HookimplMarker
from tnp_statistic_library.plugins import DatasetSpec, dataset_loader

hookimpl = HookimplMarker("tnp_statistic_library")

@dataset_loader("excel")
def scan_excel(spec: DatasetSpec) -> pl.LazyFrame:
    # Use pl.defer for true lazy loading
    return pl.defer(
        lambda: pl.read_excel(str(spec.source)),
        schema={"column1": pl.String, "column2": pl.Float64}  # Define expected schema
    )


class CustomLoaders:
    @hookimpl
    def dataset_loaders(self) -> list[object]:
        return [scan_excel]

Important: When using pl.defer(), you must provide the expected schema. This allows Polars to optimize query planning without executing the deferred function.

Using Custom Loaders

You can reference registered loaders by type in YAML configs:

datasets:
  excel_data:
    type: "excel"
    source: "data/spreadsheet.xlsx"

  api_data:
    type: "api"
    source: "api://example.com/data"

Dataset References in Metrics

Metrics reference datasets using the dataset field, which must match a key in the datasets section:

datasets:
  loan_data:
    type: csv
    source: data/loans.csv
collections:
  accuracy_analysis:
    metrics:
    - name: overall_accuracy
      metric_type: default_accuracy
      data_format: record
    dataset: loan_data

Data Loading Behavior

The Dataset class provides lazy loading with the following resolution order:

  1. Built-in Loader: Resolve loader by type (csv/parquet/ndjson/ipc/delta/iceberg)
  2. Plugin Loader: Resolve custom loader by type
  3. Error: Raise ValueError if no loader is registered for the type
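
The sketch below illustrates this resolution order conceptually (it is not the library's actual code):

# Conceptual illustration of the loader resolution order described above
def resolve_loader(dataset_type: str, builtin_loaders: dict, plugin_loaders: dict):
    if dataset_type in builtin_loaders:   # 1. built-in loaders (csv/parquet/ndjson/ipc/delta/iceberg)
        return builtin_loaders[dataset_type]
    if dataset_type in plugin_loaders:    # 2. registered or hook-based plugin loaders
        return plugin_loaders[dataset_type]
    raise ValueError(f"Unknown dataset loader type '{dataset_type}'")  # 3. no loader registered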

Error Handling

Common dataset-related errors:

  • "Unknown dataset loader type 'type'": Loader type not registered
  • "Dataset is missing required columns": Referenced columns don't exist in loaded data

Performance Considerations

  • Lazy Loading: Datasets use Polars LazyFrames for efficient memory usage
  • Format Selection: Parquet is generally fastest for large datasets
  • Remote Data: Consider caching for frequently accessed remote files
  • Column Selection: Only load required columns when possible (see the sketch below)
  • True Lazy Loading: Use pl.defer() in custom loaders for expensive operations (API calls, database queries, complex file parsing) to ensure execution is deferred until data is actually needed
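
The short Polars sketch below illustrates the lazy-loading and column-selection points; the file path and column names are illustrative:

import polars as pl

# Lazy scan: nothing is read yet; only the selected columns are loaded when collect() runs
lf = pl.scan_parquet("data/model_outputs.parquet").select(["score", "default_flag"])
df = lf.collect()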

Validation Rules

The recipe system enforces several validation rules:

  1. Fan-out Length Consistency: All fan-out lists must have the same non-zero length
  2. Dataset References: All dataset references must exist in the datasets section
  3. Required Fields: Each metric type has required configuration fields
  4. Data Type Validation: Numeric fields are validated for appropriate ranges
  5. Column Existence: Referenced columns must exist in the dataset

Error Messages

Common validation errors and their meanings:

  • "fan-out lists must share the same non-zero length": Fan-out lists have different lengths
  • "Dataset is missing required columns": Referenced columns don't exist in the data
  • "Config validation failed": Missing required fields or invalid values