Configuration Overview

YAML Structure

Workflow configurations consist of two main sections:

Metrics Section

The metrics section defines the statistical calculations to perform:

metrics:
  unique_metric_id:
    metric_type: "metric_name"
    config:
      # Configuration fields specific to the metric type

Datasets Section

The datasets section defines data sources:

datasets:
  dataset_id:
    location: "path/to/file.csv"
    # Additional dataset configuration (future extensions)
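
Putting the two sections together, a complete workflow file has this shape (the ids and field values are illustrative; each metric type defines its own required config fields):

metrics:
  accuracy_analysis:
    metric_type: "default_accuracy"
    config:
      name: "overall_accuracy"
      dataset: "loan_data"
      # ... metric-specific fields

datasets:
  loan_data:
    location: "data/loans.csv"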

Fan-out Expansion

Fan-out expansion allows you to create multiple metrics from a single configuration by providing lists for certain fields.

Rules for Fan-out

  1. Field Marking: Only fields marked as "fan-out" support list expansion
  2. Length Matching: All fan-out lists must have the same length
  3. Automatic Expansion: Each position in the lists creates one metric instance

Fan-out Fields

The following fields support fan-out expansion:

  • name: List of metric names
  • segment: List of segmentation configurations

Example: Basic Fan-out

metrics:
  accuracy_metrics:
    metric_type: "default_accuracy"
    config:
      name: ["overall_accuracy", "product_accuracy"]
      segment: [null, ["product_type"]]
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
      dataset: "loan_data"

This creates two metrics:

  1. overall_accuracy with no segmentation
  2. product_accuracy segmented by product_type
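
Conceptually, the expansion above is equivalent to writing the two metrics out by hand (the expanded ids shown here are only illustrative; the library derives its own identifiers):

metrics:
  overall_accuracy:
    metric_type: "default_accuracy"
    config:
      name: "overall_accuracy"
      segment: null
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
      dataset: "loan_data"
  product_accuracy:
    metric_type: "default_accuracy"
    config:
      name: "product_accuracy"
      segment: ["product_type"]
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
      dataset: "loan_data"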

Example: Complex Fan-out

metrics:
  comprehensive_analysis:
    metric_type: "auc"
    config:
      name: ["total_auc", "region_auc", "product_region_auc"]
      segment: [null, ["region"], ["product_type", "region"]]
      data_format: "record_level"
      prob_def: "score"
      default: "default"
      dataset: "model_data"

This creates three AUC metrics with different segmentation strategies.

Segment Configuration

Segments define how to group data for analysis:

Segment Types

  • No Segmentation: null or omit the field
  • Single Column: ["column_name"]
  • Multiple Columns: ["col1", "col2", "col3"]

Example Segment Configurations

# No segmentation - analyze entire dataset
segment: null

# Single segmentation - group by product type
segment: ["product_type"]

# Multi-level segmentation - group by product and region
segment: ["product_type", "region"]

# Fan-out with different segmentation levels
segment: [null, ["region"], ["product_type", "region"]]

Dataset Configuration

Datasets define data sources that metrics can reference. The system uses Polars for efficient data processing and supports various file formats and custom data loaders.

Basic Dataset Structure

datasets:
  dataset_id:
    location: "path/to/file.csv"
    loader: "optional_loader_name" # Optional: explicit loader specification

Supported File Formats

The system automatically detects file formats based on file extensions:

Extension   Format                   Polars Scanner
.csv        Comma-separated values   pl.scan_csv
.parquet    Apache Parquet           pl.scan_parquet
.ndjson     Newline-delimited JSON   pl.scan_ndjson
.ipc        Apache Arrow IPC         pl.scan_ipc
.feather    Feather format           pl.scan_ipc
.delta      Delta Lake               pl.scan_delta
.iceberg    Apache Iceberg           pl.scan_iceberg
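
As a rough illustration, this extension-based dispatch can be thought of as a lookup from file suffix to Polars scanner (a simplified sketch, not the library's internal implementation):

from pathlib import Path

import polars as pl

# Illustrative mapping from file extension to Polars scanner
SCANNERS = {
    ".csv": pl.scan_csv,
    ".parquet": pl.scan_parquet,
    ".ndjson": pl.scan_ndjson,
    ".ipc": pl.scan_ipc,
    ".feather": pl.scan_ipc,
    ".delta": pl.scan_delta,
    ".iceberg": pl.scan_iceberg,
}

def scan_by_extension(location: str) -> pl.LazyFrame:
    """Pick a scanner based on the file extension (sketch only)."""
    scanner = SCANNERS.get(Path(location).suffix.lower())
    if scanner is None:
        raise ValueError(f"No loader for '{location}'")
    return scanner(location)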

Example Dataset Configurations

datasets:
  # CSV file with automatic format detection
  loan_data:
    location: "data/loans.csv"

  # Parquet file for efficient storage
  model_scores:
    location: "data/model_outputs.parquet"

  # Remote Parquet file
  external_data:
    location: "s3://bucket/data/scores.parquet"

  # JSON lines format
  event_data:
    location: "logs/events.ndjson"

  # Delta Lake table
  warehouse_data:
    location: "warehouse/customer_data"

Custom Data Loaders

For unsupported formats or data sources, you can create custom data loader plugins. The library provides two approaches: a simplified registration system for interactive use, and the traditional hook-based system for distributed plugins.

Registration System (For Interactive Use)

The easiest way to create a custom loader is to use the registration system:

import polars as pl
from tnp_statistic_library.api import register_named_loader, reset_plugin_manager

# Reset for clean state
reset_plugin_manager()

def excel_loader(location: str) -> pl.LazyFrame:
    """Custom loader for Excel files."""
    if location.endswith('.xlsx') or location.endswith('.xls'):
        return pl.read_excel(location).lazy()
    else:
        raise ValueError(f"Excel loader can only handle .xlsx/.xls files, got: {location}")

# Register the loader
register_named_loader("excel", excel_loader)

# Now use it in datasets
from tnp_statistic_library._internal.datasets.datasets import Dataset
dataset = Dataset(location="data/sales.xlsx", loader="excel")

For expensive operations like API calls, use pl.defer():

from tnp_statistic_library.api import register_data_loader

def api_loader(location: str) -> pl.LazyFrame | None:
    """Load data from REST APIs."""
    if location.startswith("api://"):
        def fetch_data():
            import requests
            url = location.replace("api://", "https://")
            response = requests.get(url)
            data = response.json()
            return pl.from_records(data)

        return pl.defer(fetch_data, schema={"id": pl.Int64, "name": pl.String})
    return None

# Register the API loader 
register_data_loader("api", api_loader)

The registration system provides the following helpers:

  • register_data_loader(name, loader_func, overwrite=False): Register a new loader function
  • register_named_loader(name, loader_func, overwrite=False): Register a named loader
  • list_data_loaders(): List all registered loaders
  • unregister_data_loader(name): Remove a loader
  • clear_data_loaders(): Remove all loaders
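
For example, the inspection and cleanup helpers can be combined as follows (a brief sketch, assuming these functions are importable from tnp_statistic_library.api alongside the registration functions shown earlier):

from tnp_statistic_library.api import (
    clear_data_loaders,
    list_data_loaders,
    unregister_data_loader,
)

# Inspect which loaders are currently registered
print(list_data_loaders())  # e.g. ["excel", "api"]

# Remove a single loader by name, or clear everything
unregister_data_loader("excel")
clear_data_loaders()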

Hook-based Plugins (For Distributed Packages)

For packages that distribute plugins, use the traditional hook system:

from pathlib import Path
import polars as pl
from pluggy import HookimplMarker

hookimpl = HookimplMarker("tnp_statistic_library")

class CustomLoader:
    @hookimpl(tryfirst=True)
    def data_loader(self, location: str) -> pl.LazyFrame | None:
        # Handle .xlsx files
        if Path(location).suffix.lower() == ".xlsx":
            # Use pl.defer for true lazy loading
            return pl.defer(
                lambda: pl.read_excel(location),
                schema={"column1": pl.String, "column2": pl.Float64}  # Define expected schema
            )
        return None

Important: When using pl.defer(), you must provide the expected schema. This allows Polars to optimize query planning without executing the deferred function.

Using Named Loaders

You can reference registered loaders by name in YAML configs:

datasets:
  excel_data:
    location: "data/spreadsheet.xlsx"
    loader: "excel" # References registered loader

  api_data:
    location: "api://example.com/data"
    loader: "api" # Custom API data loader

Dataset References in Metrics

Metrics reference datasets using the dataset field, which must match a key in the datasets section:

metrics:
  accuracy_analysis:
    metric_type: "default_accuracy"
    config:
      name: "overall_accuracy"
      dataset: "loan_data" # References the dataset below
      # ... other config fields

datasets:
  loan_data:
    location: "data/loans.csv"

Data Loading Behavior

The Dataset class provides lazy loading with the following resolution order:

  1. Named User Loaders: If loader is specified, check user-registered loaders first
  2. Named Plugin Loaders: If loader is specified, check hook-based plugin loaders
  3. Extension-based: Match file extension to built-in scanners
  4. Plugin Discovery: Query all registered data_loader plugins
  5. Error: Raise ValueError if no loader can handle the location
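
The order above can be pictured roughly as follows. This is a simplified sketch with hypothetical registries standing in for the library's internals, not its actual code:

from pathlib import Path

import polars as pl

# Hypothetical stand-ins for the library's internal registries and scanner table
user_loaders: dict = {}    # named loaders added via the registration system
plugin_loaders: dict = {}  # named loaders contributed by hook-based plugins
scanners = {".csv": pl.scan_csv, ".parquet": pl.scan_parquet}

def resolve(location: str, loader: str | None = None) -> pl.LazyFrame:
    """Sketch of the resolution order described above."""
    if loader is not None:
        # 1. Named user loaders are checked first
        if loader in user_loaders:
            return user_loaders[loader](location)
        # 2. Then named hook-based plugin loaders
        if loader in plugin_loaders:
            return plugin_loaders[loader](location)
        raise ValueError(f"Unknown data loader '{loader}'")

    # 3. Extension-based: match the file extension to a built-in scanner
    scanner = scanners.get(Path(location).suffix.lower())
    if scanner is not None:
        return scanner(location)

    # 4. Plugin discovery: the first data_loader plugin that handles the location wins
    for plugin in plugin_loaders.values():
        frame = plugin(location)
        if frame is not None:
            return frame

    # 5. No loader could handle the location
    raise ValueError(f"No loader for '{location}'")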

Error Handling

Common dataset-related errors:

  • "Unknown data loader 'loader_name'": Named loader plugin not found
  • "No loader for 'location'": No scanner or plugin can handle the file format
  • "Dataset is missing required columns": Referenced columns don't exist in loaded data

Performance Considerations

  • Lazy Loading: Datasets use Polars LazyFrames for efficient memory usage
  • Format Selection: Parquet is generally fastest for large datasets
  • Remote Data: Consider caching for frequently accessed remote files
  • Column Selection: Only load required columns when possible (see the sketch after this list)
  • True Lazy Loading: Use pl.defer() in custom loaders for expensive operations (API calls, database queries, complex file parsing) to ensure execution is deferred until data is actually needed
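
For the column-selection point, lazy scans let Polars push the projection down so only the required columns are read from disk (a small illustrative snippet; whether these particular columns exist in the file is an assumption):

import polars as pl

# Only the two selected columns are read; everything else is skipped at scan time
scores = (
    pl.scan_parquet("data/model_outputs.parquet")
    .select(["probability", "default_flag"])
    .collect()
)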

Validation Rules

The workflow system enforces several validation rules:

  1. Fan-out Length Consistency: All fan-out lists must have the same non-zero length (see the example after this list)
  2. Dataset References: All dataset references must exist in the datasets section
  3. Required Fields: Each metric type has required configuration fields
  4. Data Type Validation: Numeric fields are validated for appropriate ranges
  5. Column Existence: Referenced columns must exist in the dataset
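
For example, the following fan-out violates rule 1: name provides two entries while segment provides three, so the configuration is rejected with the length-mismatch error listed below (illustrative config):

metrics:
  broken_fanout:
    metric_type: "auc"
    config:
      name: ["total_auc", "region_auc"]
      segment: [null, ["region"], ["product_type", "region"]]
      data_format: "record_level"
      prob_def: "score"
      default: "default"
      dataset: "model_data"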

Error Messages

Common validation errors and their meanings:

  • "fan-out lists must share the same non-zero length": Fan-out lists have different lengths
  • "Dataset is missing required columns": Referenced columns don't exist in the data
  • "Config validation failed": Missing required fields or invalid values