Recipes Interface

The recipes module provides YAML-driven configuration for running multiple metrics in batches through a single public function.

This is the recommended approach for:

  • Batch processing of multiple metrics
  • Standardized metric configurations
  • Production environments with consistent setups
  • Fan-out expansion for generating multiple metrics from lists

The load_configuration_from_yaml Function

The recipes module exposes a single public function that loads and parses YAML configuration files into executable metric collections.

load_configuration_from_yaml

load_configuration_from_yaml(
    yaml_file: str | Path,
) -> Configuration

Load configuration from a YAML file or a raw YAML string.

Parameters:

  • yaml_file (str | Path, required): Path to a YAML file, or a raw YAML string

Returns:

  • Configuration: Configuration object that can be used to collect metrics

Example
config = load_configuration_from_yaml("metrics.yaml")
results = config.collections.run()

Basic Usage Pattern

from tnp_statistic_library.recipes import load_configuration_from_yaml

# Load configuration from YAML file
config = load_configuration_from_yaml("metrics_config.yaml")

# Execute all metrics and collect results
results = config.collections.run()

# Convert to a single DataFrame for analysis
df = results.to_dataframe()
print(f"Executed {len(df)} metric results")

Migration Notes

  • config.collections.run() replaces the old collect_all().
  • Results are now MetricCollectionsResult.
  • Access metric config via metric_result.config (not metric_result.metric).
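A minimal before/after sketch of these changes (assuming a config loaded as in the examples below; collect_all is shown only for contrast with the old API):

from tnp_statistic_library.recipes import load_configuration_from_yaml

config = load_configuration_from_yaml("metrics.yaml")

# Old API (no longer available):
#   results = collect_all(...)
#   spec = metric_result.metric

# New API:
results = config.collections.run()  # returns a MetricCollectionsResult
for collection_result in results.root.values():
    for metric_result in collection_result.root.values():
        spec = metric_result.config  # metric config moved from .metric to .config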

Input Options

The function accepts either a file path or a raw YAML string:

1. File Path

from pathlib import Path

# Using string path
config = load_configuration_from_yaml("my_metrics.yaml")

# Using Path object
config = load_configuration_from_yaml(Path("configs/metrics.yaml"))

2. Raw YAML String

from tnp_statistic_library.recipes import load_configuration_from_yaml

yaml_content = """
datasets:
  my_data:
    type: "csv"
    source: "data.csv"

collections:
  summary_stats:
    dataset: "my_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: "average_value"
        variable: "value_column"
"""
config = load_configuration_from_yaml(yaml_content)

Configuration Object Structure

The returned Configuration object contains three main components (see the sketch after these lists):

Datasets (config.datasets)

  • Type: Datasets (dictionary-like mapping)
  • Purpose: Registry of all dataset definitions
  • Access: config.datasets.root["dataset_name"]

Collections (config.collections)

  • Type: Collections (nested dictionary mapping)
  • Purpose: Organized collections of validated metrics
  • Structure: {collection_name: {metric_name: metric_spec}}

RAG Configurations (config.rag)

  • Type: RagConfiguration (optional)
  • Purpose: Red-Amber-Green threshold definitions for status reporting
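A minimal sketch inspecting each component (reusing the my_data dataset from the raw-YAML example above; treating config.rag as None when no RAG block is defined is an assumption here):

config = load_configuration_from_yaml(yaml_content)

# Datasets: dictionary-like registry of dataset definitions
dataset = config.datasets.root["my_data"]

# Collections: nested mapping {collection_name: {metric_name: metric_spec}}
print(f"Loaded {len(config.collections.root)} collections")

# RAG thresholds are optional
if config.rag is not None:
    print("RAG thresholds configured")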

Execution Recipe

The typical recipe involves three steps:

# 1. Load and validate configuration
config = load_configuration_from_yaml("metrics.yaml")

# 2. Execute all metrics (lazy plans are evaluated here)
results = config.collections.run()

# 3. Convert to DataFrame for analysis
df = results.to_dataframe()

Results Structure

config.collections.run() returns a MetricCollectionsResult object:

results = config.collections.run()

# Collection-level metadata (includes library_version by default)
print(results.metadata.library_provenance)
print(results.metadata.run_context)
print(results.metadata.config_provenance)

# Access individual collection results
for collection_name, collection_result in results.root.items():
    print(f"Collection: {collection_name}")

    # Access individual metric results in collection
    for metric_name, metric_result in collection_result.root.items():
        print(f"  Metric: {metric_name}")
        print(f"  Shape: {metric_result.dataframe.shape}")
        print(f"  Type: {metric_result.config.type}")

# Convert all results to single DataFrame
df = results.to_dataframe()

Advanced Usage Examples

Working with Fan-out Metrics

Fan-out expansion creates multiple metrics from parallel lists; each entry in name pairs positionally with the corresponding entry in segment:

yaml_content = """
datasets:
  sales_data:
    type: "csv"
    source: "sales.csv"

collections:
  regional_means:
    dataset: "sales_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: ["regions", "other_regions"]
        variable: "sales_amount"
        segment: [["region"], ["other_region"]]
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()

assert len(results.root["regional_means"].root) == 2

Segmented Analysis

yaml_content = """
datasets:
  customer_data:
    type: "csv"
    source: "customers.csv"

collections:
  segmented_analysis:
    dataset: "customer_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: "customer_value"
        variable: "purchase_amount"
        segment: ["customer_tier"]  # Group by customer tier
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()
df = results.to_dataframe()

# Results will include separate rows for each customer_tier value
print(df.select(["metric_name", "customer_tier", "mean_value"]))

Multiple Metric Types

yaml_content = """
datasets:
  model_scores:
    type: "csv"
    source: "predictions.csv"

collections:
  model_metrics:
    dataset: "model_scores"
    metrics:
      - metric_type: default_accuracy
        data_format: record
        name: "model_accuracy"
        prob_def: "predicted_prob"
        default: "actual_default"

      - metric_type: auc
        data_format: record
        name: "model_auc"
        prob_def: "predicted_prob"
        default: "actual_default"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()

# Execute both accuracy and AUC metrics
df = results.to_dataframe()
print(df.select(["metric_name", "metric_type"]))

Error Handling

The function performs comprehensive validation and raises errors in the following situations:

Configuration Validation Errors

from pydantic import ValidationError

try:
    config = load_configuration_from_yaml("metrics.yaml")
    results = config.collections.run()
except ValidationError as e:
    print(f"Configuration validation failed: {e}")
    # Handle specific validation issues
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")

File and YAML Errors

try:
    config = load_configuration_from_yaml("nonexistent.yaml")
except FileNotFoundError:
    print("YAML file not found")

try:
    config = load_configuration_from_yaml("invalid: yaml: content")
except Exception as e:
    print(f"YAML parsing failed: {e}")

Common Validation Issues

  1. Dataset Reference Errors: Unknown dataset keys in metric configurations
  2. Fan-out Mismatches: Lists in name and segment fields have different lengths (see the sketch after this list)
  3. Missing Required Fields: Metric-specific required configuration missing
  4. Column Validation: Required columns missing from datasets during execution
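
For example, a fan-out mismatch (issue 2 above) can be provoked with deliberately unequal lists; the exact error message is library-specific, so treat this as a sketch:

from pydantic import ValidationError

from tnp_statistic_library.recipes import load_configuration_from_yaml

bad_yaml = """
datasets:
  sales_data:
    type: "csv"
    source: "sales.csv"

collections:
  regional_means:
    dataset: "sales_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: ["regions", "other_regions"]  # two names...
        variable: "sales_amount"
        segment: [["region"]]               # ...but only one segment entry
"""

try:
    config = load_configuration_from_yaml(bad_yaml)
except ValidationError as e:
    print(f"Fan-out mismatch rejected at load time: {e}")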

Performance Considerations

  • Lazy Evaluation: Metrics are not executed until run() is called
  • Batch Processing: All metrics are executed efficiently in a single batch via polars.collect_all()
  • Memory Management: Large datasets are processed lazily until final collection

config = load_configuration_from_yaml("large_metrics.yaml")

# Configuration loaded and validated, but no data processing yet
print(f"Loaded {len(config.collections.root)} metric collections")

# Data processing happens here
results = config.collections.run()