Recipes Interface

The recipes module provides YAML-driven configuration for running multiple metrics in batches through a single public function.

This is the recommended approach for:

  • Batch processing of multiple metrics
  • Standardized metric configurations
  • Production environments with consistent setups
  • Fan-out expansion for generating multiple metrics from lists

The load_configuration_from_yaml Function

The recipes module exposes a single public function that loads and parses YAML configuration files into executable metric collections.

load_configuration_from_yaml

load_configuration_from_yaml(
    yaml_file: str | Path,
) -> Configuration

Load configuration from a YAML file or a raw YAML string.

Parameters:

  • yaml_file (str | Path, required): Path to a YAML file, or a raw YAML string

Returns:

  • Configuration: Configuration object that can be used to collect metrics

Example
config = load_configuration_from_yaml("metrics.yaml")
results = config.collections.run()

Basic Usage Pattern

from tnp_statistic_library.recipes import load_configuration_from_yaml

# Load configuration from YAML file
config = load_configuration_from_yaml("metrics_config.yaml")

# Execute all metrics and collect results
results = config.collections.run()

# Convert to a single DataFrame for analysis
df = results.to_dataframe()
print(f"Executed {len(df)} metric results")

Migration Notes

  • config.collections.run() replaces the old collect_all().
  • Results are now MetricCollectionsResult.
  • Access metric config via metric_result.config (not metric_result.metric).
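A minimal before/after sketch of these changes (assuming a config loaded as in the examples below; collect_all is shown only for contrast with the old API):

from tnp_statistic_library.recipes import load_configuration_from_yaml

config = load_configuration_from_yaml("metrics.yaml")

# Old API (no longer available):
#   results = collect_all(...)
#   spec = metric_result.metric

# New API:
results = config.collections.run()  # returns a MetricCollectionsResult
for collection_result in results.root.values():
    for metric_result in collection_result.root.values():
        spec = metric_result.config  # metric config moved from .metric to .config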

Input Options

The function accepts either a file path or a raw YAML string:

1. File Path

from pathlib import Path

# Using string path
config = load_configuration_from_yaml("my_metrics.yaml")

# Using Path object
config = load_configuration_from_yaml(Path("configs/metrics.yaml"))

2. Raw YAML String

from tnp_statistic_library.recipes import load_configuration_from_yaml

yaml_content = """
datasets:
  my_data:
    type: "csv"
    source: "data.csv"

collections:
  summary_stats:
    dataset: "my_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: "average_value"
        variable: "value_column"
"""
config = load_configuration_from_yaml(yaml_content)

Configuration Object Structure

The returned Configuration object contains three main components (see the sketch after these lists):

Datasets (config.datasets)

  • Type: Datasets (dictionary-like mapping)
  • Purpose: Registry of all dataset definitions
  • Access: config.datasets.root["dataset_name"]

Collections (config.collections)

  • Type: Collections (nested dictionary mapping)
  • Purpose: Organized collections of validated metrics
  • Structure: {collection_name: {metric_name: metric_spec}}

RAG Configurations (config.rag)

  • Type: RagConfiguration (optional)
  • Purpose: Red-Amber-Green threshold definitions for status reporting
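A minimal sketch inspecting each component (reusing the my_data dataset from the raw-YAML example above; treating config.rag as None when no RAG block is defined is an assumption here):

config = load_configuration_from_yaml(yaml_content)

# Datasets: dictionary-like registry of dataset definitions
dataset = config.datasets.root["my_data"]

# Collections: nested mapping {collection_name: {metric_name: metric_spec}}
print(f"Loaded {len(config.collections.root)} collections")

# RAG thresholds are optional
if config.rag is not None:
    print("RAG thresholds configured")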

Execution Recipe

The typical recipe involves three steps:

# 1. Load and validate configuration
config = load_configuration_from_yaml("metrics.yaml")

# 2. Execute all metrics (lazy plans are evaluated here)
results = config.collections.run()

# 3. Convert to DataFrame for analysis
df = results.to_dataframe()

Results Structure

config.collections.run() returns a MetricCollectionsResult object:

results = config.collections.run()

# Collection-level metadata (includes library_version by default)
print(results.metadata.library_provenance)
print(results.metadata.run_context)
print(results.metadata.config_provenance)

# Access individual collection results
for collection_name, collection_result in results.root.items():
    print(f"Collection: {collection_name}")

    # Access individual metric results in collection
    for metric_name, metric_result in collection_result.root.items():
        print(f"  Metric: {metric_name}")
        print(f"  Shape: {metric_result.dataframe.shape}")
        print(f"  Type: {metric_result.config.type}")

# Convert all results to single DataFrame
df = results.to_dataframe()

Advanced Usage Examples

Working with Fan-out Metrics

Fan-out expansion creates multiple metrics from parallel lists; each entry in name pairs positionally with the corresponding entry in segment:

yaml_content = """
datasets:
  sales_data:
    type: "csv"
    source: "sales.csv"

collections:
  regional_means:
    dataset: "sales_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: ["regions", "other_regions"]
        variable: "sales_amount"
        segment: [["region"], ["other_region"]]
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()

assert len(results.root["regional_means"].root) == 2

Segmented Analysis

yaml_content = """
datasets:
  customer_data:
    type: "csv"
    source: "customers.csv"

collections:
  segmented_analysis:
    dataset: "customer_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: "customer_value"
        variable: "purchase_amount"
        segment: ["customer_tier"]  # Group by customer tier
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()
df = results.to_dataframe()

# Results will include separate rows for each customer_tier value
print(df.select(["metric_name", "customer_tier", "mean_value"]))

Multiple Metric Types

yaml_content = """
datasets:
  model_scores:
    type: "csv"
    source: "predictions.csv"

collections:
  model_metrics:
    dataset: "model_scores"
    metrics:
      - metric_type: default_accuracy
        data_format: record
        name: "model_accuracy"
        prob_def: "predicted_prob"
        default: "actual_default"

      - metric_type: auc
        data_format: record
        name: "model_auc"
        prob_def: "predicted_prob"
        default: "actual_default"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.collections.run()

# Execute both accuracy and AUC metrics
df = results.to_dataframe()
print(df.select(["metric_name", "metric_type"]))

Error Handling

The function performs comprehensive validation and raises errors in the following situations:

Configuration Validation Errors

from pydantic import ValidationError

try:
    config = load_configuration_from_yaml("metrics.yaml")
    results = config.collections.run()
except ValidationError as e:
    print(f"Configuration validation failed: {e}")
    # Handle specific validation issues
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")

File and YAML Errors

try:
    config = load_configuration_from_yaml("nonexistent.yaml")
except FileNotFoundError:
    print("YAML file not found")

try:
    config = load_configuration_from_yaml("invalid: yaml: content")
except Exception as e:
    print(f"YAML parsing failed: {e}")

Common Validation Issues

  1. Dataset Reference Errors: Unknown dataset keys in metric configurations
  2. Fan-out Mismatches: Lists in name and segment fields have different lengths (see the sketch after this list)
  3. Missing Required Fields: Metric-specific required configuration missing
  4. Column Validation: Required columns missing from datasets during execution
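
For example, a fan-out mismatch (issue 2 above) can be provoked with deliberately unequal lists; the exact error message is library-specific, so treat this as a sketch:

from pydantic import ValidationError

from tnp_statistic_library.recipes import load_configuration_from_yaml

bad_yaml = """
datasets:
  sales_data:
    type: "csv"
    source: "sales.csv"

collections:
  regional_means:
    dataset: "sales_data"
    metrics:
      - metric_type: mean
        data_format: record
        name: ["regions", "other_regions"]  # two names...
        variable: "sales_amount"
        segment: [["region"]]               # ...but only one segment entry
"""

try:
    config = load_configuration_from_yaml(bad_yaml)
except ValidationError as e:
    print(f"Fan-out mismatch rejected at load time: {e}")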

Performance Considerations

  • Lazy Evaluation: Metrics are not executed until run() is called
  • Batch Processing: All metrics are executed efficiently in a single batch via polars.collect_all()
  • Memory Management: Large datasets are processed lazily until final collection

config = load_configuration_from_yaml("large_metrics.yaml")

# Configuration loaded and validated, but no data processing yet
print(f"Loaded {len(config.collections.root)} metric collections")

# Data processing happens here
results = config.collections.run()