Workflows Interface¶
The workflows module provides YAML-driven configuration for running batches of metrics through a single public function.
This is the recommended approach for:
- Batch processing of multiple metrics
- Standardized metric configurations
- Production environments with consistent setups
- Fan-out expansion for generating multiple metrics from lists
The load_configuration_from_yaml Function¶
The workflows module exposes a single public function that loads and parses YAML configuration files into executable metric collections.
load_configuration_from_yaml ¶
Load configuration from a YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| yaml_file | str \| Path | Path to a YAML file, or a raw YAML string | required |
Returns:
| Type | Description |
|---|---|
| Configuration | Configuration object that can be used to collect metrics |
Basic Usage Pattern¶
from tnp_statistic_library.workflows import load_configuration_from_yaml
# Load configuration from YAML file
config = load_configuration_from_yaml("metrics_config.yaml")
# Execute all metrics and collect results
results = config.metrics.collect_all()
# Convert to a single DataFrame for analysis
df = results.to_dataframe()
print(f"Executed {len(df)} metric results")
Input Options¶
The function accepts either file paths or raw YAML strings:
1. File Path (recommended)¶
from pathlib import Path
# Using string path
config = load_configuration_from_yaml("my_metrics.yaml")
# Using Path object
config = load_configuration_from_yaml(Path("configs/metrics.yaml"))
2. Raw YAML String¶
from tnp_statistic_library.workflows import load_configuration_from_yaml
yaml_content = """
datasets:
  my_data:
    location: "data.csv"
metrics:
  summary_stats:
    metric_type: mean
    config:
      name: ["average_value"]
      variable: "value_column"
      dataset: "my_data"
"""
config = load_configuration_from_yaml(yaml_content)
Configuration Object Structure¶
The returned Configuration object contains three main components:
Datasets (config.datasets)¶
- Type: Datasets (dictionary-like mapping)
- Purpose: Registry of all dataset definitions
- Access: config.datasets.root["dataset_name"]
Metrics (config.metrics)¶
- Type: MetricCollections (nested dictionary mapping)
- Purpose: Organized collections of validated metrics
- Structure: {collection_name: {metric_name: metric_instance}}
RAG Configurations (config.rag)¶
- Type: RagConfiguration (optional)
- Purpose: Red-Amber-Green threshold definitions for status reporting
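For example, the three components can be inspected directly after loading. This is a minimal sketch: the file name is a placeholder, and config.rag is assumed to be None when the YAML defines no RAG thresholds.
from tnp_statistic_library.workflows import load_configuration_from_yaml

config = load_configuration_from_yaml("metrics_config.yaml")

# Datasets registry: dictionary-like mapping keyed by dataset name
print(list(config.datasets.root.keys()))

# Metric collections: {collection_name: {metric_name: metric_instance}}
print(f"Loaded {len(config.metrics.root)} metric collections")

# Optional RAG thresholds (assumed to be None when not configured)
print(config.rag)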
Execution Workflow¶
The typical workflow involves three steps:
# 1. Load and validate configuration
config = load_configuration_from_yaml("metrics.yaml")
# 2. Execute all metrics (lazy evaluation)
results = config.metrics.collect_all()
# 3. Convert to DataFrame for analysis
df = results.to_dataframe()
Results Structure¶
config.metrics.collect_all() returns a MetricResultCollections object:
results = config.metrics.collect_all()
# Access individual collection results
for collection_name, collection_result in results.root.items():
    print(f"Collection: {collection_name}")
    # Access individual metric results in collection
    for metric_name, metric_result in collection_result.root.items():
        print(f" Metric: {metric_name}")
        print(f" Shape: {metric_result.dataframe.shape}")
        print(f" Type: {metric_result.metric.metric_type}")
# Convert all results to single DataFrame
df = results.to_dataframe()
Advanced Usage Examples¶
Working with Fan-out Metrics¶
Fan-out expansion allows creating multiple metrics from lists:
yaml_content = """
datasets:
  sales_data:
    location: "sales.csv"
metrics:
  regional_means:
    metric_type: mean
    config:
      name: "regions"
      variable: "sales_amount"
      segment: [["region"], ["other_region"]]
      dataset: "sales_data"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
# Fan-out expansion produced two metrics, one per segment list
assert len(results.root["regional_means"].root) == 2
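Building on the results structure shown earlier, the generated metrics can be inspected individually. The metric names depend on the library's fan-out naming scheme, so this sketch simply iterates whatever was produced:
# Each entry in the collection is one metric generated by the fan-out expansion
for metric_name, metric_result in results.root["regional_means"].root.items():
    print(f"Generated metric: {metric_name}")
    print(f"Result shape: {metric_result.dataframe.shape}")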
Segmented Analysis¶
yaml_content = """
datasets:
  customer_data:
    location: "customers.csv"
metrics:
  segmented_analysis:
    metric_type: mean
    config:
      name: ["customer_value"]
      variable: "purchase_amount"
      segment: [["customer_tier"]]  # Group by customer tier
      dataset: "customer_data"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
df = results.to_dataframe()
# Results will include separate rows for each customer_tier value
print(df.select(["metric_name", "customer_tier", "mean_value"]))
Multiple Metric Types¶
yaml_content = """
datasets:
  model_scores:
    location: "predictions.csv"
metrics:
  accuracy_check:
    metric_type: default_accuracy
    config:
      name: ["model_accuracy"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"
  auc_analysis:
    metric_type: auc
    config:
      name: ["model_auc"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
# Execute both accuracy and AUC metrics
df = results.to_dataframe()
print(df.select(["metric_name", "metric_type"]))
Error Handling¶
The function performs comprehensive validation and will raise errors for:
Configuration Validation Errors¶
from pydantic import ValidationError
try:
    config = load_configuration_from_yaml("metrics.yaml")
    results = config.metrics.collect_all()
except ValidationError as e:
    print(f"Configuration validation failed: {e}")
    # Handle specific validation issues
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")
File and YAML Errors¶
try:
    config = load_configuration_from_yaml("nonexistent.yaml")
except FileNotFoundError:
    print("YAML file not found")

try:
    config = load_configuration_from_yaml("invalid: yaml: content")
except Exception as e:
    print(f"YAML parsing failed: {e}")
Common Validation Issues¶
- Dataset Reference Errors: Unknown dataset keys in metric configurations
- Fan-out Mismatches: Lists in the name and segment fields have different lengths (see the sketch after this list)
- Missing Required Fields: Metric-specific required configuration missing
- Column Validation: Required columns missing from datasets during execution
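As a sketch of the fan-out mismatch case: the configuration below lists two names but only one segment list, so loading is expected to fail with a pydantic ValidationError before any data is read. The dataset and column names here are placeholders.
from pydantic import ValidationError
from tnp_statistic_library.workflows import load_configuration_from_yaml

bad_yaml = """
datasets:
  my_data:
    location: "data.csv"
metrics:
  mismatched:
    metric_type: mean
    config:
      name: ["metric_a", "metric_b"]
      variable: "value_column"
      segment: [["region"]]
      dataset: "my_data"
"""

try:
    load_configuration_from_yaml(bad_yaml)
except ValidationError as e:
    print(f"Fan-out mismatch rejected: {e}")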
Performance Considerations¶
- Lazy Evaluation: Metrics are not executed until collect_all() is called
- Batch Processing: All metrics are executed efficiently in a single batch via polars.collect_all()
- Memory Management: Large datasets are processed lazily until final collection
config = load_configuration_from_yaml("large_metrics.yaml")
# Configuration loaded and validated, but no data processing yet
print(f"Loaded {len(config.metrics.root)} metric collections")
# Data processing happens here
results = config.metrics.collect_all()
Related Documentation¶
- Workflows Guide - Complete documentation for YAML workflow configurations
- Workflow Examples - Example YAML configurations and usage patterns
- Schema Reference - Detailed YAML structure and validation rules