Workflows Interface

The workflows module provides YAML-driven configuration for running multiple metrics in batches through a single public function; a minimal end-to-end example follows the list below.

This is the recommended approach for:

  • Batch processing of multiple metrics
  • Standardized metric configurations
  • Production environments with consistent setups
  • Fan-out expansion for generating multiple metrics from lists

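As a quick orientation, the whole pattern fits in a few lines. This sketch reuses the raw-YAML example detailed later on this page; the dataset path and column names are placeholders:

from tnp_statistic_library.workflows import load_configuration_from_yaml

yaml_content = """
datasets:
  my_data:
    location: "data.csv"

metrics:
  summary_stats:
    metric_type: mean
    config:
      name: ["average_value"]
      variable: "value_column"
      dataset: "my_data"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
df = results.to_dataframe()
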
The load_configuration_from_yaml Function

The workflows module exposes a single public function that loads and parses YAML configuration files into executable metric collections.

load_configuration_from_yaml

load_configuration_from_yaml(
    yaml_file: str | Path,
) -> Configuration

Load a configuration from a YAML file or raw YAML string.

Parameters:

Name        Type         Description                             Default
yaml_file   str | Path   Path to YAML file or raw YAML string    required

Returns:

Type            Description
Configuration   Configuration object that can be used to collect metrics

Example
config = load_configuration_from_yaml("metrics.yaml")
results = config.metrics.collect_all()

Basic Usage Pattern

from tnp_statistic_library.workflows import load_configuration_from_yaml

# Load configuration from YAML file
config = load_configuration_from_yaml("metrics_config.yaml")

# Execute all metrics and collect results
results = config.metrics.collect_all()

# Convert to a single DataFrame for analysis
df = results.to_dataframe()
print(f"Executed {len(df)} metric results")

Input Options

The function accepts either file paths or raw YAML strings:

1. File Path

from pathlib import Path

# Using string path
config = load_configuration_from_yaml("my_metrics.yaml")

# Using Path object
config = load_configuration_from_yaml(Path("configs/metrics.yaml"))

2. Raw YAML String

from tnp_statistic_library.workflows import load_configuration_from_yaml

yaml_content = """
datasets:
  my_data:
    location: "data.csv"

metrics:
  summary_stats:
    metric_type: mean
    config:
      name: ["average_value"]
      variable: "value_column"
      dataset: "my_data"
"""
config = load_configuration_from_yaml(yaml_content)

Configuration Object Structure

The returned Configuration object contains three main components; a short inspection sketch follows the subsections below.

Datasets (config.datasets)

  • Type: Datasets (dictionary-like mapping)
  • Purpose: Registry of all dataset definitions
  • Access: config.datasets.root["dataset_name"]

Metrics (config.metrics)

  • Type: MetricCollections (nested dictionary mapping)
  • Purpose: Organized collections of validated metrics
  • Structure: {collection_name: {metric_name: metric_instance}}

RAG Configurations (config.rag)

  • Type: RagConfiguration (optional)
  • Purpose: Red-Amber-Green threshold definitions for status reporting
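
A short sketch of inspecting these components after loading. The attribute access mirrors what is documented above; iterating collection.root assumes metric collections expose the same .root mapping as the result objects shown later:

config = load_configuration_from_yaml("metrics_config.yaml")

# Dataset registry keys
print(list(config.datasets.root.keys()))

# Nested metric collections: {collection_name: {metric_name: metric_instance}}
for collection_name, collection in config.metrics.root.items():
    for metric_name in collection.root:
        print(f"{collection_name}/{metric_name}")

# RAG thresholds are optional and may be absent
if config.rag is not None:
    print("RAG thresholds configured")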

Execution Workflow

The typical workflow involves three steps:

# 1. Load and validate configuration
config = load_configuration_from_yaml("metrics.yaml")

# 2. Execute all metrics (lazy plans are evaluated at this call)
results = config.metrics.collect_all()

# 3. Convert to DataFrame for analysis
df = results.to_dataframe()

Results Structure

config.metrics.collect_all() returns a MetricResultCollections object:

results = config.metrics.collect_all()

# Access individual collection results
for collection_name, collection_result in results.root.items():
    print(f"Collection: {collection_name}")

    # Access individual metric results in collection
    for metric_name, metric_result in collection_result.root.items():
        print(f"  Metric: {metric_name}")
        print(f"  Shape: {metric_result.dataframe.shape}")
        print(f"  Type: {metric_result.metric.metric_type}")

# Convert all results to single DataFrame
df = results.to_dataframe()

Advanced Usage Examples

Working with Fan-out Metrics

Fan-out expansion allows creating multiple metrics from lists:

yaml_content = """
datasets:
  sales_data:
    location: "sales.csv"

metrics:
  regional_means:
    metric_type: mean
    config:
      name: "regions"
      variable: "sales_amount"
      segment: [["region"], ["other_region"]]
      dataset: "sales_data"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()

assert len(results.root["regional_means"].root) == 2

Segmented Analysis

yaml_content = """
datasets:
  customer_data:
    location: "customers.csv"

metrics:
  segmented_analysis:
    metric_type: mean
    config:
      name: ["customer_value"]
      variable: "purchase_amount"
      segment: [["customer_tier"]]  # Group by customer tier
      dataset: "customer_data"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
df = results.to_dataframe()

# Results will include separate rows for each customer_tier value
print(df.select(["metric_name", "customer_tier", "mean_value"]))

Multiple Metric Types

yaml_content = """
datasets:
  model_scores:
    location: "predictions.csv"

metrics:
  accuracy_check:
    metric_type: default_accuracy
    config:
      name: ["model_accuracy"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"

  auc_analysis:
    metric_type: auc
    config:
      name: ["model_auc"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"
"""

config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()

# Execute both accuracy and AUC metrics
df = results.to_dataframe()
print(df.select(["metric_name", "metric_type"]))

Error Handling

The function validates the configuration as it loads and raises errors in the following cases:

Configuration Validation Errors

from pydantic import ValidationError

try:
    config = load_configuration_from_yaml("metrics.yaml")
    results = config.metrics.collect_all()
except ValidationError as e:
    print(f"Configuration validation failed: {e}")
    # Handle specific validation issues
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")

File and YAML Errors

try:
    config = load_configuration_from_yaml("nonexistent.yaml")
except FileNotFoundError:
    print("YAML file not found")

try:
    config = load_configuration_from_yaml("invalid: yaml: content")
except Exception as e:
    print(f"YAML parsing failed: {e}")

Common Validation Issues

  1. Dataset Reference Errors: Unknown dataset keys in metric configurations
  2. Fan-out Mismatches: Lists in the name and segment fields have different lengths (see the sketch after this list)
  3. Missing Required Fields: Metric-specific required configuration missing
  4. Column Validation: Required columns missing from datasets during execution
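
For example, a fan-out mismatch (issue 2) can be provoked deliberately. This sketch assumes the mismatch surfaces as a pydantic ValidationError, consistent with the handling example above:

from pydantic import ValidationError

from tnp_statistic_library.workflows import load_configuration_from_yaml

bad_yaml = """
datasets:
  d:
    location: "data.csv"

metrics:
  mismatched:
    metric_type: mean
    config:
      name: ["a", "b"]                # two names...
      variable: "value_column"
      segment: [["x"], ["y"], ["z"]]  # ...but three segment lists
      dataset: "d"
"""

try:
    load_configuration_from_yaml(bad_yaml)
except ValidationError as e:
    print(f"Fan-out mismatch rejected: {e}")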

Performance Considerations

  • Lazy Evaluation: Metrics are not executed until collect_all() is called
  • Batch Processing: All metrics are executed efficiently in a single batch via polars.collect_all()
  • Memory Management: Large datasets are processed lazily until final collection

config = load_configuration_from_yaml("large_metrics.yaml")

# Configuration loaded and validated, but no data processing yet
print(f"Loaded {len(config.metrics.root)} metric collections")

# Data processing happens here
results = config.metrics.collect_all()