Workflows Interface¶
The workflows module provides YAML-driven configuration for running batches of metrics through a single public function.
This is the recommended approach for:
- Batch processing of multiple metrics
- Standardized metric configurations
- Production environments with consistent setups
- Fan-out expansion for generating multiple metrics from lists
The load_configuration_from_yaml Function¶
The workflows module exposes a single public function that loads and parses YAML configuration files into executable metric collections.
load_configuration_from_yaml ¶
Load configuration from a YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| yaml_file | str \| Path | Path to a YAML file, or a raw YAML string | required |
Returns:
| Type | Description |
|---|---|
| Configuration | Configuration object that can be used to collect metrics |
Basic Usage Pattern¶
from tnp_statistic_library.workflows import load_configuration_from_yaml
# Load configuration from YAML file
config = load_configuration_from_yaml("metrics_config.yaml")
# Execute all metrics and collect results
results = config.metrics.collect_all()
# Convert to a single DataFrame for analysis
df = results.to_dataframe()
print(f"Executed {len(df)} metric results")
Input Options¶
The function accepts either file paths or raw YAML strings:
1. File Path (recommended)¶
from pathlib import Path
# Using string path
config = load_configuration_from_yaml("my_metrics.yaml")
# Using Path object
config = load_configuration_from_yaml(Path("configs/metrics.yaml"))
2. Raw YAML String¶
from tnp_statistic_library.workflows import load_configuration_from_yaml
yaml_content = """
datasets:
  my_data:
    location: "data.csv"
metrics:
  summary_stats:
    metric_type: mean
    config:
      name: ["average_value"]
      variable: "value_column"
      dataset: "my_data"
"""
config = load_configuration_from_yaml(yaml_content)
Configuration Object Structure¶
The returned Configuration object contains three main components:
Datasets (config.datasets)¶
- Type: Datasets (dictionary-like mapping)
- Purpose: Registry of all dataset definitions
- Access: config.datasets.root["dataset_name"]
Metrics (config.metrics)¶
- Type: MetricCollections (nested dictionary mapping)
- Purpose: Organized collections of validated metrics
- Structure: {collection_name: {metric_name: metric_instance}}
RAG Configurations (config.rag)¶
- Type: RagConfiguration (optional)
- Purpose: Red-Amber-Green threshold definitions for status reporting
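For example, the three components can be inspected directly after loading. This is a minimal sketch: the file name is a placeholder, and config.rag is assumed to be None when the YAML defines no RAG thresholds.
from tnp_statistic_library.workflows import load_configuration_from_yaml

config = load_configuration_from_yaml("metrics_config.yaml")

# Datasets registry: dictionary-like mapping keyed by dataset name
print(list(config.datasets.root.keys()))

# Metric collections: {collection_name: {metric_name: metric_instance}}
print(f"Loaded {len(config.metrics.root)} metric collections")

# Optional RAG thresholds (assumed to be None when not configured)
print(config.rag)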
Execution Workflow¶
The typical workflow involves three steps:
# 1. Load and validate configuration
config = load_configuration_from_yaml("metrics.yaml")
# 2. Execute all metrics (lazy evaluation)
results = config.metrics.collect_all()
# 3. Convert to DataFrame for analysis
df = results.to_dataframe()
Results Structure¶
config.metrics.collect_all() returns a MetricResultCollections object:
results = config.metrics.collect_all()
# Access individual collection results
for collection_name, collection_result in results.root.items():
    print(f"Collection: {collection_name}")
    # Access individual metric results in collection
    for metric_name, metric_result in collection_result.root.items():
        print(f" Metric: {metric_name}")
        print(f" Shape: {metric_result.dataframe.shape}")
        print(f" Type: {metric_result.metric.metric_type}")
# Convert all results to single DataFrame
df = results.to_dataframe()
Advanced Usage Examples¶
Working with Fan-out Metrics¶
Fan-out expansion allows creating multiple metrics from lists:
yaml_content = """
datasets:
  sales_data:
    location: "sales.csv"
metrics:
  regional_means:
    metric_type: mean
    config:
      name: "regions"
      variable: "sales_amount"
      segment: [["region"], ["other_region"]]
      dataset: "sales_data"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
# Fan-out expansion produced two metrics, one per segment list
assert len(results.root["regional_means"].root) == 2
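Building on the results structure shown earlier, the generated metrics can be inspected individually. The metric names depend on the library's fan-out naming scheme, so this sketch simply iterates whatever was produced:
# Each entry in the collection is one metric generated by the fan-out expansion
for metric_name, metric_result in results.root["regional_means"].root.items():
    print(f"Generated metric: {metric_name}")
    print(f"Result shape: {metric_result.dataframe.shape}")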
Segmented Analysis¶
yaml_content = """
datasets:
  customer_data:
    location: "customers.csv"
metrics:
  segmented_analysis:
    metric_type: mean
    config:
      name: ["customer_value"]
      variable: "purchase_amount"
      segment: [["customer_tier"]]  # Group by customer tier
      dataset: "customer_data"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
df = results.to_dataframe()
# Results will include separate rows for each customer_tier value
print(df.select(["metric_name", "customer_tier", "mean_value"]))
Multiple Metric Types¶
yaml_content = """
datasets:
  model_scores:
    location: "predictions.csv"
metrics:
  accuracy_check:
    metric_type: default_accuracy
    config:
      name: ["model_accuracy"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"
  auc_analysis:
    metric_type: auc
    config:
      name: ["model_auc"]
      dataset: "model_scores"
      prob_def: "predicted_prob"
      default: "actual_default"
"""
config = load_configuration_from_yaml(yaml_content)
results = config.metrics.collect_all()
# Execute both accuracy and AUC metrics
df = results.to_dataframe()
print(df.select(["metric_name", "metric_type"]))
Error Handling¶
The function performs comprehensive validation and will raise errors for:
Configuration Validation Errors¶
from pydantic import ValidationError
try:
    config = load_configuration_from_yaml("metrics.yaml")
    results = config.metrics.collect_all()
except ValidationError as e:
    print(f"Configuration validation failed: {e}")
    # Handle specific validation issues
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")
File and YAML Errors¶
try:
    config = load_configuration_from_yaml("nonexistent.yaml")
except FileNotFoundError:
    print("YAML file not found")

try:
    config = load_configuration_from_yaml("invalid: yaml: content")
except Exception as e:
    print(f"YAML parsing failed: {e}")
Common Validation Issues¶
- Dataset Reference Errors: Unknown dataset keys in metric configurations
- Fan-out Mismatches: Lists in the name and segment fields have different lengths (see the sketch after this list)
- Missing Required Fields: Metric-specific required configuration missing
- Column Validation: Required columns missing from datasets during execution
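As a sketch of the fan-out mismatch case: the configuration below lists two names but only one segment list, so loading is expected to fail with a pydantic ValidationError before any data is read. The dataset and column names here are placeholders.
from pydantic import ValidationError
from tnp_statistic_library.workflows import load_configuration_from_yaml

bad_yaml = """
datasets:
  my_data:
    location: "data.csv"
metrics:
  mismatched:
    metric_type: mean
    config:
      name: ["metric_a", "metric_b"]
      variable: "value_column"
      segment: [["region"]]
      dataset: "my_data"
"""

try:
    load_configuration_from_yaml(bad_yaml)
except ValidationError as e:
    print(f"Fan-out mismatch rejected: {e}")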
Performance Considerations¶
- Lazy Evaluation: Metrics are not executed until collect_all() is called
- Batch Processing: All metrics are executed efficiently in a single batch via polars.collect_all()
- Memory Management: Large datasets are processed lazily until final collection
config = load_configuration_from_yaml("large_metrics.yaml")
# Configuration loaded and validated, but no data processing yet
print(f"Loaded {len(config.metrics.root)} metric collections")
# Data processing happens here
results = config.metrics.collect_all()
Related Documentation¶
- Workflows Guide - Complete documentation for YAML workflow configurations
- Workflow Examples - Example YAML configurations and usage patterns
- Schema Reference - Detailed YAML structure and validation rules