Public API Reference¶
This section documents the public interface for the TNP Statistic Library.
Helper Functions Interface¶
The primary way to use the library is through helper functions that provide a simple, type-safe interface for calculating statistical metrics.
All functions support both record-level and summary-level data formats with automatic validation and optimization.
Organized by Category¶
- Accuracy Metrics - Default accuracy, EAD accuracy, Hosmer-Lemeshow test, Jeffreys test
- Discrimination Metrics - AUC (Area Under Curve), Gini coefficient, Kolmogorov-Smirnov test, F1 score, F2 score
- Normality Testing - Shapiro-Wilk test for distribution normality assessment
- Summary Statistics - Mean, median calculations
Complete Function Reference¶
metrics ¶
Metrics package - Public helper functions for statistical calculations.
This package provides the main public interface for calculating statistical metrics. All metric classes are internal implementations and should not be used directly.
Example usage
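A minimal sketch of typical usage, assuming a small Polars DataFrame; the mean helper and its parameters are taken from the Segmentation examples below, and the values here are purely illustrative:

import polars as pl
from tnp_statistic_library.metrics import mean

# Illustrative record-level data
df = pl.DataFrame({
    "region": ["North", "North", "South"],
    "exposure_amount": [1000.0, 2500.0, 1800.0],
})

# Mean exposure amount per region
result = mean(data=df, variable="exposure_amount", segment=["region"])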
describe ¶
Describe a metric's inputs and outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metric_type | str | The metric type name (e.g., "auc"). | required |
| data_format | str \| None | Optional format filter ("record" or "summary"). | None |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with metric_type, data_formats, and per-format metadata. |
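The returned dictionary can be inspected directly; a minimal sketch that assumes only the metric_type and data_formats keys described above:

from tnp_statistic_library.metrics import describe

auc_info = describe("auc")
print(auc_info["metric_type"])   # "auc"
print(auc_info["data_formats"])  # which data formats this metric supports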
list_metrics ¶
Return metric types mapped to supported data formats.
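A minimal sketch of browsing the result, assuming a plain mapping of metric type to supported formats as described above:

from tnp_statistic_library.metrics import list_metrics

# Mapping of metric type -> supported data formats
for metric_type, formats in list_metrics().items():
    print(metric_type, formats)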
Discoverability¶
Use the helper functions to explore available metrics and their inputs/outputs:
from tnp_statistic_library.metrics import describe, list_metrics
available = list_metrics()
auc_details = describe("auc")
Error Handling¶
Configuration resolution errors are raised as typed exceptions:
from tnp_statistic_library.errors import InvalidDataFormatError, UnknownMetricTypeError
try:
describe("mean", data_format="summary")
except InvalidDataFormatError as exc:
print(exc)
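An unrecognised metric name would be reported analogously through UnknownMetricTypeError; a minimal sketch with a purely illustrative name:

from tnp_statistic_library.errors import UnknownMetricTypeError
from tnp_statistic_library.metrics import describe

try:
    describe("not_a_metric")  # illustrative, unrecognised metric name
except UnknownMetricTypeError as exc:
    print(exc)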
Recipes Interface¶
For batch processing and YAML-driven configurations, use the recipes module:
- Recipes Interface - load_configuration_from_yaml() function for YAML-based metric execution
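As a minimal sketch, assuming the function lives in tnp_statistic_library.recipes and accepts the path of a recipe file (the path below is hypothetical; see the Recipes Interface page for the actual YAML schema):

from tnp_statistic_library.recipes import load_configuration_from_yaml

# Hypothetical recipe path; the expected YAML structure is documented
# on the Recipes Interface page
config = load_configuration_from_yaml("recipes/portfolio_metrics.yml")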
Data Formats¶
The library supports two main data formats to accommodate different analysis scenarios:
Record-Level Data¶
Each row represents an individual observation (customer, loan, transaction):
- Best for: Raw model outputs, individual predictions, detailed analysis
- Performance: Optimal for large datasets with Polars lazy evaluation
- Segmentation: Full flexibility for grouping and filtering
Example columns:
- probability: Individual probability of default (0.0-1.0)
- default_flag: Binary outcome (0/1 or boolean)
- predicted_ead: Individual predicted exposure at default
- actual_ead: Individual actual exposure at default
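A minimal sketch of a record-level frame with these columns, passed to default_accuracy (parameter names as in the Segmentation and Performance examples below; values are illustrative):

import polars as pl
from tnp_statistic_library.metrics import default_accuracy

# One row per observation, illustrative values
records = pl.DataFrame({
    "probability": [0.02, 0.10, 0.45, 0.80],
    "default_flag": [0, 0, 1, 1],
    "predicted_ead": [1200.0, 800.0, 1500.0, 950.0],
    "actual_ead": [1100.0, 0.0, 1400.0, 900.0],
})

result = default_accuracy(
    data=records,
    data_format="record",
    prob_def="probability",
    default="default_flag",
)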
Summary-Level Data¶
Each row represents pre-aggregated statistics for a segment:
- Best for: Portfolio summaries, pre-computed statistics, reporting
- Performance: Fast calculations on aggregated data
- Segmentation: Limited to existing segment definitions
Example columns:
- mean_pd: Mean probability of default for the segment (0.0-1.0)
- defaults: Count of defaults in the segment (positive numbers or None for most metrics)
- volume: Total number of observations in the segment (positive numbers or None for most metrics)
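A minimal sketch of a summary-level frame with these columns; describe can be used to check which inputs a metric expects in this format (the "auc" name and all values are illustrative):

import polars as pl
from tnp_statistic_library.metrics import describe

# One row per pre-aggregated segment, illustrative values
summary = pl.DataFrame({
    "region": ["North", "South"],
    "mean_pd": [0.03, 0.07],
    "defaults": [12, 45],
    "volume": [400, 650],
})

# Raises InvalidDataFormatError if the metric does not support summary data
print(describe("auc", data_format="summary"))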
Segmentation¶
All metrics support flexible segmentation through the segment parameter:
Basic Segmentation¶
# Group by single column
result = default_accuracy(
data=df,
data_format="record",
prob_def="probability",
default="default_flag",
segment=["region"]
)
Multi-Level Segmentation¶
# Group by multiple columns
result = mean(
data=df,
variable="exposure_amount",
segment=["region", "product_type"]
)
Performance¶
Optimization Tips¶
- Use Summary-Level Data: Generally faster, since calculations run on pre-aggregated rows
- Lazy Evaluation: Datasets are processed efficiently with lazy evaluation
- Batch Operations: Recipes execute multiple metrics in parallel
- Memory Management: Large datasets are streamed rather than loaded entirely
Best Practices¶
# Efficient: Let Polars handle the optimization
result = default_accuracy(
data=large_df.lazy(), # Use lazy frames for large data
data_format="record",
prob_def="probability",
default="default_flag"
)
# Less efficient: Pre-filtering reduces optimization opportunities
filtered_df = large_df.filter(pl.col("region") == "North")
result = default_accuracy(
data=filtered_df,
data_format="record",
prob_def="probability",
default="default_flag"
)
Memory Considerations¶
- Large Datasets: Use pl.scan_csv() or similar scan functions
- Multiple Metrics: Use YAML recipes for batch processing
- Segmentation: Prefer single-pass segmentation over multiple separate calls
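Putting these together, a minimal sketch that streams a CSV via pl.scan_csv() and computes a segmented metric in a single pass (the file path and column names are illustrative):

import polars as pl
from tnp_statistic_library.metrics import default_accuracy

# Hypothetical file; scan_csv builds a LazyFrame that is streamed rather
# than loaded into memory up front
lazy_records = pl.scan_csv("portfolio_records.csv")

result = default_accuracy(
    data=lazy_records,
    data_format="record",
    prob_def="probability",
    default="default_flag",
    segment=["region"],
)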