Public API Reference¶
This section documents the public interface for the TNP Statistic Library.
Helper Functions Interface¶
The primary way to use the library is through the convenient helper functions that provide a simple, type-safe interface for calculating statistical metrics.
All functions support both record-level and summary-level data formats with automatic validation and optimization.
Organized by Category¶
- Accuracy Metrics - Default accuracy, EAD accuracy, Hosmer-Lemeshow test, Jeffreys test
- Discrimination Metrics - AUC (Area Under Curve), Gini coefficient, Kolmogorov-Smirnov test, F1 score, F2 score
- Normality Testing - Shapiro-Wilk test for distribution normality assessment
- Summary Statistics - Mean, median calculations
Complete Function Reference¶
metrics ¶
Metrics package - Public helper functions for statistical calculations.
This package provides the main public interface for calculating statistical metrics. All metric classes are internal implementations and should not be used directly.
Example usage
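A minimal sketch of the intended pattern. The import path is an assumption based on the package layout shown later in this page (helpers are shown as importable from the `metrics` package); check your installation for the exact path.

```python
import polars as pl
from tnp_statistic_library.metrics import auc  # import path is an assumption

df = pl.DataFrame({
    "probability": [0.1, 0.4, 0.35, 0.8],
    "default_flag": [0, 0, 1, 1],
})
result = auc(
    name="model_auc",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
)
```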
binomial_test ¶
binomial_test(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate binomial test for record-level or summary-level data.
The binomial test is used to test whether an observed proportion of defaults significantly differs from an expected probability under the null hypothesis.
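For intuition, the two-tailed p-value can be sketched in plain Python: sum the probabilities of all outcomes no more likely than the observed one. This is a minimal illustration of the idea, not necessarily the library's exact implementation.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defaults in n observations."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k_obs: int, n: int, p: float) -> float:
    """Two-tailed p-value: total probability of outcomes no more likely than k_obs."""
    p_obs = binom_pmf(k_obs, n, p)
    return sum(
        binom_pmf(k, n, p)
        for k in range(n + 1)
        if binom_pmf(k, n, p) <= p_obs * (1 + 1e-9)  # tolerance for float comparison
    )


# 8 defaults observed in 100 observations against an expected rate of 5%
p_value = binom_two_sided_p(8, 100, 0.05)
```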
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name identifier for the metric calculation. | required |
| `dataset` | `LazyFrame \| DataFrame` | The input data as a Polars LazyFrame or DataFrame. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data, either "record_level" or "summary_level". | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. | `{}` |

Record-level format kwargs:

- `default`: Column name containing binary default indicators (0/1 or boolean).
- `expected_probability`: Expected probability of default under the null hypothesis (0.0-1.0).
- `segment`: Optional list of column names to group by for segmented analysis.

Summary-level format kwargs:

- `volume`: Column name containing the total number of observations.
- `defaults`: Column name containing the number of defaults.
- `expected_probability`: Expected probability of default under the null hypothesis (0.0-1.0).
- `segment`: Optional list of column names to group by for segmented analysis.

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A `pl.DataFrame` of binomial test results with columns `group_key` (struct of segment columns), `volume` (total observations), `defaults` (observed defaults), `observed_probability` (observed default rate), `expected_probability` (expected rate under the null hypothesis), and `p_value` (two-tailed). |

Examples:

Record-level data:

```python
binomial_test(
    name="default_rate_test",
    dataset=data,
    data_format="record_level",
    default="default_flag",
    expected_probability=0.05
)
```
Summary-level data:
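Based on the summary-level kwargs listed above, a call might look like this (column names are illustrative placeholders):

```python
binomial_test(
    name="default_rate_test",
    dataset=summary_data,
    data_format="summary_level",
    volume="volume",
    defaults="defaults",
    expected_probability=0.05
)
```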
default_accuracy ¶
default_accuracy(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate default accuracy for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the default accuracy on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing default accuracy metrics for each group. |

Examples:

Record-level usage:

```python
result = default_accuracy(
    name="model_accuracy",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Summary-level usage:
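Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = default_accuracy(
    name="model_accuracy",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume"
)
```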
ead_accuracy ¶
ead_accuracy(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
predicted_ead: str,
actual_ead: str,
**kwargs: Any,
) -> pl.DataFrame
Calculate EAD accuracy for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameter `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the EAD accuracy on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `predicted_ead` | `str` | Column containing predicted EAD values. | required |
| `actual_ead` | `str` | Column containing actual EAD values. | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `default` (str), `segment` (optional). For summary_level: `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing EAD accuracy metrics for each group. |

Examples:

Record-level usage:

```python
result = ead_accuracy(
    name="ead_model_accuracy",
    dataset=df,
    data_format="record_level",
    predicted_ead="predicted_ead",
    actual_ead="actual_ead",
    default="default_flag"
)
```
Summary-level usage:
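Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = ead_accuracy(
    name="ead_model_accuracy",
    dataset=summary_df,
    data_format="summary_level",
    predicted_ead="predicted_ead",
    actual_ead="actual_ead",
    defaults="defaults",
    volume="volume"
)
```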
hosmer_lemeshow ¶
hosmer_lemeshow(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Hosmer-Lemeshow metric for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the Hosmer-Lemeshow test on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `bands` (int, default=10), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `bands` (int, default=10), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the Hosmer-Lemeshow test result and associated metadata. |

Examples:

Record-level usage:

```python
result = hosmer_lemeshow(
    name="hl_test",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    bands=10
)
```
Summary-level usage:
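Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = hosmer_lemeshow(
    name="hl_test",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume",
    bands=10
)
```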
jeffreys_test ¶
jeffreys_test(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Jeffreys test metric for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the Jeffreys test on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the Jeffreys test result and associated metadata. |

Examples:

Record-level usage:

```python
result = jeffreys_test(
    name="jeffreys_test",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Summary-level usage:
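Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = jeffreys_test(
    name="jeffreys_test",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume"
)
```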
mape ¶
mape(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate Mean Absolute Percentage Error (MAPE) for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `observed`, `predicted`.
Summary-level usage (`data_format="summary_level"`): required parameters `volume`, `sum_absolute_percentage_errors`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the MAPE on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `observed` (str), `predicted` (str), `segment` (optional). For summary_level: `volume` (str), `sum_absolute_percentage_errors` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing MAPE metrics for each group. |

Examples:

Record-level usage:

```python
result = mape(
    name="model_mape",
    dataset=df,
    data_format="record_level",
    observed="observed_values",
    predicted="predicted_values"
)
```
Summary-level usage:
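Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = mape(
    name="model_mape",
    dataset=summary_df,
    data_format="summary_level",
    volume="volume",
    sum_absolute_percentage_errors="sum_abs_pct_errors"
)
```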
rmse ¶
rmse(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate Root Mean Squared Error (RMSE) for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `observed`, `predicted`.
Summary-level usage (`data_format="summary_level"`): required parameters `volume`, `sum_squared_errors`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the RMSE on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `observed` (str), `predicted` (str), `segment` (optional). For summary_level: `volume` (str), `sum_squared_errors` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing RMSE metrics for each group. |

Examples:

Record-level usage:

```python
result = rmse(
    name="model_rmse",
    dataset=df,
    data_format="record_level",
    observed="observed_values",
    predicted="predicted_values"
)
```
Summary-level usage:
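Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = rmse(
    name="model_rmse",
    dataset=summary_df,
    data_format="summary_level",
    volume="volume",
    sum_squared_errors="sum_squared_errors"
)
```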
ttest ¶
ttest(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate T-test statistics for record-level or summary-level data.
Performs a one-sample t-test to determine if the mean difference between observed and predicted values is significantly different from a null hypothesis mean.
Record-level usage (`data_format="record_level"`): required parameters `observed`, `predicted`; optional `null_hypothesis_mean` (default: 0.0).
Summary-level usage (`data_format="summary_level"`): required parameters `volume`, `sum_differences`, `sum_squared_differences`; optional `null_hypothesis_mean` (default: 0.0).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the T-test on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `observed` (str), `predicted` (str), `null_hypothesis_mean` (float), `segment` (optional). For summary_level: `volume` (str), `sum_differences` (str), `sum_squared_differences` (str), `null_hypothesis_mean` (float), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing T-test statistics for each group. |

Examples:

Record-level usage:

```python
result = ttest(
    name="model_ttest",
    dataset=df,
    data_format="record_level",
    observed="observed_values",
    predicted="predicted_values"
)
```
Summary-level usage:
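Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = ttest(
    name="model_ttest",
    dataset=summary_df,
    data_format="summary_level",
    volume="volume",
    sum_differences="sum_differences",
    sum_squared_differences="sum_squared_differences",
    null_hypothesis_mean=0.0
)
```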
auc ¶
auc(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Area Under the ROC Curve (AUC) for record-level or summary-level data.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the AUC on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the AUC result and associated metadata. |

Examples:

Record-level usage:

```python
result = auc(
    name="model_auc",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Summary-level usage:
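Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = auc(
    name="model_auc",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume"
)
```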
f1_score ¶
f1_score(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the F1 score for record-level or summary-level data.
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of classification performance.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`; optional `threshold` (default 0.5).
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`; optional `threshold` (default 0.5).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the F1 score on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `threshold` (float), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `threshold` (float), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the F1 score and associated metrics for each group. |

Examples:

Record-level usage:

```python
result = f1_score(
    name="model_f1",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    threshold=0.6
)
```
Summary-level usage:
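Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = f1_score(
    name="model_f1",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume",
    threshold=0.6
)
```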
f2_score ¶
f2_score(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the F2 score for record-level or summary-level data.
The F2 score weights recall higher than precision, making it suitable for scenarios where missing positive cases (false negatives) is more costly than false positives.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`; optional `threshold` (default 0.5).
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`; optional `threshold` (default 0.5).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the F2 score on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `threshold` (float), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `threshold` (float), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the F2 score and associated metrics for each group. |

Examples:

Record-level usage:

```python
result = f2_score(
    name="model_f2",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    threshold=0.3
)
```
Summary-level usage:
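Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = f2_score(
    name="model_f2",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume",
    threshold=0.3
)
```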
gini ¶
gini(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Gini coefficient for record-level or summary-level data.
The Gini coefficient is calculated as `2*AUC - 1`, where AUC is the Area Under the ROC Curve. It ranges from -1 to 1, where:
- 1 indicates perfect discrimination
- 0 indicates no discrimination (random)
- -1 indicates perfectly inverse discrimination
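The 2*AUC - 1 relationship can be checked with a tiny pairwise sketch (illustrative only; it is not how the library computes AUC internally):

```python
def auc_from_pairs(scores: list[float], labels: list[int]) -> float:
    """AUC as the probability that a random defaulter is scored above a
    random non-defaulter, with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
auc_value = auc_from_pairs(scores, labels)
gini_value = 2 * auc_value - 1  # Gini derived from AUC
```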
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the Gini coefficient on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the Gini coefficient result and associated metadata. |

Examples:

Record-level usage:

```python
result = gini(
    name="model_gini",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Summary-level usage:
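Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = gini(
    name="model_gini",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume"
)
```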
kolmogorov_smirnov ¶
kolmogorov_smirnov(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Kolmogorov-Smirnov statistic for record-level or summary-level data.
The Kolmogorov-Smirnov statistic measures the maximum difference between the cumulative distribution functions of predicted scores for defaulters vs non-defaulters. It ranges from 0 to 1, where higher values indicate better discrimination.
Record-level usage (`data_format="record_level"`): required parameters `prob_def`, `default`.
Summary-level usage (`data_format="summary_level"`): required parameters `mean_pd`, `defaults`, `volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the KS statistic on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `prob_def` (str), `default` (str), `segment` (optional). For summary_level: `mean_pd` (str), `defaults` (str), `volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the KS statistic, p-value, and associated metadata. |

Examples:

Record-level usage:

```python
result = kolmogorov_smirnov(
    name="model_ks",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Summary-level usage:
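Mirroring the record-level example, with illustrative column names for the required summary-level parameters:

```python
result = kolmogorov_smirnov(
    name="model_ks",
    dataset=summary_df,
    data_format="summary_level",
    mean_pd="mean_pd",
    defaults="defaults",
    volume="volume"
)
```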
shapiro_wilk ¶
shapiro_wilk(
*,
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
segment: SegmentCol = None,
**kwargs,
) -> ShapiroWilk
Compute the Shapiro-Wilk test for normality.
The Shapiro-Wilk test is a statistical test to assess whether a dataset follows a normal distribution. It is considered one of the most powerful normality tests, especially for small to medium sample sizes.
The test returns:
- `statistic`: The test statistic (W), ranges from 0 to 1
- `p_value`: The p-value for the test
- `volume`: The number of observations used in the test

Interpretation guidelines:
- The null hypothesis (H0) assumes the data follows a normal distribution
- The alternative hypothesis (H1) assumes the data does not follow a normal distribution
- Compare `p_value` to your chosen significance level (alpha):
  - If `p_value` < alpha: evidence against normality (reject H0)
  - If `p_value` >= alpha: insufficient evidence against normality (fail to reject H0)
- Common alpha values: 0.05 (5%), 0.01 (1%), or 0.10 (10%)

Limitations:
- Requires at least 3 observations
- Maximum sample size is 5000 (scipy limitation)
- Sensitive to outliers and ties in the data
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name identifier for this metric instance. | required |
| `dataset` | `LazyFrame \| DataFrame` | The input dataset as either a LazyFrame or DataFrame. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | The format of the input data. | required |
| `segment` | `SegmentCol` | Optional list of column names to use for segmentation/grouping. | `None` |
| `**kwargs` | | Additional arguments based on `data_format`. | `{}` |

Record-level format args:

- `data_column`: The column containing the data to test for normality.

Summary-level format args:

- `volume`: The column containing the count of observations.
- `statistic`: The column containing pre-computed Shapiro-Wilk statistics.
- `p_value`: The column containing pre-computed p-values.

Returns:

| Type | Description |
|---|---|
| `ShapiroWilk` | A ShapiroWilk metric instance ready for computation. |
Examples:
Record-level usage:
```python
>>> import polars as pl
>>> from tnp_statistic_library.metrics.normality import shapiro_wilk
>>>
>>> # Create sample data
>>> df = pl.DataFrame({
...     "values": [1.2, 1.1, 1.3, 1.0, 1.4, 1.2, 1.1, 1.5, 1.3, 1.2],
...     "group": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
... })
>>>
>>> # Test normality for each group
>>> metric = shapiro_wilk(
...     name="data_normality",
...     dataset=df,
...     data_format="record_level",
...     data_column="values",
...     segment=["group"]
... )
>>> result = metric.run_metric().collect()
```
Summary-level usage:
```python
>>> df_summary = pl.DataFrame({
...     "volume": [50, 45],
...     "statistic": [0.95, 0.92],
...     "p_value": [0.06, 0.03],
...     "region": ["North", "South"]
... })
>>>
>>> metric = shapiro_wilk(
...     name="regional_normality",
...     dataset=df_summary,
...     data_format="summary_level",
...     volume="volume",
...     statistic="statistic",
...     p_value="p_value",
...     segment=["region"]
... )
>>> result = metric.run_metric().collect()
```
population_stability_index ¶
population_stability_index(
name: str,
dataset: LazyFrame | DataFrame,
data_format: Literal["record_level", "summary_level"],
**kwargs: Any,
) -> pl.DataFrame
Calculate the Population Stability Index (PSI) for record-level or summary-level data.
The Population Stability Index measures the distributional stability of predicted probabilities or model scores over time by comparing the distribution across bands between baseline and current periods.
PSI formula: `PSI = Σ (Current% - Baseline%) * ln(Current% / Baseline%)`
PSI interpretation
- PSI < 0.1: Stable (no significant change)
- 0.1 ≤ PSI < 0.2: Moderate shift (monitor closely)
- PSI ≥ 0.2: Significant shift (investigate/retrain)
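The formula above can be sketched directly in plain Python (a minimal illustration that omits smoothing, so a zero count in either period would raise a math error):

```python
from math import log


def psi(baseline_counts: list[int], current_counts: list[int]) -> float:
    """PSI = sum over bands of (cur% - base%) * ln(cur% / base%)."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    total = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_pct = b / b_total  # baseline share of this band
        c_pct = c / c_total  # current share of this band
        total += (c_pct - b_pct) * log(c_pct / b_pct)
    return total


# Identical band distributions give a PSI of zero
print(psi([1000, 1500, 500], [2000, 3000, 1000]))  # 0.0
```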
Record-level usage (`data_format="record_level"`): required parameters `band_column`, `baseline_column`, `current_column`.
Summary-level usage (`data_format="summary_level"`): required parameters `band_column`, `baseline_volume`, `current_volume`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the PSI on. | required |
| `data_format` | `Literal["record_level", "summary_level"]` | Format of the input data ("record_level" or "summary_level"). | required |
| `laplace_smoothing` | `bool` | Whether to apply Laplace smoothing to avoid zero division errors. | required |
| `**kwargs` | `Any` | Additional keyword arguments specific to the data format. For record_level: `band_column` (str), `baseline_column` (str), `current_column` (str), `segment` (optional). For summary_level: `band_column` (str), `baseline_volume` (str), `current_volume` (str), `segment` (optional). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the PSI result and associated metadata. |
Examples:
Record-level usage with period indicators:

```python
# Combined dataset with period indicators
data = pl.DataFrame({
    "band": ["A", "A", "B", "B", "C", "C", "A", "A", "A", "B", "B", "C"],
    "is_baseline": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 1 = baseline
    "is_current": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # 1 = current
})
result = population_stability_index(
    name="model_stability_check",
    dataset=data,
    data_format="record_level",
    band_column="band",
    baseline_column="is_baseline",
    current_column="is_current"
)
```
Summary-level usage with explicit volumes:

```python
# Pre-aggregated data with volumes by band
summary_data = pl.DataFrame({
    "band": ["A", "B", "C"],
    "baseline_volume": [1000, 1500, 500],  # Development period volumes
    "current_volume": [1200, 1200, 600]  # Current period volumes
})
result = population_stability_index(
    name="summary_stability_check",
    dataset=summary_data,
    data_format="summary_level",
    band_column="band",
    baseline_volume="baseline_volume",
    current_volume="current_volume"
)
```
With segmentation and laplace smoothing:
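A sketch combining both options; the `laplace_smoothing=True` flag is shown as a boolean based on the parameter description above, and the `region` column is an illustrative segment:

```python
result = population_stability_index(
    name="stability_by_region",
    dataset=summary_data,
    data_format="summary_level",
    band_column="band",
    baseline_volume="baseline_volume",
    current_volume="current_volume",
    segment=["region"],
    laplace_smoothing=True
)
```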
mean ¶
mean(
name: str,
dataset: LazyFrame | DataFrame,
variable: str,
segment: list[str] | None = None,
) -> pl.DataFrame
Calculate the mean summary for the given dataset and parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the mean on. | required |
| `variable` | `str` | Column name for which to compute the mean. | required |
| `segment` | `list[str] \| None` | Segmentation groups for calculation. | `None` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the mean summary and associated metadata. |
median ¶
median(
name: str,
dataset: LazyFrame | DataFrame,
variable: str,
segment: list[str] | None = None,
) -> pl.DataFrame
Calculate the median summary for the given dataset and parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `dataset` | `LazyFrame \| DataFrame` | Dataset to compute the median on. | required |
| `variable` | `str` | Column name for which to compute the median. | required |
| `segment` | `list[str] \| None` | Segmentation groups for calculation. | `None` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing the median summary and associated metadata. |
Workflows Interface¶
For batch processing and YAML-driven configurations, use the workflows module:
- Workflows Interface - `load_configuration_from_yaml()` function for YAML-based metric execution
Data Formats¶
The library supports two main data formats to accommodate different analysis scenarios:
Record-Level Data¶
Each row represents an individual observation (customer, loan, transaction):
- Best for: Raw model outputs, individual predictions, detailed analysis
- Performance: Optimal for large datasets with Polars lazy evaluation
- Segmentation: Full flexibility for grouping and filtering
Example columns:
- `probability`: Individual probability of default (0.0-1.0)
- `default_flag`: Binary outcome (0/1 or boolean)
- `predicted_ead`: Individual predicted exposure at default
- `actual_ead`: Individual actual exposure at default
Summary-Level Data¶
Each row represents pre-aggregated statistics for a segment:
- Best for: Portfolio summaries, pre-computed statistics, reporting
- Performance: Fast calculations on aggregated data
- Segmentation: Limited to existing segment definitions
Example columns:
- `mean_pd`: Mean probability of default for the segment (0.0-1.0)
- `defaults`: Count of defaults in the segment (positive numbers or None for most metrics)
- `volume`: Total number of observations in the segment (positive numbers or None for most metrics)
Segmentation¶
All metrics support flexible segmentation through the segment parameter:
Basic Segmentation¶
```python
# Group by single column
result = default_accuracy(
    name="accuracy_by_region",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)
```
Multi-Level Segmentation¶
```python
# Group by multiple columns
result = mean(
    name="exposure_by_region_product",
    dataset=df,
    variable="exposure_amount",
    segment=["region", "product_type"]
)
```
Performance¶
Optimization Tips¶
- Use Summary-Level Data: pre-aggregated inputs mean far fewer rows to scan, so calculations are generally faster
- Lazy Evaluation: Datasets are processed efficiently with lazy evaluation
- Batch Operations: Workflows execute multiple metrics in parallel
- Memory Management: Large datasets are streamed rather than loaded entirely
Best Practices¶
```python
# Efficient: let Polars handle the optimization
result = default_accuracy(
    name="accuracy",
    dataset=large_df.lazy(),  # Use lazy frames for large data
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```

```python
# Less efficient: pre-filtering reduces optimization opportunities
filtered_df = large_df.filter(pl.col("region") == "North")
result = default_accuracy(
    name="accuracy",
    dataset=filtered_df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
```
Memory Considerations¶
- Large Datasets: Use `pl.scan_csv()` or similar scan functions
- Multiple Metrics: Use YAML workflows for batch processing
- Segmentation: Prefer single-pass segmentation over multiple separate calls