Public API Reference

This section documents the public interface for the TNP Statistic Library.

Helper Functions Interface

The primary way to use the library is through the convenient helper functions that provide a simple, type-safe interface for calculating statistical metrics.

All functions support both record-level and summary-level data formats with automatic validation and optimization.

Complete Function Reference

metrics

Metrics package - Public helper functions for statistical calculations.

This package provides the main public interface for calculating statistical metrics. All metric classes are internal implementations and should not be used directly.

Example usage
from tnp_statistic_library.metrics import default_accuracy, mean, median

result = default_accuracy(
    data=df, data_format="record",
    prob_def="prob", default="default"
)

describe

describe(
    metric_type: str, data_format: str | None = None
) -> dict[str, Any]

Describe a metric's inputs and outputs.

Parameters:

  • metric_type (str, required): The metric type name (e.g., "auc").
  • data_format (str | None, default None): Optional format filter ("record" or "summary").

Returns:

  • dict[str, Any]: Dict with metric_type, data_formats, and per-format metadata.
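
As a sketch, assuming the returned keys mirror the names above:

from tnp_statistic_library.metrics import describe

# Restrict the description to the "record" format only.
auc_record = describe("auc", data_format="record")

print(auc_record["metric_type"])   # "auc"
print(auc_record["data_formats"])  # formats remaining after the filter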

list_metrics

list_metrics() -> dict[str, list[str]]

Return metric types mapped to supported data formats.

Discoverability

Use the helper functions to explore available metrics and their inputs/outputs:

from tnp_statistic_library.metrics import describe, list_metrics

available = list_metrics()
auc_details = describe("auc")
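
Because list_metrics() returns a plain dict[str, list[str]], the result can be iterated directly, for example:

for metric_name, formats in available.items():
    print(metric_name, formats)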

Error Handling

Configuration resolution errors expose typed exceptions:

from tnp_statistic_library.errors import InvalidDataFormatError, UnknownMetricTypeError

try:
    describe("mean", data_format="summary")
except InvalidDataFormatError as exc:
    print(exc)
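
An unrecognized metric name is reported the same way. The sketch below assumes describe() raises UnknownMetricTypeError for names not returned by list_metrics(); the metric name used is intentionally invalid:

try:
    describe("not_a_real_metric")
except UnknownMetricTypeError as exc:
    print(exc)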

Recipes Interface

For batch processing and YAML-driven configurations, use the recipes module:

  • Recipes Interface - load_configuration_from_yaml() function for YAML-based metric execution
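
As a minimal sketch, assuming load_configuration_from_yaml() is imported from the recipes module and takes a path to a YAML file (the file name below is hypothetical; see the Recipes Interface page for the supported schema):

from tnp_statistic_library.recipes import load_configuration_from_yaml

# "portfolio_metrics.yaml" is a placeholder path, not a shipped example.
config = load_configuration_from_yaml("portfolio_metrics.yaml")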

Data Formats

The library supports two main data formats to accommodate different analysis scenarios:

Record-Level Data

Each row represents an individual observation (customer, loan, transaction):

  • Best for: Raw model outputs, individual predictions, detailed analysis
  • Performance: Optimal for large datasets with Polars lazy evaluation
  • Segmentation: Full flexibility for grouping and filtering

Example columns:

  • probability: Individual probability of default (0.0-1.0)
  • default_flag: Binary outcome (0/1 or boolean)
  • predicted_ead: Individual predicted exposure at default
  • actual_ead: Individual actual exposure at default
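
A minimal record-level frame with these columns might look like the sketch below; the values are illustrative only:

import polars as pl
from tnp_statistic_library.metrics import default_accuracy

# One row per loan; column names follow the examples above.
records = pl.DataFrame({
    "probability": [0.02, 0.15, 0.40],
    "default_flag": [0, 0, 1],
    "predicted_ead": [10_000.0, 25_000.0, 5_000.0],
    "actual_ead": [9_800.0, 26_500.0, 4_900.0],
})

result = default_accuracy(
    data=records,
    data_format="record",
    prob_def="probability",
    default="default_flag",
)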

Summary-Level Data

Each row represents pre-aggregated statistics for a segment:

  • Best for: Portfolio summaries, pre-computed statistics, reporting
  • Performance: Fast calculations on aggregated data
  • Segmentation: Limited to existing segment definitions

Example columns:

  • mean_pd: Mean probability of default for the segment (0.0-1.0)
  • defaults: Count of defaults in the segment (positive numbers or None for most metrics)
  • volume: Total number of observations in the segment (positive numbers or None for most metrics)
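
A summary-level frame with these columns might be constructed as in the sketch below; the values are illustrative, and describe() with data_format="summary" will confirm which columns each metric actually expects:

import polars as pl

# One row per pre-aggregated segment.
summary = pl.DataFrame({
    "segment": ["North", "South"],
    "mean_pd": [0.03, 0.05],
    "defaults": [120, 340],
    "volume": [4_000, 6_800],
})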

Segmentation

All metrics support flexible segmentation through the segment parameter:

Basic Segmentation

# Group by single column
result = default_accuracy(
    data=df,
    data_format="record",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)

Multi-Level Segmentation

# Group by multiple columns
result = mean(
    data=df,
    variable="exposure_amount",
    segment=["region", "product_type"]
)

Performance

Optimization Tips

  1. Use Summary-Level Data: Pre-aggregated inputs contain far fewer rows, so calculations generally run faster
  2. Lazy Evaluation: Pass LazyFrames so Polars can defer work and optimize the whole query plan
  3. Batch Operations: Recipes execute multiple metrics in parallel
  4. Memory Management: Large datasets are streamed rather than loaded entirely

Best Practices

# Efficient: Let Polars handle the optimization
result = default_accuracy(
    data=large_df.lazy(),  # Use lazy frames for large data
    data_format="record",
    prob_def="probability",
    default="default_flag"
)

# Less efficient: Pre-filtering reduces optimization opportunities
filtered_df = large_df.filter(pl.col("region") == "North")
result = default_accuracy(
    data=filtered_df,
    data_format="record",
    prob_def="probability",
    default="default_flag"
)

Memory Considerations

  • Large Datasets: Use pl.scan_csv() or similar scan functions
  • Multiple Metrics: Use YAML recipes for batch processing
  • Segmentation: Prefer single-pass segmentation over multiple separate calls
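
For example, scanning a large CSV keeps the pipeline lazy end to end (the file path below is hypothetical):

import polars as pl
from tnp_statistic_library.metrics import default_accuracy

# pl.scan_csv returns a LazyFrame, so rows are only materialized when the
# metric is computed.
lazy_records = pl.scan_csv("loan_records.csv")

result = default_accuracy(
    data=lazy_records,
    data_format="record",
    prob_def="probability",
    default="default_flag",
)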