Getting Started

This tutorial will guide you through setting up and using the TNP Statistic Library with practical examples.

Installation

Standard Installation

Install the TNP Statistic Library using pip (assumes the package is available in your configured package index):

pip install tnp-statistic-library

Note: If pip is not available directly, you can use python -m pip install tnp-statistic-library instead.

Alternative Installation Methods

Installing from a Package Mirror

If your organization uses a private package mirror or you need to install from a specific package index:

# Install from a custom package index
pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library

# Install from a custom index with fallback to PyPI
pip install --extra-index-url https://your-package-mirror.com/simple/ tnp-statistic-library

# Install with specific trusted host (if using HTTP)
pip install --trusted-host your-package-mirror.com --index-url http://your-package-mirror.com/simple/ tnp-statistic-library

# Alternative: Use python -m pip if pip command is not available
# python -m pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library

Installing from a Wheel File

If you have downloaded a wheel (.whl) file, you can install it directly from the local path:

# Install from a local wheel file
pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

# Install from a wheel file with optional extras (quote the path so the shell does not expand the brackets)
pip install "path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl[all]"

# Force reinstall from wheel file
pip install --force-reinstall path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

# Alternative: Use python -m pip if pip command is not available
# python -m pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

Building Distribution Files (For Developers)

If you're a developer who needs to create distribution files for system administrators:

# Clone the repository
git clone <repository-url>
cd tnp_statistic_library

# Build the distribution packages
uv build
# This creates both wheel (.whl) and source (.tar.gz) files in the dist/ directory

# Alternative: using standard build tool
# python -m build

# Check the created files
ls dist/
# Should show: tnp_statistic_library-X.Y.Z-py3-none-any.whl and tnp-statistic-library-X.Y.Z.tar.gz

Adding to a Package Mirror

If you're a system administrator wanting to add this library to your organization's package mirror:

  1. Obtain the distribution files from your development team:

     • tnp_statistic_library-X.Y.Z-py3-none-any.whl (wheel file)

     • tnp-statistic-library-X.Y.Z.tar.gz (source distribution)

  2. For DevPI (a common Python package mirror):

# Upload to your DevPI index
devpi upload tnp_statistic_library-X.Y.Z-py3-none-any.whl
devpi upload tnp-statistic-library-X.Y.Z.tar.gz

  3. For Nexus Repository Manager:

     • Upload the wheel and source files through the Nexus web interface

     • Or use the REST API to programmatically upload packages

  4. For JFrog Artifactory:

# Using JFrog CLI
jf rt upload "tnp_statistic_library-*.whl" pypi-local/tnp-statistic-library/
jf rt upload "tnp-statistic-library-*.tar.gz" pypi-local/tnp-statistic-library/

  5. For simple file-based mirrors:

# Copy packages to your package server
scp tnp_statistic_library-*.whl user@package-server:/var/www/pypi/simple/tnp-statistic-library/
scp tnp-statistic-library-*.tar.gz user@package-server:/var/www/pypi/simple/tnp-statistic-library/

Direct Wheel Distribution

For organizations that prefer to distribute wheel files directly without a package mirror:

  1. Obtain the wheel file from your development team:

     • tnp_statistic_library-X.Y.Z-py3-none-any.whl

  2. Distribute the wheel file:

# Share via internal file server, email, or artifact repository
# Users can then install using:
# pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

  3. For CI/CD pipelines (see the sketch after this list):

# Store as a build artifact for download
# Or publish to internal artifact storage (AWS S3, Azure Blob, etc.)
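
If your CI/CD job publishes the wheel to AWS S3, a minimal sketch using boto3 might look like the following. boto3 is not a dependency of this library, and the bucket and key names are hypothetical placeholders for your internal storage:

# Minimal sketch: upload a built wheel to internal S3 storage
# (assumes boto3 is installed and AWS credentials are already configured)
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="dist/tnp_statistic_library-X.Y.Z-py3-none-any.whl",  # local wheel path
    Bucket="internal-python-packages",                             # hypothetical bucket name
    Key="tnp-statistic-library/tnp_statistic_library-X.Y.Z-py3-none-any.whl",
)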

Verification

After installation, verify the library is working correctly:

from tnp_statistic_library.version import VERSION
print(f"TNP Statistic Library version: {VERSION}")

# Basic smoke test: if this import succeeds, the metrics module is available
from tnp_statistic_library.metrics import default_accuracy
print("Installation successful!")

Your First Example

Let's start with a complete example that demonstrates both approaches to using the library.

Creating Sample Data

First, we'll create a realistic financial dataset using Polars:

import polars as pl

# Create a sample portfolio dataset
df = pl.DataFrame({
    "customer_id": [f"CUST_{i:04d}" for i in range(1, 9)],
    "probability": [0.05, 0.15, 0.35, 0.60, 0.80, 0.25, 0.45, 0.10],
    "default_flag": [0, 0, 0, 1, 1, 0, 1, 0],
    "exposure_amount": [10000, 25000, 15000, 8000, 12000, 30000, 18000, 22000],
    "predicted_ead": [5000, 12500, 7500, 8000, 12000, 15000, 18000, 11000],
    "actual_ead": [4800, 13000, 7200, 7900, 11800, 14500, 17500, 10800],
    "region": ["North", "North", "South", "South", "East", "East", "West", "West"],
    "product": ["Loan", "Credit", "Loan", "Credit", "Loan", "Credit", "Loan", "Credit"]
})

print("Sample Dataset:")
print(df)

This creates a dataset with:

  • Probability: Model-predicted probability of default (0.0-1.0)
  • Default Flag: Actual default outcome (0=no default, 1=default)
  • Exposure Amounts: Financial exposure values
  • EAD Values: Predicted vs actual exposure at default
  • Segments: Region and product for group analysis
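
Before computing any metrics, you can sanity-check the sample with plain Polars. The aggregation below uses only standard Polars expressions (independent of this library) to show the default rate and total exposure per region; group_by is the spelling used in recent Polars releases.

# Quick sanity check using standard Polars (not part of the TNP Statistic Library)
summary = df.group_by("region").agg(
    pl.col("default_flag").mean().alias("default_rate"),
    pl.col("exposure_amount").sum().alias("total_exposure"),
)
print(summary)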

Data Format Compatibility

The TNP Statistic Library is designed to work flexibly with your existing data formats:

Default Indicator Columns

For metrics that require default indicators (accuracy, AUC, etc.), you can use either:

  • Numeric format: Traditional 0/1 values (0 = no default, 1 = default)
  • Boolean format: True/False values (False = no default, True = default)

# Both formats work seamlessly:

# Using traditional 0/1 format
df_numeric = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "default_flag": [0, 1, 0]  # Numeric indicators
})

# Using boolean format
df_boolean = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "is_default": [False, True, False]  # Boolean indicators
})

# Both work with all accuracy and discrimination metrics
from tnp_statistic_library.metrics import default_accuracy

# Numeric format
accuracy_numeric = default_accuracy(
    name="accuracy_test",
    dataset=df_numeric,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"  # 0/1 column
)

# Boolean format
accuracy_boolean = default_accuracy(
    name="accuracy_test",
    dataset=df_boolean,
    data_format="record_level",
    prob_def="probability",
    default="is_default"  # True/False column
)

Approach 1: Interactive Function Usage

Perfect for data exploration, Jupyter notebooks, and ad-hoc analysis:

Basic Accuracy Calculation

from tnp_statistic_library.metrics import default_accuracy

# Calculate overall model accuracy
accuracy_result = default_accuracy(
    name="model_validation",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)

print(f"Model accuracy: {accuracy_result}")

Segmented Analysis

# Calculate accuracy by region
regional_accuracy = default_accuracy(
    name="regional_accuracy",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)

print(f"Regional accuracy breakdown: {regional_accuracy}")

Multiple Metrics

from tnp_statistic_library.metrics import auc, mean, ead_accuracy

# Calculate discrimination power
auc_result = auc(
    name="discrimination_power",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)

# Calculate EAD accuracy
ead_result = ead_accuracy(
    name="ead_validation",
    dataset=df,
    data_format="record_level",
    predicted_ead="predicted_ead",
    actual_ead="actual_ead",
    default="default_flag"
)

# Calculate mean exposure by region
exposure_mean = mean(
    name="regional_exposure",
    dataset=df,
    variable="exposure_amount",
    segment=["region"]
)

print(f"AUC by product: {auc_result}")
print(f"EAD accuracy: {ead_result}")
print(f"Mean exposure by region: {exposure_mean}")

Approach 2: YAML Workflow Configuration

Ideal for production pipelines, standardized reporting, and batch processing:

Creating a Configuration File

Create a file called portfolio_metrics.yaml:

datasets:
  portfolio_data:
    location: "portfolio_data.csv"

metrics:
  # Model validation suite
  accuracy_validation:
    metric_type: default_accuracy
    config:
      name: ["overall_accuracy", "regional_accuracy"]
      segment: [null, ["region"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

  discrimination_analysis:
    metric_type: auc
    config:
      name: ["product_auc"]
      segment: [["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

  exposure_summary:
    metric_type: mean
    config:
      name: ["regional_exposure", "product_exposure"]
      segment: [["region"], ["product"]]
      dataset: "portfolio_data"
      variable: "exposure_amount"

  ead_validation:
    metric_type: ead_accuracy
    config:
      name: ["ead_accuracy"]
      dataset: "portfolio_data"
      data_format: "record_level"
      predicted_ead: "predicted_ead"
      actual_ead: "actual_ead"

Executing the Workflow

from tnp_statistic_library.workflows import load_configuration_from_yaml

# First, save your data
df.write_csv("portfolio_data.csv")

# Load and execute all metrics from YAML
config = load_configuration_from_yaml("portfolio_metrics.yaml")
results = config.metrics.collect_all()

# Convert results to DataFrame for analysis
results_df = results.to_dataframe()
print("All Metric Results:")
print(results_df)

# Access individual results
for metric_name, result in results.items():
    print(f"{metric_name}: {result}")

Understanding Fan-out Expansion

The YAML approach supports "fan-out expansion" where lists in configuration fields automatically generate multiple metrics:

metrics:
  multi_segment_analysis:
    metric_type: default_accuracy
    config:
      name: ["overall", "by_region", "by_product"]
      segment: [null, ["region"], ["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

This single configuration generates three separate metrics:

  1. overall - No segmentation
  2. by_region - Segmented by region
  3. by_product - Segmented by product
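
Conceptually, this single configuration block is equivalent to three separate interactive calls. Here is a sketch using the function API shown earlier, with df loaded as in the first example:

# Equivalent interactive calls generated by the fan-out configuration above
overall = default_accuracy(
    name="overall",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)

by_region = default_accuracy(
    name="by_region",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)

by_product = default_accuracy(
    name="by_product",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)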

When to Use Each Approach

Use Interactive Functions When:

  • You are exploring data in Jupyter notebooks
  • You are performing ad-hoc analysis
  • You need immediate results and easy debugging
  • You want full IDE support and type safety
  • You are working with dynamic or changing requirements

Use YAML Workflows When:

  • You are building production pipelines
  • You need standardized, repeatable analysis
  • You are processing multiple datasets with the same metrics
  • You want to version-control your metric configurations
  • You are running batch jobs or scheduled reports
  • You need to generate many related metrics efficiently

Next Steps

  • Examples - See comprehensive examples for all metric types
  • API Reference - Explore all available functions and parameters
  • Workflows Guide - Deep dive into YAML configuration options

Common Patterns

Loading Data from Various Sources

# From CSV
df = pl.read_csv("data.csv")

# From Parquet
df = pl.read_parquet("data.parquet")

# From a database: use read_database_uri with a connection URI string
# (recent Polars versions expect a connection object for read_database)
df = pl.read_database_uri("SELECT * FROM portfolio", connection_uri)

# From existing pandas DataFrame
df = pl.from_pandas(pandas_df)

Error Handling

try:
    result = default_accuracy(
        name="test",
        dataset=df,
        data_format="record_level",
        prob_def="probability",
        default="default_flag"
    )
    print(f"Success: {result}")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Calculation error: {e}")

Working with Large Datasets

# Use lazy evaluation for memory efficiency
lazy_df = pl.scan_csv("large_dataset.csv")

# Metrics work with lazy DataFrames
result = default_accuracy(
    name="large_dataset_accuracy",
    dataset=lazy_df,
    data_format="record_level",
    prob_def="prob_column",
    default="default_column"
)
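
To reduce memory use further, you can prune to only the columns the metric needs while scanning. This uses the standard Polars lazy API with the column names from the example above:

# Select only the required columns while scanning lazily (standard Polars API)
lazy_subset = pl.scan_csv("large_dataset.csv").select(["prob_column", "default_column"])

result = default_accuracy(
    name="large_dataset_accuracy",
    dataset=lazy_subset,
    data_format="record_level",
    prob_def="prob_column",
    default="default_column"
)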