Getting Started

This tutorial will guide you through setting up and using the TNP Statistic Library with practical examples.

Installation

Standard Installation

Install the TNP Statistic Library using pip (assumes the package is available in your configured package index):

pip install tnp-statistic-library

Note: If pip is not available directly, you can use python -m pip install tnp-statistic-library instead.

Alternative Installation Methods

Installing from a Package Mirror

If your organization uses a private package mirror or you need to install from a specific package index:

# Install from a custom package index
pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library

# Install from a custom index with fallback to PyPI
pip install --extra-index-url https://your-package-mirror.com/simple/ tnp-statistic-library

# Install with specific trusted host (if using HTTP)
pip install --trusted-host your-package-mirror.com --index-url http://your-package-mirror.com/simple/ tnp-statistic-library

# Alternative: Use python -m pip if pip command is not available
# python -m pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library

Installing from a Wheel File

If you have downloaded a wheel (.whl) file, you can install it directly from the local path:

# Install from a local wheel file
pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

# Install from a wheel file with optional extras (quote the path so the shell does not expand the brackets)
pip install "path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl[all]"

# Force reinstall from wheel file
pip install --force-reinstall path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

# Alternative: Use python -m pip if pip command is not available
# python -m pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

Building Distribution Files (For Developers)

If you're a developer who needs to create distribution files for system administrators:

# Clone the repository
git clone <repository-url>
cd tnp_statistic_library

# Build the distribution packages
uv build
# This creates both wheel (.whl) and source (.tar.gz) files in the dist/ directory

# Alternative: using standard build tool
# python -m build

# Check the created files
ls dist/
# Should show: tnp_statistic_library-X.Y.Z-py3-none-any.whl and tnp-statistic-library-X.Y.Z.tar.gz

Adding to a Package Mirror

If you're a system administrator wanting to add this library to your organization's package mirror:

  1. Obtain the distribution files from your development team:

     • tnp_statistic_library-X.Y.Z-py3-none-any.whl (wheel file)

     • tnp-statistic-library-X.Y.Z.tar.gz (source distribution)

  2. For DevPI (a common Python package mirror):

# Upload to your DevPI index
devpi upload tnp_statistic_library-X.Y.Z-py3-none-any.whl
devpi upload tnp-statistic-library-X.Y.Z.tar.gz

  3. For Nexus Repository Manager:

     • Upload the wheel and source files through the Nexus web interface

     • Or use the REST API to programmatically upload packages

  4. For JFrog Artifactory:

# Using JFrog CLI
jf rt upload "tnp_statistic_library-*.whl" pypi-local/tnp-statistic-library/
jf rt upload "tnp-statistic-library-*.tar.gz" pypi-local/tnp-statistic-library/

  5. For simple file-based mirrors:

# Copy packages to your package server
scp tnp_statistic_library-*.whl user@package-server:/var/www/pypi/simple/tnp-statistic-library/
scp tnp-statistic-library-*.tar.gz user@package-server:/var/www/pypi/simple/tnp-statistic-library/

Direct Wheel Distribution

For organizations that prefer to distribute wheel files directly without a package mirror:

  1. Obtain the wheel file from your development team:

     • tnp_statistic_library-X.Y.Z-py3-none-any.whl

  2. Distribute the wheel file:

# Share via internal file server, email, or artifact repository
# Users can then install using:
# pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl

  3. For CI/CD pipelines (see the sketch after this list):

# Store as a build artifact for download
# Or publish to internal artifact storage (AWS S3, Azure Blob, etc.)
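
If your CI/CD job publishes the wheel to AWS S3, a minimal sketch using boto3 might look like the following. boto3 is not a dependency of this library, and the bucket and key names are hypothetical placeholders for your internal storage:

# Minimal sketch: upload a built wheel to internal S3 storage
# (assumes boto3 is installed and AWS credentials are already configured)
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="dist/tnp_statistic_library-X.Y.Z-py3-none-any.whl",  # local wheel path
    Bucket="internal-python-packages",                             # hypothetical bucket name
    Key="tnp-statistic-library/tnp_statistic_library-X.Y.Z-py3-none-any.whl",
)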

Verification

After installation, verify the library is working correctly:

from tnp_statistic_library.version import VERSION
print(f"TNP Statistic Library version: {VERSION}")

# Basic smoke test: if this import succeeds, the metrics module is available
from tnp_statistic_library.metrics import default_accuracy
print("Installation successful!")

Your First Example

Let's start with a complete example that demonstrates both approaches to using the library.

Creating Sample Data

First, we'll create a realistic financial dataset using Polars:

import polars as pl

# Create a sample portfolio dataset
df = pl.DataFrame({
    "customer_id": [f"CUST_{i:04d}" for i in range(1, 9)],
    "probability": [0.05, 0.15, 0.35, 0.60, 0.80, 0.25, 0.45, 0.10],
    "default_flag": [0, 0, 0, 1, 1, 0, 1, 0],
    "exposure_amount": [10000, 25000, 15000, 8000, 12000, 30000, 18000, 22000],
    "predicted_ead": [5000, 12500, 7500, 8000, 12000, 15000, 18000, 11000],
    "actual_ead": [4800, 13000, 7200, 7900, 11800, 14500, 17500, 10800],
    "region": ["North", "North", "South", "South", "East", "East", "West", "West"],
    "product": ["Loan", "Credit", "Loan", "Credit", "Loan", "Credit", "Loan", "Credit"]
})

print("Sample Dataset:")
print(df)

This creates a dataset with:

  • Probability: Model-predicted probability of default (0.0-1.0)
  • Default Flag: Actual default outcome (0=no default, 1=default)
  • Exposure Amounts: Financial exposure values
  • EAD Values: Predicted vs actual exposure at default
  • Segments: Region and product for group analysis
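
Before computing any metrics, you can sanity-check the sample with plain Polars. The aggregation below uses only standard Polars expressions (independent of this library) to show the default rate and total exposure per region; group_by is the spelling used in recent Polars releases.

# Quick sanity check using standard Polars (not part of the TNP Statistic Library)
summary = df.group_by("region").agg(
    pl.col("default_flag").mean().alias("default_rate"),
    pl.col("exposure_amount").sum().alias("total_exposure"),
)
print(summary)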

Data Format Compatibility

The TNP Statistic Library is designed to work flexibly with your existing data formats:

Default Indicator Columns

For metrics that require default indicators (accuracy, AUC, etc.), you can use either:

  • Numeric format: Traditional 0/1 values (0 = no default, 1 = default)
  • Boolean format: True/False values (False = no default, True = default)

# Both formats work seamlessly:

# Using traditional 0/1 format
df_numeric = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "default_flag": [0, 1, 0]  # Numeric indicators
})

# Using boolean format
df_boolean = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "is_default": [False, True, False]  # Boolean indicators
})

# Both work with all accuracy and discrimination metrics
from tnp_statistic_library.metrics import default_accuracy

# Numeric format
accuracy_numeric = default_accuracy(
    name="accuracy_test",
    dataset=df_numeric,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"  # 0/1 column
)

# Boolean format
accuracy_boolean = default_accuracy(
    name="accuracy_test",
    dataset=df_boolean,
    data_format="record_level",
    prob_def="probability",
    default="is_default"  # True/False column
)

Approach 1: Interactive Function Usage

Perfect for data exploration, Jupyter notebooks, and ad-hoc analysis:

Basic Accuracy Calculation

from tnp_statistic_library.metrics import default_accuracy

# Calculate overall model accuracy
accuracy_result = default_accuracy(
    name="model_validation",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)

print(f"Model accuracy: {accuracy_result}")

Segmented Analysis

# Calculate accuracy by region
regional_accuracy = default_accuracy(
    name="regional_accuracy",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)

print(f"Regional accuracy breakdown: {regional_accuracy}")

Multiple Metrics

from tnp_statistic_library.metrics import auc, mean, ead_accuracy

# Calculate discrimination power
auc_result = auc(
    name="discrimination_power",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)

# Calculate EAD accuracy
ead_result = ead_accuracy(
    name="ead_validation",
    dataset=df,
    data_format="record_level",
    predicted_ead="predicted_ead",
    actual_ead="actual_ead",
    default="default_flag"
)

# Calculate mean exposure by region
exposure_mean = mean(
    name="regional_exposure",
    dataset=df,
    variable="exposure_amount",
    segment=["region"]
)

print(f"AUC by product: {auc_result}")
print(f"EAD accuracy: {ead_result}")
print(f"Mean exposure by region: {exposure_mean}")

Approach 2: YAML Workflow Configuration

Ideal for production pipelines, standardized reporting, and batch processing:

Creating a Configuration File

Create a file called portfolio_metrics.yaml:

datasets:
  portfolio_data:
    location: "portfolio_data.csv"

metrics:
  # Model validation suite
  accuracy_validation:
    metric_type: default_accuracy
    config:
      name: ["overall_accuracy", "regional_accuracy"]
      segment: [null, ["region"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

  discrimination_analysis:
    metric_type: auc
    config:
      name: ["product_auc"]
      segment: [["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

  exposure_summary:
    metric_type: mean
    config:
      name: ["regional_exposure", "product_exposure"]
      segment: [["region"], ["product"]]
      dataset: "portfolio_data"
      variable: "exposure_amount"

  ead_validation:
    metric_type: ead_accuracy
    config:
      name: ["ead_accuracy"]
      dataset: "portfolio_data"
      data_format: "record_level"
      predicted_ead: "predicted_ead"
      actual_ead: "actual_ead"

Executing the Workflow

from tnp_statistic_library.workflows import load_configuration_from_yaml

# First, save your data
df.write_csv("portfolio_data.csv")

# Load and execute all metrics from YAML
config = load_configuration_from_yaml("portfolio_metrics.yaml")
results = config.metrics.collect_all()

# Convert results to DataFrame for analysis
results_df = results.to_dataframe()
print("All Metric Results:")
print(results_df)

# Access individual results
for metric_name, result in results.items():
    print(f"{metric_name}: {result}")

Understanding Fan-out Expansion

The YAML approach supports "fan-out expansion" where lists in configuration fields automatically generate multiple metrics:

metrics:
  multi_segment_analysis:
    metric_type: default_accuracy
    config:
      name: ["overall", "by_region", "by_product"]
      segment: [null, ["region"], ["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"

This single configuration generates three separate metrics:

  1. overall - No segmentation
  2. by_region - Segmented by region
  3. by_product - Segmented by product
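
Conceptually, this single configuration block is equivalent to three separate interactive calls. Here is a sketch using the function API shown earlier, with df loaded as in the first example:

# Equivalent interactive calls generated by the fan-out configuration above
overall = default_accuracy(
    name="overall",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)

by_region = default_accuracy(
    name="by_region",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)

by_product = default_accuracy(
    name="by_product",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)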

When to Use Each Approach

Use Interactive Functions When:

  • You are exploring data in Jupyter notebooks
  • You are performing ad-hoc analysis
  • You need immediate results and easy debugging
  • You want full IDE support and type safety
  • You are working with dynamic or changing requirements

Use YAML Workflows When:

  • You are building production pipelines
  • You need standardized, repeatable analysis
  • You are processing multiple datasets with the same metrics
  • You want to version-control your metric configurations
  • You are running batch jobs or scheduled reports
  • You need to generate many related metrics efficiently

Next Steps

  • Examples - See comprehensive examples for all metric types
  • API Reference - Explore all available functions and parameters
  • Workflows Guide - Deep dive into YAML configuration options

Common Patterns

Loading Data from Various Sources

# From CSV
df = pl.read_csv("data.csv")

# From Parquet
df = pl.read_parquet("data.parquet")

# From a database: use read_database_uri with a connection URI string
# (recent Polars versions expect a connection object for read_database)
df = pl.read_database_uri("SELECT * FROM portfolio", connection_uri)

# From existing pandas DataFrame
df = pl.from_pandas(pandas_df)

Error Handling

try:
    result = default_accuracy(
        name="test",
        dataset=df,
        data_format="record_level",
        prob_def="probability",
        default="default_flag"
    )
    print(f"Success: {result}")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Calculation error: {e}")

Working with Large Datasets

# Use lazy evaluation for memory efficiency
lazy_df = pl.scan_csv("large_dataset.csv")

# Metrics work with lazy DataFrames
result = default_accuracy(
    name="large_dataset_accuracy",
    dataset=lazy_df,
    data_format="record_level",
    prob_def="prob_column",
    default="default_column"
)
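
To reduce memory use further, you can prune to only the columns the metric needs while scanning. This uses the standard Polars lazy API with the column names from the example above:

# Select only the required columns while scanning lazily (standard Polars API)
lazy_subset = pl.scan_csv("large_dataset.csv").select(["prob_column", "default_column"])

result = default_accuracy(
    name="large_dataset_accuracy",
    dataset=lazy_subset,
    data_format="record_level",
    prob_def="prob_column",
    default="default_column"
)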