Getting Started¶
This tutorial will guide you through setting up and using the TNP Statistic Library with practical examples.
Installation¶
Standard Installation¶
Install the TNP Statistic Library using pip (assumes the package is available in your configured package index):
pip install tnp-statistic-library
Note: If pip is not available directly, you can use python -m pip install tnp-statistic-library instead.
Alternative Installation Methods¶
Installing from a Package Mirror¶
If your organization uses a private package mirror or you need to install from a specific package index:
# Install from a custom package index
pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library
# Install from a custom index with fallback to PyPI
pip install --extra-index-url https://your-package-mirror.com/simple/ tnp-statistic-library
# Install with specific trusted host (if using HTTP)
pip install --trusted-host your-package-mirror.com --index-url http://your-package-mirror.com/simple/ tnp-statistic-library
# Alternative: Use python -m pip if pip command is not available
# python -m pip install --index-url https://your-package-mirror.com/simple/ tnp-statistic-library
Installing from a Wheel File¶
If you have downloaded a wheel (.whl) file or need to install from a local wheel:
# Install from a local wheel file
pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl
# Install from a wheel file with optional extras (if the package defines an "all" extra)
pip install "path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl[all]"
# Force reinstall from wheel file
pip install --force-reinstall path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl
# Alternative: Use python -m pip if pip command is not available
# python -m pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl
Building Distribution Files (For Developers)¶
If you're a developer who needs to create distribution files for system administrators:
# Clone the repository
git clone <repository-url>
cd tnp_statistic_library
# Build the distribution packages
uv build
# This creates both wheel (.whl) and source (.tar.gz) files in the dist/ directory
# Alternative: using standard build tool
# python -m build
# Check the created files
ls dist/
# Should show: tnp_statistic_library-X.Y.Z-py3-none-any.whl and tnp-statistic-library-X.Y.Z.tar.gz
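Optionally, you can sanity-check the built distributions before handing them to administrators; this assumes the twine tool is installed in your environment:
# Validate the metadata of the built wheel and source distribution (requires twine)
python -m twine check dist/*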
Adding to a Package Mirror¶
If you're a system administrator wanting to add this library to your organization's package mirror:
- Obtain the distribution files from your development team:
  - tnp_statistic_library-X.Y.Z-py3-none-any.whl (wheel file)
  - tnp-statistic-library-X.Y.Z.tar.gz (source distribution)
- For DevPI (a common Python package mirror):
  # Upload to your DevPI index
  devpi upload tnp_statistic_library-X.Y.Z-py3-none-any.whl
  devpi upload tnp-statistic-library-X.Y.Z.tar.gz
- For Nexus Repository Manager:
  - Upload the wheel and source files through the Nexus web interface
  - Or use the REST API to programmatically upload packages
- For JFrog Artifactory:
  # Using JFrog CLI
  jf rt upload "tnp_statistic_library-*.whl" pypi-local/tnp-statistic-library/
  jf rt upload "tnp-statistic-library-*.tar.gz" pypi-local/tnp-statistic-library/
- For simple file-based mirrors: see the sketch below.
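A minimal sketch of the file-based option, assuming the distribution files are copied into a shared directory (the /srv/pypi-mirror/ path is illustrative):
# Copy the distribution files to a directory clients can reach (path is an example)
cp tnp_statistic_library-X.Y.Z-py3-none-any.whl tnp-statistic-library-X.Y.Z.tar.gz /srv/pypi-mirror/
# Clients then install directly from that directory instead of a package index
pip install --no-index --find-links /srv/pypi-mirror/ tnp-statistic-library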
Direct Wheel Distribution¶
For organizations that prefer to distribute wheel files directly without a package mirror:
- Obtain the wheel file from your development team:
  - tnp_statistic_library-X.Y.Z-py3-none-any.whl
- Distribute the wheel file:
  # Share via internal file server, email, or artifact repository
  # Users can then install using:
  # pip install path/to/tnp_statistic_library-X.Y.Z-py3-none-any.whl
- For CI/CD pipelines: see the sketch below.
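As a sketch, a CI/CD job can install the library directly from a wheel stored as a build artifact; the artifacts/ path below is an assumption, not a path the library prescribes:
# Inside a CI job, after the wheel has been fetched as a build artifact
pip install artifacts/tnp_statistic_library-X.Y.Z-py3-none-any.whl
# Optional smoke test so the pipeline fails early if the install is broken
python -c "from tnp_statistic_library.version import VERSION; print(VERSION)"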
Verification¶
After installation, verify the library is working correctly:
from tnp_statistic_library.version import VERSION
print(f"TNP Statistic Library version: {VERSION}")
# Test basic functionality
from tnp_statistic_library.metrics import default_accuracy
print("Installation successful!")
Your First Example¶
Let's start with a complete example that demonstrates both approaches to using the library.
Creating Sample Data¶
First, we'll create a realistic financial dataset using Polars:
import polars as pl
# Create a sample portfolio dataset
df = pl.DataFrame({
    "customer_id": [f"CUST_{i:04d}" for i in range(1, 9)],
    "probability": [0.05, 0.15, 0.35, 0.60, 0.80, 0.25, 0.45, 0.10],
    "default_flag": [0, 0, 0, 1, 1, 0, 1, 0],
    "exposure_amount": [10000, 25000, 15000, 8000, 12000, 30000, 18000, 22000],
    "predicted_ead": [5000, 12500, 7500, 8000, 12000, 15000, 18000, 11000],
    "actual_ead": [4800, 13000, 7200, 7900, 11800, 14500, 17500, 10800],
    "region": ["North", "North", "South", "South", "East", "East", "West", "West"],
    "product": ["Loan", "Credit", "Loan", "Credit", "Loan", "Credit", "Loan", "Credit"]
})
print("Sample Dataset:")
print(df)
This creates a dataset with:
- Probability: Model-predicted probability of default (0.0-1.0)
- Default Flag: Actual default outcome (0=no default, 1=default)
- Exposure Amounts: Financial exposure values
- EAD Values: Predicted vs actual exposure at default
- Segments: Region and product for group analysis
Data Format Compatibility¶
The TNP Statistic Library is designed to work flexibly with your existing data formats:
Default Indicator Columns¶
For metrics that require default indicators (accuracy, AUC, etc.), you can use either:
- Numeric format: Traditional 0/1 values (0 = no default, 1 = default)
- Boolean format: True/False values (False = no default, True = default)
# Both formats work seamlessly:
# Using traditional 0/1 format
df_numeric = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "default_flag": [0, 1, 0]  # Numeric indicators
})
# Using boolean format
df_boolean = pl.DataFrame({
    "probability": [0.1, 0.8, 0.3],
    "is_default": [False, True, False]  # Boolean indicators
})
# Both work with all accuracy and discrimination metrics
from tnp_statistic_library.metrics import default_accuracy
# Numeric format
accuracy_numeric = default_accuracy(
    name="accuracy_test",
    dataset=df_numeric,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"  # 0/1 column
)
# Boolean format
accuracy_boolean = default_accuracy(
    name="accuracy_test",
    dataset=df_boolean,
    data_format="record_level",
    prob_def="probability",
    default="is_default"  # True/False column
)
Approach 1: Interactive Function Usage¶
Perfect for data exploration, Jupyter notebooks, and ad-hoc analysis:
Basic Accuracy Calculation¶
from tnp_statistic_library.metrics import default_accuracy
# Calculate overall model accuracy
accuracy_result = default_accuracy(
    name="model_validation",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
print(f"Model accuracy: {accuracy_result}")
Segmented Analysis¶
# Calculate accuracy by region
regional_accuracy = default_accuracy(
    name="regional_accuracy",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)
print(f"Regional accuracy breakdown: {regional_accuracy}")
Multiple Metrics¶
from tnp_statistic_library.metrics import auc, mean, ead_accuracy
# Calculate discrimination power
auc_result = auc(
    name="discrimination_power",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)
# Calculate EAD accuracy
ead_result = ead_accuracy(
    name="ead_validation",
    dataset=df,
    data_format="record_level",
    predicted_ead="predicted_ead",
    actual_ead="actual_ead",
    default="default_flag"
)
# Calculate mean exposure by region
exposure_mean = mean(
    name="regional_exposure",
    dataset=df,
    variable="exposure_amount",
    segment=["region"]
)
print(f"AUC by product: {auc_result}")
print(f"EAD accuracy: {ead_result}")
print(f"Mean exposure by region: {exposure_mean}")
Approach 2: YAML Workflow Configuration¶
Ideal for production pipelines, standardized reporting, and batch processing:
Creating a Configuration File¶
Create a file called portfolio_metrics.yaml:
datasets:
  portfolio_data:
    location: "portfolio_data.csv"
metrics:
  # Model validation suite
  accuracy_validation:
    metric_type: default_accuracy
    config:
      name: ["overall_accuracy", "regional_accuracy"]
      segment: [null, ["region"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
  discrimination_analysis:
    metric_type: auc
    config:
      name: ["product_auc"]
      segment: [["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
  exposure_summary:
    metric_type: mean
    config:
      name: ["regional_exposure", "product_exposure"]
      segment: [["region"], ["product"]]
      dataset: "portfolio_data"
      variable: "exposure_amount"
  ead_validation:
    metric_type: ead_accuracy
    config:
      name: ["ead_accuracy"]
      dataset: "portfolio_data"
      data_format: "record_level"
      predicted_ead: "predicted_ead"
      actual_ead: "actual_ead"
Executing the Workflow¶
from tnp_statistic_library.workflows import load_configuration_from_yaml
# First, save your data
df.write_csv("portfolio_data.csv")
# Load and execute all metrics from YAML
config = load_configuration_from_yaml("portfolio_metrics.yaml")
results = config.metrics.collect_all()
# Convert results to DataFrame for analysis
results_df = results.to_dataframe()
print("All Metric Results:")
print(results_df)
# Access individual results
for metric_name, result in results.items():
print(f"{metric_name}: {result}")
Understanding Fan-out Expansion¶
The YAML approach supports "fan-out expansion" where lists in configuration fields automatically generate multiple metrics:
metrics:
  multi_segment_analysis:
    metric_type: default_accuracy
    config:
      name: ["overall", "by_region", "by_product"]
      segment: [null, ["region"], ["product"]]
      dataset: "portfolio_data"
      data_format: "record_level"
      prob_def: "probability"
      default: "default_flag"
This single configuration generates three separate metrics:
- overall - No segmentation
- by_region - Segmented by region
- by_product - Segmented by product
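For comparison, the three metrics generated by this fan-out correspond roughly to the following interactive calls (a sketch reusing the function signature shown earlier, with df holding the portfolio data):
from tnp_statistic_library.metrics import default_accuracy
# "overall" - segment omitted, so no segmentation
overall = default_accuracy(
    name="overall",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag"
)
# "by_region" - segmented by region
by_region = default_accuracy(
    name="by_region",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["region"]
)
# "by_product" - segmented by product
by_product = default_accuracy(
    name="by_product",
    dataset=df,
    data_format="record_level",
    prob_def="probability",
    default="default_flag",
    segment=["product"]
)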
When to Use Each Approach¶
Use Interactive Functions When:¶
- Exploring data in Jupyter notebooks
- Performing ad-hoc analysis
- Need immediate results and interactive debugging
- Want full IDE support and type safety
- Working with dynamic or changing requirements
Use YAML Workflows When:¶
- Building production pipelines
- Need standardized, repeatable analysis
- Processing multiple datasets with same metrics
- Want to version control your metric configurations
- Running batch jobs or scheduled reports
- Need to generate many related metrics efficiently
Next Steps¶
- Examples - See comprehensive examples for all metric types
- API Reference - Explore all available functions and parameters
- Workflows Guide - Deep dive into YAML configuration options
Common Patterns¶
Loading Data from Various Sources¶
# From CSV
df = pl.read_csv("data.csv")
# From Parquet
df = pl.read_parquet("data.parquet")
# From a database (requires a connector; newer Polars versions use pl.read_database_uri for a plain URI string)
df = pl.read_database("SELECT * FROM portfolio", connection_uri)
# From existing pandas DataFrame
df = pl.from_pandas(pandas_df)
Error Handling¶
try:
    result = default_accuracy(
        name="test",
        dataset=df,
        data_format="record_level",
        prob_def="probability",
        default="default_flag"
    )
    print(f"Success: {result}")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Calculation error: {e}")