Metrics by Model Type¶
This guide explains which metrics are applicable to each model type in credit risk modeling: Probability of Default (PD), Exposure at Default (EAD), and Loss Given Default (LGD).
Overview¶
Different credit risk model types require different validation approaches and metrics. The TNP Statistic Library provides specialized metrics tailored to each model type's specific characteristics and requirements.
Probability of Default (PD) Models¶
PD models predict the probability that a borrower will default within a specific time horizon (typically 12 months). These models output probabilities between 0.0 and 1.0.
Applicable Metrics¶
Accuracy & Calibration¶
- Default Accuracy - Compares predicted probabilities to observed default rates
- Hosmer-Lemeshow Test - Tests goodness-of-fit for probability calibration
- Jeffreys Test - Bayesian approach to assess probability calibration
- Binomial Test - Statistical test for default rate validation against expected probabilities
- T-Test - Statistical test for systematic bias in PD predictions
Discrimination & Ranking¶
- AUC - Area Under the ROC Curve for ranking ability
- Gini Coefficient - Alternative discrimination measure (Gini = 2×AUC - 1); see the sketch after this list
- Kolmogorov-Smirnov - Maximum separation between distributions
- F1 Score - Balanced harmonic mean of precision and recall
- F2 Score - Recall-weighted harmonic mean (emphasizes recall over precision)
⚠️ Note: When using F1/F2 scores with summary-level data, results may be skewed because the classification threshold is applied at the bucket level rather than at the individual record level.
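To make the Gini relationship above concrete, here is a minimal, library-independent sketch; it assumes scikit-learn is available (not a dependency of the TNP Statistic Library) and uses toy data:
from sklearn.metrics import roc_auc_score

# Toy record-level data: predicted PDs and observed default flags
predicted_pd = [0.05, 0.12, 0.08, 0.25, 0.15]
default_flag = [0, 1, 0, 1, 0]

auc_value = roc_auc_score(default_flag, predicted_pd)  # ranking ability; 0.5 is random
gini_value = 2 * auc_value - 1                         # Gini = 2 x AUC - 1

print(f"AUC: {auc_value:.3f}, Gini: {gini_value:.3f}")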
Stability¶
- Population Stability Index - Monitor distribution shifts over time
Summary Statistics¶
- Mean - Mean predicted probability (or observed default rate) by segment
- Median - Median predicted probability by segment
Data Requirements for PD Models¶
Record-Level Data¶
# Required columns for PD model validation
{
"prob_def": [0.05, 0.12, 0.08, 0.25, 0.15], # Predicted probabilities (0.0-1.0)
"default": [0, 1, 0, 1, 0], # Default indicators (0/1 or boolean)
"segment": ["A", "A", "B", "B", "C"] # Optional: grouping variables
}
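As an illustrative sketch of what "comparing predicted probabilities to observed default rates" looks like on this layout (assuming pandas, which the TNP Statistic Library does not require you to use), the columns above can be assembled and summarised per segment:
import pandas as pd

records = pd.DataFrame({
    "prob_def": [0.05, 0.12, 0.08, 0.25, 0.15],  # predicted probabilities
    "default": [0, 1, 0, 1, 0],                  # default indicators
    "segment": ["A", "A", "B", "B", "C"],        # optional grouping
})

# Mean predicted PD vs. observed default rate, portfolio-wide and per segment
print(records[["prob_def", "default"]].mean())
print(records.groupby("segment")[["prob_def", "default"]].mean())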
Summary-Level Data¶
# Pre-aggregated data by risk grades
{
"mean_pd": [0.02, 0.05, 0.12, 0.25, 0.45], # Mean probability by grade
"defaults": [5, 12, 28, 45, 67], # Default counts
"volume": [1000, 800, 600, 400, 200], # Total observations
"risk_grade": ["AAA", "AA", "A", "BBB", "BB"] # Risk grades
}
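With summary-level data, the observed default rate per grade is simply defaults divided by volume; a plain-Python sketch of the comparison against the mean predicted PD (no library calls assumed):
mean_pd = [0.02, 0.05, 0.12, 0.25, 0.45]
defaults = [5, 12, 28, 45, 67]
volume = [1000, 800, 600, 400, 200]
risk_grade = ["AAA", "AA", "A", "BBB", "BB"]

for grade, predicted, d, n in zip(risk_grade, mean_pd, defaults, volume):
    observed = d / n  # observed default rate in the grade
    print(f"{grade}: predicted {predicted:.3f}, observed {observed:.3f}")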
Exposure at Default (EAD) Models¶
EAD models predict the outstanding exposure amount at the time of default for credit facilities (lines of credit, credit cards, etc.), so their outputs are monetary amounts rather than probabilities or rates.
Applicable Metrics¶
Accuracy¶
- EAD Accuracy - Specialized metric for EAD prediction accuracy
- MAPE - Mean Absolute Percentage Error for scale-independent accuracy assessment (a formula sketch follows this list)
- RMSE - Root Mean Squared Error for continuous predictions
- T-Test - Statistical test for systematic bias in EAD predictions
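For intuition, the MAPE and RMSE formulas can be reproduced by hand with numpy on the record-level EAD columns shown below; this is an illustrative sketch, not the library's implementation:
import numpy as np

actual = np.array([4800, 13000, 7200, 14500], dtype=float)     # actual exposure at default
predicted = np.array([5000, 12000, 8000, 15000], dtype=float)  # predicted exposure

mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # percentage error, scale-independent
rmse = np.sqrt(np.mean((actual - predicted) ** 2))           # in the same currency units as the EAD values

print(f"MAPE: {mape:.1f}%, RMSE: {rmse:,.0f}")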
Summary Statistics¶
- Mean - Mean predicted and actual EAD by segment
- Median - Median predicted and actual EAD by segment
Stability¶
- Population Stability Index - Monitor EAD distribution shifts
Data Requirements for EAD Models¶
Record-Level Data¶
# Required columns for EAD model validation
{
"predicted_ead": [5000, 12000, 8000, 15000], # Predicted exposure amounts
"actual_ead": [4800, 13000, 7200, 14500], # Actual exposure at default
"default": [1, 1, 1, 1], # Must be 1 (defaulted accounts only)
"segment": ["Credit Card", "LOC", "Credit Card", "LOC"] # Optional grouping
}
Summary-Level Data¶
# Pre-aggregated defaulted account data
{
"predicted_ead": [8500, 12000, 15500], # Mean predicted EAD by group
"actual_ead": [8200, 11800, 15200], # Mean actual EAD by group
"defaults": [45, 32, 28], # Number of defaults in each group
"product_type": ["Credit Card", "LOC", "Term Loan"]
}
Important Notes for EAD Models¶
- Defaulted Accounts Only: EAD accuracy can only be calculated for accounts that actually defaulted
- Positive Values: EAD amounts must be positive (negative values are not allowed); see the pre-validation sketch after this list
- Currency Consistency: Ensure predicted and actual amounts are in the same units
- Facility Types: Different facility types may require separate validation
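A minimal pre-validation sketch for the first two notes, assuming pandas and the column names from the record-level example above (neither is mandated by the library):
import pandas as pd

accounts = pd.DataFrame({
    "predicted_ead": [5000, 12000, 8000, 15000],
    "actual_ead": [4800, 13000, 7200, 14500],
    "default": [1, 1, 1, 1],
    "segment": ["Credit Card", "LOC", "Credit Card", "LOC"],
})

# EAD accuracy is only defined for accounts that actually defaulted
defaulted = accounts[accounts["default"] == 1]

# EAD amounts must be positive before any accuracy metric is run
assert (defaulted["predicted_ead"] > 0).all(), "non-positive predicted EAD found"
assert (defaulted["actual_ead"] > 0).all(), "non-positive actual EAD found"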
Loss Given Default (LGD) Models¶
LGD models predict the percentage of exposure that will be lost if default occurs. These models typically output loss rates between 0.0 and 1.0 (equivalently, 0% to 100%).
Applicable Metrics¶
Accuracy¶
- MAPE - Mean Absolute Percentage Error for scale-independent LGD accuracy assessment
- RMSE - Root Mean Squared Error for continuous LGD predictions
- Default Accuracy - Can be adapted for LGD rate validation
- T-Test - Statistical test for systematic bias in LGD predictions
Summary Statistics¶
- Mean - Mean predicted and actual LGD by segment
- Median - Median predicted and actual LGD by segment
Stability¶
- Population Stability Index - Monitor LGD distribution shifts
Data Requirements for LGD Models¶
Record-Level Data¶
# Required columns for LGD model validation
{
"predicted_lgd": [0.45, 0.32, 0.58, 0.41], # Predicted LGD rates (0.0-1.0)
"actual_lgd": [0.42, 0.35, 0.55, 0.38], # Actual LGD rates observed
"default": [1, 1, 1, 1], # Must be 1 (defaulted accounts only)
"segment": ["Secured", "Unsecured", "Secured", "Unsecured"]
}
Summary-Level Data¶
# Pre-aggregated defaulted account data
{
"predicted_lgd": [0.35, 0.55, 0.48], # Mean predicted LGD by group
"actual_lgd": [0.32, 0.52, 0.45], # Mean actual LGD by group
"defaults": [120, 85, 95], # Number of defaults in each group
"collateral_type": ["Secured", "Unsecured", "Semi-Secured"]
}
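Using the fields above, a portfolio-level comparison can weight each group's mean LGD by its default count; a plain-Python sketch:
predicted_lgd = [0.35, 0.55, 0.48]
actual_lgd = [0.32, 0.52, 0.45]
defaults = [120, 85, 95]

total_defaults = sum(defaults)

# Default-weighted portfolio averages of predicted and actual LGD
portfolio_predicted = sum(p * d for p, d in zip(predicted_lgd, defaults)) / total_defaults
portfolio_actual = sum(a * d for a, d in zip(actual_lgd, defaults)) / total_defaults

print(f"Portfolio predicted LGD: {portfolio_predicted:.3f}")
print(f"Portfolio actual LGD:    {portfolio_actual:.3f}")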
Important Notes for LGD Models¶
- Defaulted Accounts Only: LGD can only be observed and validated for accounts that defaulted
- Recovery Period: Consider the time horizon for recovery processes (see the realized-LGD sketch after this list)
- Collateral Types: Different collateral types typically have different LGD characteristics
- Economic Conditions: LGD can vary significantly with economic cycles
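Where realized LGD has to be derived before it can be validated, one common convention is one minus the recovery rate on the defaulted exposure. The sketch below is illustrative and deliberately simplified: it ignores discounting of recoveries and workout costs, which is what the recovery-period note above is concerned with.
# Hypothetical defaulted facilities: exposure at default and total recoveries collected
ead_amounts = [10000.0, 25000.0, 8000.0]
recovered_amounts = [7000.0, 12000.0, 8500.0]

realized_lgd = []
for ead, recovered in zip(ead_amounts, recovered_amounts):
    lgd = 1.0 - recovered / ead       # 1 - recovery rate
    lgd = min(max(lgd, 0.0), 1.0)     # clamp to [0, 1]; recoveries can exceed the exposure
    realized_lgd.append(lgd)

print(realized_lgd)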
Cross-Model Metrics¶
Some metrics can be used across multiple model types:
Universal Metrics¶
- Population Stability Index - Monitor any distribution shifts (a formula sketch follows this list)
- Mean - Calculate means for any numeric variable
- Median - Calculate medians for any numeric variable
- MAPE - Scale-independent error measurement for any continuous predictions
- RMSE - Error measurement for any continuous predictions
- T-Test - Statistical significance testing for model bias detection and comparison
- Binomial Test - Statistical test for default rate validation (applicable to any binary outcome)
- Shapiro-Wilk Test - Normality testing for statistical assumptions validation
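Because the Population Stability Index appears for every model type, a hand-rolled sketch of its usual formula, PSI = Σ (current_share − baseline_share) × ln(current_share / baseline_share) over the distribution bands, may help with interpretation; this is illustrative numpy, not the library's implementation:
import numpy as np

# Share of the portfolio in each band (each set sums to 1) for two periods
baseline_share = np.array([0.30, 0.40, 0.20, 0.10])
current_share = np.array([0.25, 0.35, 0.25, 0.15])

psi_terms = (current_share - baseline_share) * np.log(current_share / baseline_share)
psi = psi_terms.sum()

# Common rules of thumb: < 0.10 stable, 0.10-0.25 moderate shift, > 0.25 significant shift
print(f"PSI: {psi:.4f}")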
Model-Specific Metrics¶
| Model Type | Specialized Metrics | Primary Use Case |
|---|---|---|
| PD | Default Accuracy, Hosmer-Lemeshow, Jeffreys Test, Binomial Test, AUC, Gini, KS, F1 Score, F2 Score, T-Test | Probability calibration and discrimination |
| EAD | EAD Accuracy, MAPE, RMSE, T-Test | Exposure amount prediction accuracy |
| LGD | MAPE, RMSE, T-Test | Loss rate prediction accuracy |
Best Practices¶
PD Model Validation¶
- Calibration First: Use Hosmer-Lemeshow or Jeffreys Test to assess probability calibration
- Discrimination Second: Use AUC/Gini to measure ranking ability
- Segmented Analysis: Validate across different risk segments and time periods
- Stability Monitoring: Use PSI to monitor model performance over time
EAD Model Validation¶
- Accuracy Focus: EAD Accuracy is the primary metric for exposure predictions
- Facility-Specific: Validate separately by facility type (credit cards, lines of credit, etc.)
- Economic Conditions: Consider validation across different economic scenarios
- Currency Consistency: Ensure all amounts are in consistent units
LGD Model Validation¶
- Continuous Validation: Use RMSE for overall prediction accuracy
- Collateral-Specific: Validate separately by collateral type
- Recovery Timeline: Consider different recovery periods in validation
- Economic Sensitivity: Validate across different economic conditions
Example Validation Workflows¶
Complete PD Model Validation¶
from tnp_statistic_library.metrics import (
default_accuracy, hosmer_lemeshow, binomial_test, auc, f1_score, f2_score,
population_stability_index, ttest, shapiro_wilk
)
# Calibration assessment
calibration = hosmer_lemeshow(
name="pd_calibration",
dataset=validation_data,
data_format="record_level",
prob_def="predicted_pd",
default="default_flag"
)
# Default rate validation
default_rate_test = binomial_test(
name="pd_default_rate_validation",
dataset=validation_data,
data_format="record_level",
default="default_flag",
expected_probability=0.05, # Expected 5% default rate
segment=["risk_grade"]
)
# Discrimination assessment
discrimination = auc(
name="pd_discrimination",
dataset=validation_data,
data_format="record_level",
prob_def="predicted_pd",
default="default_flag"
)
# Classification performance (F1 balanced, F2 recall-focused)
f1_performance = f1_score(
name="pd_f1_classification",
dataset=validation_data,
data_format="record_level",
prob_def="predicted_pd",
default="default_flag",
threshold=0.5
)
f2_performance = f2_score(
name="pd_f2_classification",
dataset=validation_data,
data_format="record_level",
prob_def="predicted_pd",
default="default_flag",
threshold=0.3 # Lower threshold for better recall
)
# Stability monitoring
stability = population_stability_index(
name="pd_stability",
dataset=combined_data, # Dataset with both baseline and current data
data_format="record_level",
band_column="pd_band", # Pre-defined probability bands
baseline_column="is_baseline", # Indicator column (1 for baseline period)
current_column="is_current" # Indicator column (1 for current period)
)
# Bias testing
bias_test = ttest(
name="pd_bias_test",
dataset=validation_data,
data_format="record_level",
observed="default_flag", # Convert to differences if needed
predicted="predicted_pd",
null_hypothesis_mean=0.0
)
# Normality testing for model residuals (required for t-test validity)
residual_normality = shapiro_wilk(
name="pd_residual_normality",
dataset=validation_data,
data_format="record_level",
data_column="model_residuals", # Pre-computed residuals
segment=["risk_grade"]
)
Complete EAD Model Validation¶
from tnp_statistic_library.metrics import ead_accuracy, mape, rmse, ttest, shapiro_wilk
# Primary EAD accuracy metric
ead_acc = ead_accuracy(
name="ead_validation",
dataset=defaulted_accounts,
data_format="record_level",
predicted_ead="predicted_exposure",
actual_ead="actual_exposure",
default="default_flag",
segment=["facility_type"]
)
# Additional error measurements
ead_mape = mape(
name="ead_mape",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_exposure",
predicted="predicted_exposure"
)
ead_rmse = rmse(
name="ead_rmse",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_exposure",
predicted="predicted_exposure"
)
# EAD bias testing
ead_bias_test = ttest(
name="ead_bias_test",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_exposure",
predicted="predicted_exposure",
null_hypothesis_mean=0.0
)
# Normality testing for EAD residuals (validates t-test assumptions)
ead_residual_normality = shapiro_wilk(
name="ead_residual_normality",
dataset=defaulted_accounts,
data_format="record_level",
data_column="ead_residuals", # Pre-computed residuals
segment=["facility_type"]
)
Complete LGD Model Validation¶
from tnp_statistic_library.metrics import mape, rmse, mean, median, ttest, shapiro_wilk
# Primary LGD error measurements
lgd_mape = mape(
name="lgd_mape",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_lgd",
predicted="predicted_lgd",
segment=["collateral_type"]
)
lgd_rmse = rmse(
name="lgd_rmse",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_lgd",
predicted="predicted_lgd",
segment=["collateral_type"]
)
# LGD summary statistics by segment
lgd_means = mean(
name="lgd_mean_by_collateral",
dataset=defaulted_accounts,
variable="predicted_lgd",
segment=["collateral_type"]
)
# LGD bias testing
lgd_bias_test = ttest(
name="lgd_bias_test",
dataset=defaulted_accounts,
data_format="record_level",
observed="actual_lgd",
predicted="predicted_lgd",
null_hypothesis_mean=0.0,
segment=["collateral_type"]
)
# Normality testing for LGD residuals (validates t-test assumptions)
lgd_residual_normality = shapiro_wilk(
name="lgd_residual_normality",
dataset=defaulted_accounts,
data_format="record_level",
data_column="lgd_residuals", # Pre-computed residuals
segment=["collateral_type"]
)
This guide provides a comprehensive reference for selecting appropriate metrics for each credit risk model type, ensuring proper validation and monitoring of model performance.