Hosmer-Lemeshow Test Metric¶
The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.
Metric Type: hosmer_lemeshow
Test Overview¶
The Hosmer-Lemeshow test evaluates model calibration by:
- Dividing data into probability bands (default: 10 bands)
- Comparing observed vs expected default frequencies in each band
- Computing a chi-square test statistic
- Calculating a p-value for the goodness-of-fit
Low p-values (conventionally ≤ 0.05) suggest poor model calibration.
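The steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `hosmer_lemeshow_statistic` is hypothetical, and it assumes bands are formed by sorting on predicted probability and splitting into groups of (nearly) equal size, which may differ from how the metric builds its bands.

```python
def hosmer_lemeshow_statistic(probs, defaults, bands=10):
    """Return (hl_statistic, degrees_of_freedom) for record-level data.

    Assumption: equal-count bands after sorting by predicted probability.
    """
    if bands < 2:
        raise ValueError("bands must be at least 2")
    if len(probs) != len(defaults):
        raise ValueError("probs and defaults must have the same length")
    if len(probs) < bands:
        raise ValueError("need at least one observation per band")

    # Sort records by predicted probability so each band covers a score range.
    records = sorted(zip(probs, defaults), key=lambda r: r[0])
    n = len(records)

    statistic = 0.0
    for b in range(bands):
        band = records[b * n // bands:(b + 1) * n // bands]
        size = len(band)
        expected = sum(p for p, _ in band)          # expected defaults in band
        observed = sum(1 for _, d in band if d)     # observed defaults in band
        variance = expected * (1.0 - expected / size)
        # Simplification: bands with zero variance (all p = 0 or all p = 1)
        # are skipped rather than treated as an infinite contribution.
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance

    return statistic, bands - 2
```

The returned statistic corresponds to the `hl_statistic` output column in spirit; the p-value is then the upper tail of a chi-square distribution with the returned degrees of freedom.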
Configuration Fields¶
Record-Level Data Format¶
For individual loan/account records:
metrics:
  calibration_test:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["model_calibration"]
      data_format: "record_level"
      prob_def: "predicted_probability" # Column with predicted probabilities (0.0-1.0)
      default: "default_flag" # Column with default indicators (0/1 or boolean)
      bands: 10 # Number of probability bands (default: 10)
      segment: [["product_type"]] # Optional: segmentation columns
      dataset: "loan_portfolio"
Summary-Level Data Format¶
For pre-aggregated data:
metrics:
  summary_calibration:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["aggregated_calibration"]
      data_format: "summary_level"
      mean_pd: "avg_probability" # Column with mean probabilities (0.0-1.0)
      defaults: "default_count" # Column with default counts
      volume: "total_count" # Column with total observation counts
      bands: 8 # Number of probability bands
      segment: [["risk_grade"]] # Optional: segmentation columns
      dataset: "risk_summary"
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record_level"
- prob_def: Probability column name
- default: Default indicator column name
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary_level"
- mean_pd: Mean probability column name
- defaults: Default count column name
- volume: Volume count column name
- dataset: Dataset reference
Optional Fields¶
- segment: List of column names for grouping
- bands: Number of probability bands (default: 10, minimum: 2)
Output Columns¶
The metric produces the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of defaults
- pd: Mean predicted default probability
- bands: Number of bands used in the test
- hl_statistic: Hosmer-Lemeshow chi-square test statistic
- pvalue: P-value for the test (lower indicates worse calibration)
Fan-out Examples¶
Multiple Calibration Tests¶
metrics:
  calibration_suite:
    metric_type: "hosmer_lemeshow"
    config:
      name:
        ["overall_calibration", "product_calibration", "vintage_calibration"]
      segment: [null, ["product_type"], ["origination_year"]]
      data_format: "record_level"
      prob_def: "model_probability"
      default: "default_indicator"
      bands: 10
      dataset: "validation_data"
This creates three calibration tests:
- Overall model calibration
- Calibration by product type
- Calibration by origination vintage
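One way to picture the fan-out is as a positional pairing of `name` and `segment`, with every other field shared. The sketch below is only an illustration of that pairing: `expand_fan_out` is a hypothetical helper, and its handling of a missing `segment` key or mismatched list lengths is an assumption, not documented library behavior.

```python
def expand_fan_out(config):
    """Pair name[i] with segment[i]; copy all other fields to each test."""
    names = config["name"]
    # Assumption: a missing segment list means "no segmentation" for every name.
    segments = config.get("segment") or [None] * len(names)
    if len(segments) != len(names):
        raise ValueError("name and segment must have the same length")

    shared = {k: v for k, v in config.items() if k not in ("name", "segment")}
    return [dict(shared, name=n, segment=s) for n, s in zip(names, segments)]


config = {
    "name": ["overall_calibration", "product_calibration", "vintage_calibration"],
    "segment": [None, ["product_type"], ["origination_year"]],
    "data_format": "record_level",
    "prob_def": "model_probability",
    "default": "default_indicator",
    "bands": 10,
    "dataset": "validation_data",
}
tests = expand_fan_out(config)
for test in tests:
    print(test["name"], test["segment"], test["bands"])
```

Because shared fields are copied unchanged, a single `bands` value applies to every test in the fan-out.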
Different Band Configurations¶
metrics:
  band_sensitivity:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["hl_5_bands", "hl_10_bands", "hl_20_bands"]
      segment: [null, null, null]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      bands: 5 # Note: this single value applies to all three tests, despite their names
      dataset: "sensitivity_test_data"
For different band counts, you need separate metric configurations:
metrics:
  hl_5_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_5_bands"]
      bands: 5
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"
  hl_10_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_10_bands"]
      bands: 10
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"
Data Requirements¶
Record-Level Data¶
- One row per loan/account
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0/1 or boolean)
- Sufficient data for meaningful test (recommended: at least 10 × bands observations)
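The record-level requirements can be checked before running the metric. The following is a rough pre-flight sketch under stated assumptions: `check_record_level` is a hypothetical helper (not part of the library), and it treats the "10 × bands" recommendation as a hard check purely for illustration.

```python
def _is_probability(value):
    # Booleans are excluded so that True/False are not accepted as scores.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and 0.0 <= value <= 1.0  # also False for NaN
    )


def check_record_level(probs, defaults, bands=10):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(probs) != len(defaults):
        problems.append("probability and default columns differ in length")

    for i, p in enumerate(probs):
        if not _is_probability(p):
            problems.append(f"row {i}: probability {p!r} is not in [0.0, 1.0]")

    for i, d in enumerate(defaults):
        if d not in (0, 1):  # True/False compare equal to 1/0
            problems.append(f"row {i}: default flag {d!r} is not binary")

    if len(probs) < 10 * bands:
        problems.append(
            f"only {len(probs)} observations; at least {10 * bands} recommended"
        )
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to report every data issue in one pass.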
Summary-Level Data¶
- One row per group/segment
- Mean probabilities: numeric values between 0.0 and 1.0
- Default counts: non-negative numbers or None (negative values not allowed)
- Volume counts: non-negative numbers or None (negative values not allowed)
- Adequate sample sizes for chi-square test validity
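A common rule of thumb for chi-square tests is that each cell should have an expected count of at least 5. That threshold is a general convention, not something this metric is documented to enforce. A small sketch for flagging summary rows that fall short, using the column names from the summary-level example (`avg_probability`, `total_count`) and a hypothetical helper name:

```python
def small_expected_cells(rows, min_expected=5.0):
    """Return indices of rows whose expected default or non-default count
    falls below min_expected (rule-of-thumb threshold, assumed here)."""
    flagged = []
    for index, row in enumerate(rows):
        volume = row["total_count"]
        mean_pd = row["avg_probability"]
        if volume is None or mean_pd is None:
            continue  # missing values are left to other data-quality checks
        expected_defaults = volume * mean_pd
        expected_non_defaults = volume * (1.0 - mean_pd)
        if min(expected_defaults, expected_non_defaults) < min_expected:
            flagged.append(index)
    return flagged


rows = [
    {"avg_probability": 0.01, "total_count": 100},   # expects 1 default
    {"avg_probability": 0.20, "total_count": 100},   # expects 20 defaults
    {"avg_probability": 0.50, "total_count": None},  # skipped
]
print(small_expected_cells(rows))
```

Flagged rows are candidates for merging with neighbouring groups or for using fewer bands.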
Interpretation Guidelines¶
P-value Interpretation¶
- p > 0.05: No significant evidence of miscalibration (fail to reject good fit)
- 0.01 < p ≤ 0.05: Model calibration is questionable (reject good fit at the 5% level)
- p ≤ 0.01: Model calibration is poor (strong evidence against good fit)
Test Statistic¶
- Higher hl_statistic values indicate worse calibration
- Statistic follows chi-square distribution with (bands - 2) degrees of freedom
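Given the statistic and the degrees of freedom, the p-value is the upper-tail probability of the chi-square distribution. The sketch below computes it with the standard library only, using the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Γ(a + 1) for the regularized upper incomplete gamma function. The helper name `chi2_sf` is illustrative; how the metric itself computes `pvalue` is not stated in this document.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with
    integer degrees of freedom df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0

    y = x / 2.0
    if df % 2 == 0:
        a = 1.0
        q = math.exp(-y)                 # Q(1, y)
    else:
        a = 0.5
        q = math.erfc(math.sqrt(y))      # Q(1/2, y)

    # Step a up to df / 2, adding one term of the recurrence each time.
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


bands = 10
hl_statistic = 15.507  # illustrative value, not real output
print(chi2_sf(hl_statistic, bands - 2))
```

With 10 bands the test uses 8 degrees of freedom, so a statistic near 15.5 sits right at the conventional 0.05 threshold.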
Important Notes¶
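- Sample Size: Test requires adequate sample size for reliable results (at least 10 × bands observations is recommended); with very large samples, even small calibration deviations can produce low p-values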
- Band Selection: More bands provide finer resolution but require larger samples
- Multiple Testing: When testing multiple segments, consider multiple testing corrections
- Data Quality: Remove observations with missing or invalid probability scores
- Model Assumptions: Test assumes the logistic regression model form is appropriate
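For the multiple-testing note, Holm's step-down procedure is one widely used correction; the document does not prescribe a specific method, so this choice is an assumption. A minimal sketch that adjusts a list of per-segment p-values (the helper name `holm_adjust` is hypothetical):

```python
def holm_adjust(pvalues):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Visit p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


segment_pvalues = [0.01, 0.04, 0.03]  # illustrative values only
print(holm_adjust(segment_pvalues))
```

The adjusted values can then be compared against the same 0.05 and 0.01 thresholds used for a single test.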