Hosmer-Lemeshow Test Metric
The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.
Metric Type: hosmer_lemeshow
Test Overview
The Hosmer-Lemeshow test evaluates model calibration by:
- Dividing data into probability bands (default: 10 bands)
- Comparing observed vs expected default frequencies in each band
- Computing a chi-square test statistic
- Calculating a p-value for the goodness-of-fit
A low p-value (≤ 0.05) suggests poor model calibration.
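The banding, comparison, and statistic steps above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the function name `hl_statistic` is hypothetical, and it assumes equal-count bands formed by sorting records on predicted probability, which may differ from how the metric actually bands data. Each band contributes (observed - expected)² / (volume × mean_pd × (1 - mean_pd)).

```python
from typing import Sequence, Tuple


def hl_statistic(
    probs: Sequence[float], defaults: Sequence[int], bands: int = 10
) -> Tuple[float, int]:
    """Return (statistic, degrees_of_freedom) for record-level inputs."""
    if bands < 2:
        raise ValueError("bands must be at least 2")
    if len(probs) != len(defaults):
        raise ValueError("probs and defaults must have the same length")

    # Assumption: equal-count bands over records sorted by predicted probability.
    records = sorted(zip(probs, defaults))
    n = len(records)

    statistic = 0.0
    for b in range(bands):
        # Integer slicing spreads any remainder across the bands.
        chunk = records[b * n // bands : (b + 1) * n // bands]
        if not chunk:
            continue
        volume = len(chunk)
        observed = sum(d for _, d in chunk)  # observed defaults in the band
        expected = sum(p for p, _ in chunk)  # expected defaults = sum of PDs
        mean_pd = expected / volume
        variance = volume * mean_pd * (1.0 - mean_pd)
        if variance > 0.0:  # skip degenerate bands whose mean PD is 0 or 1
            statistic += (observed - expected) ** 2 / variance
    return statistic, bands - 2
```

The p-value is then the upper-tail chi-square probability of the returned statistic at the returned degrees of freedom.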
Configuration Fields
Record-Level Data Format
For individual loan/account records:
collections:
  calibration_test:
    metrics:
      - name:
          - model_calibration
        data_format: record
        prob_def: predicted_probability
        default: default_flag
        bands: 10
        segment:
          - - product_type
        metric_type: hosmer_lemeshow
        dataset: loan_portfolio
Summary-Level Data Format
For pre-aggregated data:
collections:
  summary_calibration:
    metrics:
      - name:
          - aggregated_calibration
        data_format: summary
        mean_pd: avg_probability
        defaults: default_count
        volume: total_count
        bands: 8
        segment:
          - - risk_grade
        metric_type: hosmer_lemeshow
        dataset: risk_summary
Required Fields by Format
Record-Level Required
- name: Metric name(s)
- data_format: Must be "record"
- prob_def: Probability column name
- default: Default indicator column name
- dataset: Dataset reference
Summary-Level Required
- name: Metric name(s)
- data_format: Must be "summary"
- mean_pd: Mean probability column name
- defaults: Default count column name
- volume: Volume count column name
- dataset: Dataset reference
Optional Fields
- segment: List of column names for grouping
- bands: Number of probability bands (default: 10, minimum: 2)
Output Columns
The metric produces the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of defaults
- pd: Mean predicted default probability
- bands: Number of bands used in the test
- hl_statistic: Hosmer-Lemeshow chi-square test statistic
- p_value: P-value for the test (lower indicates worse calibration)
Fan-out Examples
Multiple Calibration Tests
collections:
  calibration_suite:
    metrics:
      - name:
          - overall_calibration
          - product_calibration
          - vintage_calibration
        segment:
          - null
          - - product_type
          - - origination_year
        data_format: record
        prob_def: model_probability
        default: default_indicator
        bands: 10
        metric_type: hosmer_lemeshow
        dataset: validation_data
This creates three calibration tests:
- Overall model calibration
- Calibration by product type
- Calibration by origination vintage
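The pairing of names with segments can be pictured with a short Python sketch. The helper `expand_fan_out` is hypothetical and only mirrors the behaviour described above (the i-th name is matched with the i-th segment entry, and the remaining fields are shared); it is not the library's own expansion code, and the equal-length check is an assumption.

```python
from typing import Any, Dict, List, Optional


def expand_fan_out(
    names: List[str],
    segments: List[Optional[List[str]]],
    shared: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Pair names[i] with segments[i]; copy the shared scalar fields to each."""
    # Assumption: both lists must be the same length.
    if len(names) != len(segments):
        raise ValueError("name and segment lists must have the same length")
    return [
        {"name": name, "segment": segment, **shared}
        for name, segment in zip(names, segments)
    ]


tests = expand_fan_out(
    names=["overall_calibration", "product_calibration", "vintage_calibration"],
    segments=[None, ["product_type"], ["origination_year"]],
    shared={"bands": 10, "metric_type": "hosmer_lemeshow"},
)
```

Here `tests` holds three entries; the first has no segmentation, and all three carry the same `bands` value.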
Different Band Configurations
collections:
  band_sensitivity:
    metrics:
      - name:
          - hl_5_bands
          - hl_10_bands
          - hl_20_bands
        segment:
          - null
          - null
          - null
        data_format: record
        prob_def: risk_score
        default: default_flag
        bands: 5
        metric_type: hosmer_lemeshow
        dataset: sensitivity_test_data
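Because bands is a single value shared by every name in the list, the configuration above runs all three metrics with 5 bands, whatever their names suggest. For different band counts, you need separate metric configurations: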
collections:
  hl_5_bands:
    metrics:
      - name:
          - calibration_5_bands
        bands: 5
        data_format: record
        prob_def: risk_score
        default: default_flag
        metric_type: hosmer_lemeshow
        dataset: test_data
  hl_10_bands:
    metrics:
      - name:
          - calibration_10_bands
        bands: 10
        data_format: record
        prob_def: risk_score
        default: default_flag
        metric_type: hosmer_lemeshow
        dataset: test_data
Data Requirements
Record-Level Data
- One row per loan/account
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0/1 or boolean)
- Sufficient data for meaningful test (recommended: at least 10 × bands observations)
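These record-level requirements can be checked before running the metric. The sketch below is one possible pre-flight check, assuming the two columns have already been loaded into Python sequences; the function name `check_record_inputs` and its messages are hypothetical and not part of the library.

```python
from typing import List, Optional, Sequence


def check_record_inputs(
    probs: Sequence[Optional[float]],
    defaults: Sequence[object],
    bands: int = 10,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if len(probs) != len(defaults):
        problems.append("probability and default columns differ in length")

    # Probabilities must be present and lie in [0.0, 1.0]; NaN fails the range test.
    bad_probs = sum(1 for p in probs if p is None or not 0.0 <= p <= 1.0)
    if bad_probs:
        problems.append(f"{bad_probs} probabilities missing or outside [0.0, 1.0]")

    # Defaults must be 0/1 or boolean (True == 1 and False == 0 in Python).
    bad_flags = sum(1 for d in defaults if d not in (0, 1))
    if bad_flags:
        problems.append(f"{bad_flags} default values are not 0/1 or boolean")

    # Recommended minimum from the guideline above: 10 x bands observations.
    if len(probs) < 10 * bands:
        problems.append(
            f"only {len(probs)} observations; recommended minimum is {10 * bands}"
        )
    return problems
```

Rows flagged for missing or out-of-range probabilities would need to be removed or repaired before the test is run.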
Summary-Level Data
- One row per group/segment
- Mean probabilities: numeric values between 0.0 and 1.0
- Default counts: non-negative numbers or None (negative values not allowed)
- Volume counts: non-negative numbers or None (negative values not allowed)
- Adequate sample sizes for chi-square test validity
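To show how the summary-level columns relate to record-level data, the sketch below aggregates individual records into one row per group. It is an illustration only: the helper `summarise` is hypothetical, and the output column names (`avg_probability`, `default_count`, `total_count`) simply reuse those from the summary-level configuration example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def summarise(
    records: Iterable[Dict[str, Any]],
    group_key: str,
    prob_key: str,
    default_key: str,
) -> List[Dict[str, Any]]:
    """Collapse record-level rows into one summary row per group."""
    # Per group: [sum of probabilities, number of defaults, number of records]
    totals: Dict[Any, List[float]] = defaultdict(lambda: [0.0, 0, 0])
    for row in records:
        acc = totals[row[group_key]]
        acc[0] += row[prob_key]
        acc[1] += int(row[default_key])
        acc[2] += 1

    return [
        {
            group_key: group,
            "avg_probability": prob_sum / volume,  # maps to mean_pd
            "default_count": n_defaults,           # maps to defaults
            "total_count": volume,                 # maps to volume
        }
        for group, (prob_sum, n_defaults, volume) in sorted(totals.items())
    ]
```

A call such as `summarise(rows, "risk_grade", "predicted_probability", "default_flag")` would yield rows in the shape the summary format expects.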
Interpretation Guidelines
P-value Interpretation
- p > 0.05: Model calibration is acceptable (fail to reject good fit)
- p ≤ 0.05: Model calibration is questionable (reject good fit)
- p ≤ 0.01: Model calibration is poor (strong evidence against good fit)
Test Statistic
- Higher hl_statistic values indicate worse calibration
- Under the null hypothesis of good calibration, the statistic approximately follows a chi-square distribution with (bands - 2) degrees of freedom
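The link between the statistic and the p-value can be made concrete with a small standard-library sketch. The function `chi2_sf` is a hypothetical helper, not the library's code; it evaluates the upper-tail chi-square probability using the standard recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Γ(a + 1) for the regularized upper incomplete gamma function, starting from the closed forms for 1 and 2 degrees of freedom.

```python
import math


def chi2_sf(statistic: float, df: int) -> float:
    """Upper-tail probability P(X > statistic) for X ~ chi-square(df)."""
    if df < 1:
        # With df = bands - 2, this means at least 3 bands are needed here.
        raise ValueError("chi-square needs at least one degree of freedom")
    if statistic < 0:
        raise ValueError("statistic must be non-negative")

    y = statistic / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # df = 2: Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # df = 1: Q(1/2, y) = erfc(sqrt(y))

    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Example: p-value for a 10-band test (8 degrees of freedom).
p_value = chi2_sf(15.507, 10 - 2)
```

The value 15.507 is the familiar 5% critical value for 8 degrees of freedom, so `p_value` here is approximately 0.05; a larger hl_statistic gives a smaller p-value.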
Important Notes
- Sample Size: Test requires adequate sample size for reliable results; as a guide, record-level data should have at least 10 × bands observations
- Band Selection: More bands provide finer resolution but require larger samples
- Multiple Testing: When testing multiple segments, consider multiple testing corrections
- Data Quality: Remove observations with missing or invalid probability scores
- Model Assumptions: Test assumes the logistic regression model form is appropriate
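For the multiple-testing note, one common option is Holm's step-down procedure, sketched below. This page does not prescribe a particular correction, so treat this as one possibility; the function name `holm_reject` is hypothetical, and its input is assumed to be a mapping from segment label to the p_value reported for that segment.

```python
from typing import Dict


def holm_reject(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return, per segment, whether good fit is rejected after Holm correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}

    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # once one test passes, all larger p-values pass too
    return rejected
```

With three segments at alpha = 0.05, the smallest p-value must be at most 0.05 / 3 to be flagged, the next at most 0.05 / 2, and the largest at most 0.05.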