Hosmer-Lemeshow Test Metric

The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.

Metric Type: hosmer_lemeshow

Test Overview

The Hosmer-Lemeshow test evaluates model calibration by:

Dividing data into probability bands (default: 10 bands)
Comparing observed vs expected default frequencies in each band
Computing a chi-square test statistic
Calculating a p-value for the goodness-of-fit

A low p-value (conventionally 0.05 or below) is evidence of poor model calibration.
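
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the metric's actual implementation: it assumes equal-count bands formed by sorting records on predicted probability (the banding rule used by the metric is not stated here), and the names chi2_sf and hosmer_lemeshow are hypothetical. It uses only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    # Start from the closed form for df = 1 (odd) or df = 2 (even) ...
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(t))
    else:
        a, q = 1.0, math.exp(-t)
    # ... then apply Q(a + 1, t) = Q(a, t) + t^a * e^-t / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
        a += 1.0
    return q


def hosmer_lemeshow(probs, defaults, bands=10):
    """Return (hl_statistic, p_value) for record-level data.

    Assumption: bands are equal-count groups of records sorted by probability.
    """
    if bands < 3:
        # This sketch needs at least one degree of freedom (bands - 2 >= 1).
        raise ValueError("this sketch requires bands >= 3")
    # Stable sort on probability only, so ties keep their input order.
    rows = sorted(zip(probs, defaults), key=lambda r: r[0])
    n = len(rows)
    stat = 0.0
    for g in range(bands):
        band = rows[g * n // bands:(g + 1) * n // bands]
        if not band:
            continue
        size = len(band)
        observed = sum(d for _, d in band)   # observed defaults in the band
        expected = sum(p for p, _ in band)   # expected defaults in the band
        mean_pd = expected / size
        variance = size * mean_pd * (1.0 - mean_pd)
        if variance > 0:  # bands with zero variance are skipped in this sketch
            stat += (observed - expected) ** 2 / variance
    return stat, chi2_sf(stat, bands - 2)
```

Calling hosmer_lemeshow(probs, defaults, bands=10) on two equal-length sequences returns the statistic and its p-value, using bands - 2 degrees of freedom.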

Configuration Fields

Record-Level Data Format

For individual loan/account records:

collections:
  calibration_test:
    metrics:
    - name:
      - model_calibration
      data_format: record
      prob_def: predicted_probability
      default: default_flag
      bands: 10
      segment:
      - - product_type
      metric_type: hosmer_lemeshow
    dataset: loan_portfolio

Summary-Level Data Format

For pre-aggregated data:

collections:
  summary_calibration:
    metrics:
    - name:
      - aggregated_calibration
      data_format: summary
      mean_pd: avg_probability
      defaults: default_count
      volume: total_count
      bands: 8
      segment:
      - - risk_grade
      metric_type: hosmer_lemeshow
    dataset: risk_summary

Required Fields by Format

Record-Level Required

name: Metric name(s)
data_format: Must be "record"
prob_def: Probability column name
default: Default indicator column name
dataset: Dataset reference

Summary-Level Required

name: Metric name(s)
data_format: Must be "summary"
mean_pd: Mean probability column name
defaults: Default count column name
volume: Volume count column name
dataset: Dataset reference

Optional Fields

segment: One list of grouping column names per entry in name (use null for an unsegmented test)
bands: Number of probability bands (default: 10, minimum: 2)

Output Columns

The metric produces the following output columns:

group_key: Segmentation group identifier (struct of segment values)
volume: Total number of observations
defaults: Total number of defaults
pd: Mean predicted default probability
bands: Number of bands used in the test
hl_statistic: Hosmer-Lemeshow chi-square test statistic
p_value: P-value for the test (lower values indicate stronger evidence of miscalibration)

Fan-out Examples

Multiple Calibration Tests

collections:
  calibration_suite:
    metrics:
    - name:
      - overall_calibration
      - product_calibration
      - vintage_calibration
      segment:
      - null
      - - product_type
      - - origination_year
      data_format: record
      prob_def: model_probability
      default: default_indicator
      bands: 10
      metric_type: hosmer_lemeshow
    dataset: validation_data

This creates three calibration tests:

Overall model calibration
Calibration by product type
Calibration by origination vintage
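
One way to read the fan-out above: the name and segment lists are paired by position, and scalar fields such as bands are shared by every resulting test. The Python sketch below illustrates that reading. The helper expand_fan_out is hypothetical and not part of the tool, and both the positional pairing and the treatment of a missing segment list as "unsegmented" are assumptions inferred from the examples.

```python
def expand_fan_out(metric):
    """Expand one fan-out metric entry into one dict per name/segment pair."""
    names = metric["name"]
    # Assumption: a missing segment list means every test is unsegmented.
    segments = metric.get("segment") or [None] * len(names)
    if len(segments) != len(names):
        raise ValueError("name and segment lists must have the same length")
    # Every other field is copied unchanged into each expanded test.
    shared = {k: v for k, v in metric.items() if k not in ("name", "segment")}
    return [
        {"name": name, "segment": segment, **shared}
        for name, segment in zip(names, segments)
    ]


calibration_suite = {
    "name": ["overall_calibration", "product_calibration", "vintage_calibration"],
    "segment": [None, ["product_type"], ["origination_year"]],
    "data_format": "record",
    "prob_def": "model_probability",
    "default": "default_indicator",
    "bands": 10,
    "metric_type": "hosmer_lemeshow",
}

for test in expand_fan_out(calibration_suite):
    print(test["name"], test["segment"], test["bands"])
```

Each printed line shows one expanded test with its own segment and the shared band count of 10.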

Different Band Configurations

collections:
  band_sensitivity:
    metrics:
    - name:
      - hl_5_bands
      - hl_10_bands
      - hl_20_bands
      segment:
      - null
      - null
      - null
      data_format: record
      prob_def: risk_score
      default: default_flag
      bands: 5
      metric_type: hosmer_lemeshow
    dataset: sensitivity_test_data

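The configuration above does not vary the band count: bands takes a single value, so all three metrics would use 5 bands despite their names. To test different band counts, define a separate metric configuration for each: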

collections:
  hl_5_bands:
    metrics:
    - name:
      - calibration_5_bands
      bands: 5
      data_format: record
      prob_def: risk_score
      default: default_flag
      metric_type: hosmer_lemeshow
    dataset: test_data
  hl_10_bands:
    metrics:
    - name:
      - calibration_10_bands
      bands: 10
      data_format: record
      prob_def: risk_score
      default: default_flag
      metric_type: hosmer_lemeshow
    dataset: test_data

Data Requirements

Record-Level Data

One row per loan/account
Probability column: numeric values between 0.0 and 1.0
Default column: binary values (0/1 or boolean)
Enough data for a meaningful test (recommended: at least 10 × bands observations)
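
A pre-flight check along these lines can catch the most common data problems before the metric runs. It is a Python sketch of the requirements listed above; check_record_data is a hypothetical helper, and the messages it returns are illustrative only.

```python
import math


def check_record_data(probs, defaults, bands=10):
    """Return a list of problems found in record-level inputs (empty if none)."""
    problems = []
    if len(probs) != len(defaults):
        problems.append("probability and default columns differ in length")

    # Probabilities must be present, not NaN, and within [0, 1].
    bad_probs = sum(
        1 for p in probs
        if p is None or math.isnan(p) or not 0.0 <= p <= 1.0
    )
    if bad_probs:
        problems.append(f"{bad_probs} missing or out-of-range probabilities")

    # Defaults must be binary: 0/1 or False/True (True == 1 in Python).
    bad_defaults = sum(1 for d in defaults if d is None or d not in (0, 1))
    if bad_defaults:
        problems.append(f"{bad_defaults} non-binary default values")

    # Recommended minimum from the requirements: 10 observations per band.
    if len(probs) < 10 * bands:
        problems.append(f"fewer than {10 * bands} observations for {bands} bands")
    return problems
```

An empty list means the inputs passed every check; otherwise each entry describes one category of problem.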

Summary-Level Data

One row per group/segment
Mean probabilities: numeric values between 0.0 and 1.0
Default counts: non-negative numbers or None (negative values are not allowed)
Volume counts: non-negative numbers or None (negative values are not allowed)
Adequate sample sizes for chi-square test validity
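
For pre-aggregated data, each row's contribution to the statistic can be computed directly from the three summary columns. The Python sketch below assumes each summary row already represents one band, which may differ from how the metric regroups rows when bands is set. The name hl_statistic_from_summary is hypothetical, and skipping rows with None or zero counts is also an assumption.

```python
def hl_statistic_from_summary(rows):
    """Sum (observed - expected)^2 / variance over pre-aggregated rows.

    Each row is a (mean_pd, defaults, volume) tuple.
    """
    stat = 0.0
    for mean_pd, defaults, volume in rows:
        if mean_pd is None or defaults is None or volume is None or volume == 0:
            continue  # assumption: rows with missing or empty counts are ignored
        if defaults < 0 or volume < 0 or defaults > volume:
            raise ValueError("counts must satisfy 0 <= defaults <= volume")
        if not 0.0 < mean_pd < 1.0:
            continue  # zero variance: the row cannot contribute a finite term
        expected = volume * mean_pd
        stat += (defaults - expected) ** 2 / (expected * (1.0 - mean_pd))
    return stat
```

For example, a single row with mean_pd 0.2, 30 defaults, and a volume of 100 has 20 expected defaults and a variance of 16, so it contributes (30 - 20)^2 / 16 = 6.25.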

Interpretation Guidelines

P-value Interpretation

p > 0.05: No significant evidence of miscalibration (fail to reject good fit)
0.01 < p ≤ 0.05: Model calibration is questionable (reject good fit at the 5% level)
p ≤ 0.01: Model calibration is poor (strong evidence against good fit)

Test Statistic

Higher hl_statistic values indicate worse calibration
Under the null hypothesis of good fit, the statistic approximately follows a chi-square distribution with (bands - 2) degrees of freedom
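
For reference, the textbook form of the statistic over G bands is shown below, where O_g is the observed number of defaults in band g, n_g is the number of observations in the band, and \bar{p}_g is the band's mean predicted probability. This is the standard definition; whether the metric applies exactly this form is not stated here.

```latex
\mathrm{HL} \;=\; \sum_{g=1}^{G}
  \frac{\left(O_g - n_g\,\bar{p}_g\right)^{2}}
       {n_g\,\bar{p}_g\,\left(1 - \bar{p}_g\right)},
\qquad
\mathrm{HL} \;\overset{\text{approx.}}{\sim}\; \chi^{2}_{G-2}
\quad \text{under good fit}
```

Each term compares observed with expected defaults in one band, scaled by the binomial variance of that band, so bands that deviate most from their predicted rate dominate the total.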

Important Notes

Sample Size: The chi-square approximation is unreliable when bands contain few observations; aim for at least 10 × bands observations
Band Selection: More bands provide finer resolution but require larger samples
Multiple Testing: When testing multiple segments, consider multiple testing corrections
Data Quality: Remove observations with missing or invalid probability scores
Model Assumptions: Test assumes the logistic regression model form is appropriate
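
As an illustration of the multiple-testing note, the Python sketch below applies a Holm-Bonferroni adjustment to p-values collected from several segments. Holm is one common choice, named here as an example rather than a recommendation from the tool; holm_adjust is a hypothetical helper, and the segment labels and p-values in the usage note are made-up inputs.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values keyed like the input dict."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_max = 0.0
    for rank, (key, p) in enumerate(ordered):
        # Scale the smallest p-value by m, the next by m - 1, and so on,
        # cap at 1, and keep the sequence non-decreasing.
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[key] = running_max
    return adjusted
```

For instance, holm_adjust({"retail": 0.01, "auto": 0.04, "mortgage": 0.03}) scales the three p-values by 3, 2, and 1 in ascending order; a segment is then flagged only if its adjusted value is at or below the chosen threshold.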