Hosmer-Lemeshow Test Metric

The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.

Metric Type: hosmer_lemeshow

Test Overview

The Hosmer-Lemeshow test evaluates model calibration by:

Dividing data into probability bands (default: 10 bands)
Comparing observed vs expected default frequencies in each band
Computing a chi-square test statistic
Calculating a p-value for the goodness-of-fit

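A low p-value (below 0.05 by convention) suggests poor model calibration: observed default rates differ from the predicted probabilities by more than chance alone would explain.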
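
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the metric's actual implementation: bands are formed as equal-count groups after sorting by predicted probability (the source does not say how bands are cut), the names `hosmer_lemeshow` and `chi2_sf` are hypothetical, and the p-value uses `bands - 2` degrees of freedom, which is why the sketch insists on at least 3 bands.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution.

    Uses the power series of the regularized lower incomplete gamma
    function; adequate for moderate x, not intended for extreme values.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10_000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def hosmer_lemeshow(probs, defaults, bands=10):
    """Return (hl_statistic, pvalue) for record-level data."""
    if bands < 3:
        raise ValueError("this sketch needs at least 3 bands (df = bands - 2)")
    if len(probs) != len(defaults):
        raise ValueError("probs and defaults must have the same length")
    # Step 1: sort by predicted probability and cut into equal-count bands
    # (assumption: the real metric may band differently).
    rows = sorted(zip(probs, defaults))
    n = len(rows)
    stat = 0.0
    for g in range(bands):
        band = rows[g * n // bands:(g + 1) * n // bands]
        size = len(band)
        if size == 0:
            raise ValueError("more bands than observations")
        # Step 2: observed vs expected defaults in the band.
        observed = sum(d for _, d in band)
        mean_p = sum(p for p, _ in band) / size
        expected = size * mean_p
        variance = size * mean_p * (1.0 - mean_p)
        if variance == 0:
            raise ValueError("band has mean probability 0 or 1")
        # Step 3: accumulate the chi-square statistic.
        stat += (observed - expected) ** 2 / variance
    # Step 4: p-value from the chi-square upper tail.
    return stat, chi2_sf(stat, bands - 2)
```

Calling `hosmer_lemeshow(probs, defaults, bands=10)` returns the statistic and its p-value as a tuple; a production implementation would more likely rely on a statistics library for the chi-square tail.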

Configuration Fields

Record-Level Data Format

For individual loan/account records:

metrics:
  calibration_test:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["model_calibration"]
      data_format: "record_level"
      prob_def: "predicted_probability" # Column with predicted probabilities (0.0-1.0)
      default: "default_flag" # Column with default indicators (0/1 or boolean)
      bands: 10 # Number of probability bands (default: 10)
      segment: [["product_type"]] # Optional: segmentation columns
      dataset: "loan_portfolio"

Summary-Level Data Format

For pre-aggregated data:

metrics:
  summary_calibration:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["aggregated_calibration"]
      data_format: "summary_level"
      mean_pd: "avg_probability" # Column with mean probabilities (0.0-1.0)
      defaults: "default_count" # Column with default counts
      volume: "total_count" # Column with total observation counts
      bands: 8 # Number of probability bands
      segment: [["risk_grade"]] # Optional: segmentation columns
      dataset: "risk_summary"
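
To show how the two formats relate, the sketch below collapses record-level rows into summary rows that carry the column names used in the configuration above (`avg_probability`, `default_count`, `total_count`, grouped by `risk_grade`). The helper name `summarize`, the input record layout, and the sample loans are assumptions made for illustration; the metric itself only needs the aggregated columns.

```python
from collections import defaultdict


def summarize(records, group_col="risk_grade",
              prob_col="predicted_probability", default_col="default_flag"):
    """Collapse one-row-per-loan records into one row per group."""
    # Per group: [sum of probabilities, number of defaults, number of rows]
    sums = defaultdict(lambda: [0.0, 0, 0])
    for rec in records:
        acc = sums[rec[group_col]]
        acc[0] += rec[prob_col]
        acc[1] += int(rec[default_col])
        acc[2] += 1
    return [
        {
            group_col: key,
            "avg_probability": prob_sum / count,
            "default_count": n_defaults,
            "total_count": count,
        }
        for key, (prob_sum, n_defaults, count) in sorted(sums.items())
    ]


# Made-up example records, not real portfolio data.
loans = [
    {"risk_grade": "A", "predicted_probability": 0.02, "default_flag": 0},
    {"risk_grade": "A", "predicted_probability": 0.04, "default_flag": 0},
    {"risk_grade": "B", "predicted_probability": 0.20, "default_flag": 1},
    {"risk_grade": "B", "predicted_probability": 0.30, "default_flag": 0},
]
summary = summarize(loans)
```

Each resulting row holds the mean probability, default count, and volume for one risk grade, which is the shape the `mean_pd`, `defaults`, and `volume` fields point at.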

Required Fields by Format

Record-Level Required

name: Metric name(s)
data_format: Must be "record_level"
prob_def: Probability column name
default: Default indicator column name
dataset: Dataset reference

Summary-Level Required

name: Metric name(s)
data_format: Must be "summary_level"
mean_pd: Mean probability column name
defaults: Default count column name
volume: Volume count column name
dataset: Dataset reference

Optional Fields

segment: List of segment definitions, one per entry in name; each entry is a list of column names to group by, or null for no segmentation
bands: Number of probability bands (default: 10, minimum: 2)

Output Columns

The metric produces the following output columns:

group_key: Segmentation group identifier (struct of segment values)
volume: Total number of observations
defaults: Total number of defaults
pd: Mean predicted default probability
bands: Number of bands used in the test
hl_statistic: Hosmer-Lemeshow chi-square test statistic
pvalue: P-value for the test (lower values indicate stronger evidence of miscalibration)

Fan-out Examples

Multiple Calibration Tests

metrics:
  calibration_suite:
    metric_type: "hosmer_lemeshow"
    config:
      name:
        ["overall_calibration", "product_calibration", "vintage_calibration"]
      segment: [null, ["product_type"], ["origination_year"]]
      data_format: "record_level"
      prob_def: "model_probability"
      default: "default_indicator"
      bands: 10
      dataset: "validation_data"

This creates three calibration tests, pairing each name with the segment at the same position:

Overall model calibration
Calibration by product type
Calibration by origination vintage
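
The fan-out can be pictured as pairing each entry in `name` with the entry at the same position in `segment`, while scalar fields such as `bands` are shared by every test. The sketch below illustrates that reading of the configuration; `expand_fan_out` is a hypothetical helper, not part of the metric's API, and the positional pairing is inferred from the example above.

```python
def expand_fan_out(config):
    """Split a fan-out config into one flat config per metric name."""
    names = config["name"]
    # Assumption: a missing segment list means no segmentation for any name.
    segments = config.get("segment") or [None] * len(names)
    if len(segments) != len(names):
        raise ValueError("segment must have one entry per name")
    shared = {k: v for k, v in config.items() if k not in ("name", "segment")}
    return [
        {"name": name, "segment": segment, **shared}
        for name, segment in zip(names, segments)
    ]


config = {
    "name": ["overall_calibration", "product_calibration", "vintage_calibration"],
    "segment": [None, ["product_type"], ["origination_year"]],
    "data_format": "record_level",
    "prob_def": "model_probability",
    "default": "default_indicator",
    "bands": 10,
    "dataset": "validation_data",
}
tests = expand_fan_out(config)
```

Each element of `tests` describes one calibration test with a single name and a single segment definition; the shared fields are copied into all of them.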

Different Band Configurations

metrics:
  band_sensitivity:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["hl_5_bands", "hl_10_bands", "hl_20_bands"]
      segment: [null, null, null]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
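      bands: 5 # A single value: all three metrics run with 5 bands, whatever their names say
      dataset: "sensitivity_test_data"

Because bands takes a single value shared by every name in the list, the configuration above runs the same 5-band test three times. To compare band counts, define a separate metric configuration for each: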

metrics:
  hl_5_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_5_bands"]
      bands: 5
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"

  hl_10_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_10_bands"]
      bands: 10
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"

Data Requirements

Record-Level Data

One row per loan/account
Probability column: numeric values between 0.0 and 1.0
Default column: binary values (0/1 or boolean)
Sufficient data for meaningful test (recommended: at least 10 × bands observations)

Summary-Level Data

One row per group/segment
Mean probabilities: numeric values between 0.0 and 1.0
Default counts: non-negative numbers or None (negative values not allowed)
Volume counts: non-negative numbers or None (negative values not allowed)
Adequate sample sizes for chi-square test validity
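
The requirements above lend themselves to a quick pre-flight check before the metric runs. The following is a minimal sketch, assuming the inputs have already been pulled into plain Python lists and values; the function names are hypothetical, and the check that defaults do not exceed volume is an extra sanity test that the source does not list.

```python
def check_record_level(probs, defaults, bands=10):
    """Return a list of problems found in record-level inputs."""
    problems = []
    if len(probs) != len(defaults):
        problems.append("probability and default columns differ in length")
    if any(p is None or not 0.0 <= p <= 1.0 for p in probs):
        problems.append("probabilities must lie in [0.0, 1.0]")
    if any(d not in (0, 1, True, False) for d in defaults):
        problems.append("defaults must be 0/1 or boolean")
    if len(probs) < 10 * bands:
        problems.append(f"fewer than {10 * bands} observations for {bands} bands")
    return problems


def check_summary_row(mean_pd, defaults, volume):
    """Return a list of problems found in one summary-level row."""
    problems = []
    if not 0.0 <= mean_pd <= 1.0:
        problems.append("mean probability must lie in [0.0, 1.0]")
    for label, value in (("defaults", defaults), ("volume", volume)):
        # None is allowed; only negative counts are rejected.
        if value is not None and value < 0:
            problems.append(f"{label} must not be negative")
    # Extra sanity check (assumption, not stated in the requirements).
    if defaults is not None and volume is not None and defaults > volume:
        problems.append("defaults exceed volume")
    return problems
```

An empty list means the inputs pass these checks; anything else names the requirement that was violated.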

Interpretation Guidelines

P-value Interpretation

p > 0.05: Model calibration is acceptable (fail to reject good fit; this is a lack of evidence against the model, not proof of good calibration)
0.01 < p ≤ 0.05: Model calibration is questionable (reject good fit at the 5% level)
p ≤ 0.01: Model calibration is poor (strong evidence against good fit)
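
These thresholds can be applied mechanically to the pvalue output column. A minimal sketch, assuming the three labels are taken directly from the guidelines above; `interpret_pvalue` is a hypothetical helper and not something the metric provides.

```python
def interpret_pvalue(pvalue):
    """Map a Hosmer-Lemeshow p-value to the guideline labels."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if pvalue <= 0.01:
        return "poor"
    if pvalue <= 0.05:
        return "questionable"
    return "acceptable"
```

Checking the stricter threshold first keeps the bands mutually exclusive, so a p-value of 0.005 is reported as "poor" rather than "questionable".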

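Test Statistic

Higher hl_statistic values indicate worse calibration
Under the null hypothesis of good calibration, the statistic approximately follows a chi-square distribution with (bands - 2) degrees of freedom, so an informative p-value needs at least 3 bands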

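Important Notes

Sample Size: The chi-square approximation is unreliable when bands hold few observations; with very large samples, even practically negligible miscalibration can produce a low p-value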
Band Selection: More bands provide finer resolution but require larger samples
Multiple Testing: When testing multiple segments, consider multiple testing corrections
Data Quality: Remove observations with missing or invalid probability scores
Model Assumptions: Test assumes the logistic regression model form is appropriate
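
For the multiple-testing note, one common option is the Holm-Bonferroni adjustment, sketched below. The source does not prescribe a correction method, so this is only one possible choice; `holm_adjust` is a hypothetical helper, and the segment names and p-values are made up for illustration.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Keep adjusted values non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Made-up per-segment p-values from three hypothetical calibration tests.
segment_pvalues = {"mortgage": 0.012, "credit_card": 0.030, "auto": 0.400}
adjusted = dict(zip(segment_pvalues, holm_adjust(list(segment_pvalues.values()))))
```

The adjusted values are then compared against the usual 0.05 threshold in place of the raw per-segment p-values.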