Hosmer-Lemeshow Test Metric

The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.

Metric Type: hosmer_lemeshow

Test Overview

The Hosmer-Lemeshow test evaluates model calibration by:

  1. Dividing data into probability bands (default: 10 bands)
  2. Comparing observed vs expected default frequencies in each band
  3. Computing a chi-square test statistic
  4. Calculating a p-value for the goodness-of-fit

Low p-values (≤ 0.05) suggest poor model calibration.
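
The four steps above can be sketched in plain Python. This is a minimal illustration, not the metric's actual implementation: the function names hosmer_lemeshow and chi2_sf are hypothetical, and the sketch assumes equal-frequency bands formed by sorting on the predicted probability, since the banding scheme is not stated here. It uses only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum exp(-z) * sum_{j < df/2} z**j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: start from Q(1/2, z) = erfc(sqrt(z)) and apply the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    total = math.erfc(math.sqrt(half))
    a = 0.5
    while a < df / 2.0:
        total += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return total


def hosmer_lemeshow(probs, defaults, bands=10):
    """Return (statistic, p_value) for record-level probabilities and 0/1 outcomes."""
    if bands < 2:
        raise ValueError("bands must be at least 2")
    if not probs or len(probs) != len(defaults):
        raise ValueError("probs and defaults must be non-empty and equally long")

    # Step 1: sort by predicted probability and cut into equal-frequency bands
    # (an assumption of this sketch).
    rows = sorted(zip(probs, defaults), key=lambda row: row[0])
    n = len(rows)

    statistic = 0.0
    for g in range(bands):
        band = rows[g * n // bands:(g + 1) * n // bands]
        if not band:
            continue
        size = len(band)
        # Step 2: observed vs expected defaults in the band.
        observed = sum(d for _, d in band)
        expected = sum(p for p, _ in band)
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance <= 0:
            continue  # mean probability of exactly 0 or 1 gives no finite term
        # Step 3: accumulate the chi-square statistic.
        statistic += (observed - expected) ** 2 / variance

    # Step 4: p-value from the chi-square upper tail with (bands - 2) degrees of freedom.
    df = bands - 2
    p_value = chi2_sf(statistic, df) if df > 0 else float("nan")
    return statistic, p_value
```

With bands set to 2 the degrees of freedom are zero, so this sketch returns NaN for the p-value rather than guessing a convention.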

Configuration Fields

Record-Level Data Format

For individual loan/account records:

collections:
  calibration_test:
    metrics:
    - name:
      - model_calibration
      data_format: record
      prob_def: predicted_probability
      default: default_flag
      bands: 10
      segment:
      - - product_type
      metric_type: hosmer_lemeshow
    dataset: loan_portfolio

Summary-Level Data Format

For pre-aggregated data:

collections:
  summary_calibration:
    metrics:
    - name:
      - aggregated_calibration
      data_format: summary
      mean_pd: avg_probability
      defaults: default_count
      volume: total_count
      bands: 8
      segment:
      - - risk_grade
      metric_type: hosmer_lemeshow
    dataset: risk_summary

Required Fields by Format

Record-Level Required

  • name: Metric name(s)
  • data_format: Must be "record"
  • prob_def: Probability column name
  • default: Default indicator column name
  • dataset: Dataset reference

Summary-Level Required

  • name: Metric name(s)
  • data_format: Must be "summary"
  • mean_pd: Mean probability column name
  • defaults: Default count column name
  • volume: Volume count column name
  • dataset: Dataset reference

Optional Fields

  • segment: List of segment definitions, one per metric name; each is a list of column names for grouping, or null for no segmentation
  • bands: Number of probability bands (default: 10, minimum: 2)

Output Columns

The metric produces the following output columns:

  • group_key: Segmentation group identifier (struct of segment values)
  • volume: Total number of observations
  • defaults: Total number of defaults
  • pd: Mean predicted default probability
  • bands: Number of bands used in the test
  • hl_statistic: Hosmer-Lemeshow chi-square test statistic
  • p_value: P-value for the test (lower indicates worse calibration)

Fan-out Examples

Multiple Calibration Tests

collections:
  calibration_suite:
    metrics:
    - name:
      - overall_calibration
      - product_calibration
      - vintage_calibration
      segment:
      - null
      - - product_type
      - - origination_year
      data_format: record
      prob_def: model_probability
      default: default_indicator
      bands: 10
      metric_type: hosmer_lemeshow
    dataset: validation_data

This creates three calibration tests:

  1. Overall model calibration
  2. Calibration by product type
  3. Calibration by origination vintage
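
One way to picture the fan-out is as a positional pairing of the name and segment lists, with every other field shared. The sketch below is an assumption about how the expansion behaves, inferred from the example above; expand_fan_out is a hypothetical helper, not part of the tool.

```python
def expand_fan_out(metric):
    """Expand one fan-out metric entry into a list of single-metric configurations."""
    names = metric["name"]
    # Assumption: a missing segment list means "no segmentation" for every name.
    segments = metric.get("segment") or [None] * len(names)
    if len(names) != len(segments):
        raise ValueError("name and segment lists must have the same length")

    # Every field other than name and segment is shared by all expanded metrics.
    shared = {k: v for k, v in metric.items() if k not in ("name", "segment")}
    return [dict(shared, name=n, segment=s) for n, s in zip(names, segments)]


# The calibration_suite entry from the example, written as a Python dict.
suite = {
    "name": ["overall_calibration", "product_calibration", "vintage_calibration"],
    "segment": [None, ["product_type"], ["origination_year"]],
    "data_format": "record",
    "prob_def": "model_probability",
    "default": "default_indicator",
    "bands": 10,
    "metric_type": "hosmer_lemeshow",
}

expanded = expand_fan_out(suite)
```

Under this reading, each expanded entry carries the same bands value, which is why a single fan-out entry cannot vary the band count.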

Different Band Configurations

collections:
  band_sensitivity:
    metrics:
    - name:
      - hl_5_bands
      - hl_10_bands
      - hl_20_bands
      segment:
      - null
      - null
      - null
      data_format: record
      prob_def: risk_score
      default: default_flag
      bands: 5
      metric_type: hosmer_lemeshow
    dataset: sensitivity_test_data

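The configuration above applies its single bands value (5) to all three metrics, so it does not actually vary the band count. To compare different band counts, define a separate metric configuration for each: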

collections:
  hl_5_bands:
    metrics:
    - name:
      - calibration_5_bands
      bands: 5
      data_format: record
      prob_def: risk_score
      default: default_flag
      metric_type: hosmer_lemeshow
    dataset: test_data
  hl_10_bands:
    metrics:
    - name:
      - calibration_10_bands
      bands: 10
      data_format: record
      prob_def: risk_score
      default: default_flag
      metric_type: hosmer_lemeshow
    dataset: test_data

Data Requirements

Record-Level Data

  • One row per loan/account
  • Probability column: numeric values between 0.0 and 1.0
  • Default column: binary values (0/1 or boolean)
  • Sufficient data for meaningful test (recommended: at least 10 × bands observations)
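
The record-level requirements above can be checked before the metric runs. The following is a rough sketch only: check_record_data is a hypothetical helper, and it takes the two columns as plain Python lists rather than reading from a dataset.

```python
import math


def check_record_data(probs, defaults, bands=10):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(probs) != len(defaults):
        problems.append("probability and default columns differ in length")

    for i, p in enumerate(probs):
        invalid = (
            p is None
            or isinstance(p, bool)
            or not isinstance(p, (int, float))
            or math.isnan(p)
            or not 0.0 <= p <= 1.0
        )
        if invalid:
            problems.append(f"row {i}: invalid probability {p!r}")

    for i, d in enumerate(defaults):
        # Accepts 0/1 and True/False, since True == 1 and False == 0 in Python.
        if d not in (0, 1):
            problems.append(f"row {i}: default flag {d!r} is not binary")

    # Recommended minimum from the list above: at least 10 x bands observations.
    if len(probs) < 10 * bands:
        problems.append(
            f"only {len(probs)} observations; at least {10 * bands} recommended"
        )
    return problems
```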

Summary-Level Data

  • One row per group/segment
  • Mean probabilities: numeric values between 0.0 and 1.0
  • Default counts: non-negative numbers or None (negative values not allowed)
  • Volume counts: non-negative numbers or None (negative values not allowed)
  • Adequate sample sizes for chi-square test validity
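
A common rule of thumb for chi-square validity is that each cell should have an expected count of at least 5. The sketch below flags summary rows that fall short. It is illustrative only: rows_with_low_expected_counts is a hypothetical helper, the dictionary keys simply reuse the configuration field names, and skipping rows with None values is an assumption of the sketch.

```python
def rows_with_low_expected_counts(rows, min_expected=5.0):
    """Return indices of summary rows whose expected counts fall below min_expected."""
    flagged = []
    for i, row in enumerate(rows):
        mean_pd, volume = row["mean_pd"], row["volume"]
        if mean_pd is None or volume is None:
            continue  # assumption: rows with missing values are left out of the check
        expected_defaults = volume * mean_pd
        expected_non_defaults = volume * (1.0 - mean_pd)
        if min(expected_defaults, expected_non_defaults) < min_expected:
            flagged.append(i)
    return flagged


# Made-up summary rows for illustration.
summary_rows = [
    {"mean_pd": 0.02, "defaults": 1, "volume": 100},
    {"mean_pd": 0.10, "defaults": 12, "volume": 100},
    {"mean_pd": None, "defaults": None, "volume": 50},
]
low_rows = rows_with_low_expected_counts(summary_rows)
```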

Interpretation Guidelines

P-value Interpretation

  • p > 0.05: Model calibration is acceptable (fail to reject good fit)
  • 0.01 < p ≤ 0.05: Model calibration is questionable (reject good fit at the 5% level)
  • p ≤ 0.01: Model calibration is poor (strong evidence against good fit)
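
For reporting, the thresholds above can be turned into labels. The helper name and the label strings below are illustrative, not part of the metric's output.

```python
def interpret_p_value(p_value):
    """Map a Hosmer-Lemeshow p-value to the interpretation bands listed above."""
    if p_value <= 0.01:
        return "poor"
    if p_value <= 0.05:
        return "questionable"
    return "acceptable"
```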

Test Statistic

  • Higher hl_statistic values indicate worse calibration
  • Statistic follows chi-square distribution with (bands - 2) degrees of freedom
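
For reference, the textbook form of the statistic is given below. Here G is the number of bands, n_g the number of observations in band g, O_g the observed defaults, and p̄_g the mean predicted probability in that band. Whether the metric computes exactly this form is not stated here, so treat it as the standard definition rather than a description of the implementation.

```latex
H = \sum_{g=1}^{G} \frac{\left(O_g - n_g \bar{p}_g\right)^2}{n_g \, \bar{p}_g \left(1 - \bar{p}_g\right)},
\qquad
p\text{-value} = \Pr\!\left(\chi^2_{G-2} \ge H\right)
```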

Important Notes

  1. Sample Size: Results are unreliable when bands contain too few observations (see the 10 × bands guideline under Data Requirements)
  2. Band Selection: More bands provide finer resolution but require larger samples
  3. Multiple Testing: When testing multiple segments, consider multiple testing corrections
  4. Data Quality: Remove observations with missing or invalid probability scores
  5. Model Assumptions: Test assumes the logistic regression model form is appropriate
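
As an illustration of note 3, a Bonferroni correction is one simple way to adjust segment-level p-values: each p-value is multiplied by the number of tests and capped at 1. The segment names and p-values below are made up, and bonferroni_adjust is a hypothetical helper; less conservative corrections exist and may suit some portfolios better.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values for a {segment: p_value} mapping."""
    m = len(p_values)
    return {segment: min(1.0, p * m) for segment, p in p_values.items()}


# Made-up p-values for three segments.
raw = {"mortgage": 0.01, "card": 0.04, "auto": 0.5}
adjusted = bonferroni_adjust(raw)

# A segment is flagged only if its adjusted p-value is at or below 0.05.
flagged = sorted(s for s, p in adjusted.items() if p <= 0.05)
```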