Hosmer-Lemeshow Test Metric

The hosmer_lemeshow metric performs a goodness-of-fit test for logistic regression models by comparing observed and expected frequencies across probability bands.

Metric Type: hosmer_lemeshow

Test Overview

The Hosmer-Lemeshow test evaluates model calibration by:

  1. Dividing data into probability bands (default: 10 bands)
  2. Comparing observed vs expected default frequencies in each band
  3. Computing a chi-square test statistic
  4. Calculating a p-value for the goodness-of-fit

Low p-values (< 0.05) suggest poor model calibration.
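The four steps above can be sketched in Python. This is an illustrative implementation, not the framework's internal code; the equal-frequency banding and the pooled-variance denominator are standard for the Hosmer-Lemeshow test, and `scipy.stats.chi2` supplies the p-value:

```python
import numpy as np
from scipy.stats import chi2


def hosmer_lemeshow(probs, defaults, bands=10):
    """Sketch of the HL test.

    probs: predicted default probabilities in [0, 1]
    defaults: 0/1 default indicators
    Returns (hl_statistic, pvalue).
    """
    probs = np.asarray(probs, dtype=float)
    defaults = np.asarray(defaults, dtype=float)

    # 1. Divide observations into equal-frequency probability bands.
    order = np.argsort(probs)
    hl = 0.0
    for idx in np.array_split(order, bands):
        n = len(idx)
        observed = defaults[idx].sum()   # observed defaults in the band
        expected = probs[idx].sum()      # expected defaults = sum of PDs
        # 2-3. Chi-square contribution: (O - E)^2 / (n * pbar * (1 - pbar)),
        # where pbar = expected / n is the band's mean predicted PD.
        hl += (observed - expected) ** 2 / (expected * (1 - expected / n))

    # 4. P-value against chi-square with (bands - 2) degrees of freedom.
    return hl, chi2.sf(hl, df=bands - 2)
```

For well-calibrated predictions the statistic stays small and the p-value stays comfortably above 0.05.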

Configuration Fields

Record-Level Data Format

For individual loan/account records:

metrics:
  calibration_test:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["model_calibration"]
      data_format: "record_level"
      prob_def: "predicted_probability" # Column with predicted probabilities (0.0-1.0)
      default: "default_flag" # Column with default indicators (0/1 or boolean)
      bands: 10 # Number of probability bands (default: 10)
      segment: [["product_type"]] # Optional: segmentation columns
      dataset: "loan_portfolio"

Summary-Level Data Format

For pre-aggregated data:

metrics:
  summary_calibration:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["aggregated_calibration"]
      data_format: "summary_level"
      mean_pd: "avg_probability" # Column with mean probabilities (0.0-1.0)
      defaults: "default_count" # Column with default counts
      volume: "total_count" # Column with total observation counts
      bands: 8 # Number of probability bands
      segment: [["risk_grade"]] # Optional: segmentation columns
      dataset: "risk_summary"

Required Fields by Format

Record-Level Required

  • name: Metric name(s)
  • data_format: Must be "record_level"
  • prob_def: Probability column name
  • default: Default indicator column name
  • dataset: Dataset reference

Summary-Level Required

  • name: Metric name(s)
  • data_format: Must be "summary_level"
  • mean_pd: Mean probability column name
  • defaults: Default count column name
  • volume: Volume count column name
  • dataset: Dataset reference

Optional Fields

  • segment: List of column names for grouping
  • bands: Number of probability bands (default: 10, minimum: 2)

Output Columns

The metric produces the following output columns:

  • group_key: Segmentation group identifier (struct of segment values)
  • volume: Total number of observations
  • defaults: Total number of defaults
  • pd: Mean predicted default probability
  • bands: Number of bands used in the test
  • hl_statistic: Hosmer-Lemeshow chi-square test statistic
  • pvalue: P-value for the test (lower indicates worse calibration)

Fan-out Examples

Multiple Calibration Tests

metrics:
  calibration_suite:
    metric_type: "hosmer_lemeshow"
    config:
      name:
        ["overall_calibration", "product_calibration", "vintage_calibration"]
      segment: [null, ["product_type"], ["origination_year"]]
      data_format: "record_level"
      prob_def: "model_probability"
      default: "default_indicator"
      bands: 10
      dataset: "validation_data"

This creates three calibration tests:

  1. Overall model calibration
  2. Calibration by product type
  3. Calibration by origination vintage
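One way to picture the fan-out: the parallel `name` and `segment` lists are paired position by position, yielding one test per entry. The actual expansion logic is internal to the framework; this sketch only mirrors the config above:

```python
# Parallel lists from the fan-out example above; each position yields
# one calibration test. A null segment means "no segmentation".
names = ["overall_calibration", "product_calibration", "vintage_calibration"]
segments = [None, ["product_type"], ["origination_year"]]

tests = [{"name": name, "segment": segment}
         for name, segment in zip(names, segments)]
```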

Different Band Configurations

metrics:
  band_sensitivity:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["hl_5_bands", "hl_10_bands", "hl_20_bands"]
      segment: [null, null, null]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      bands: 5 # Caution: this single value applies to all three tests; different band counts need separate metric configs
      dataset: "sensitivity_test_data"

For different band counts, you need separate metric configurations:

metrics:
  hl_5_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_5_bands"]
      bands: 5
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"

  hl_10_bands:
    metric_type: "hosmer_lemeshow"
    config:
      name: ["calibration_10_bands"]
      bands: 10
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      dataset: "test_data"

Data Requirements

Record-Level Data

  • One row per loan/account
  • Probability column: numeric values between 0.0 and 1.0
  • Default column: binary values (0/1 or boolean)
  • Sufficient data for meaningful test (recommended: at least 10 × bands observations)

Summary-Level Data

  • One row per group/segment
  • Mean probabilities: numeric values between 0.0 and 1.0
  • Default counts: non-negative numbers or None (negative values are not allowed)
  • Volume counts: non-negative numbers or None (negative values are not allowed)
  • Adequate sample sizes for chi-square test validity
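With pre-aggregated bands the statistic reduces to a simple sum: each row contributes `(defaults - volume * mean_pd)^2 / (volume * mean_pd * (1 - mean_pd))`. A minimal sketch, with arguments corresponding to the `mean_pd`, `defaults`, and `volume` columns:

```python
def hl_from_summary(mean_pd, defaults, volume):
    """HL statistic from pre-aggregated bands.

    mean_pd[i], defaults[i], volume[i] describe band i.
    """
    hl = 0.0
    for p, d, n in zip(mean_pd, defaults, volume):
        expected = n * p  # expected defaults in the band
        # Chi-square contribution with pooled-variance denominator.
        hl += (d - expected) ** 2 / (expected * (1 - p))
    return hl
```

When observed counts exactly match the expected counts in every band, the statistic is zero.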

Interpretation Guidelines

P-value Interpretation

  • p > 0.05: Model calibration is acceptable (fail to reject good fit)
  • 0.01 < p ≤ 0.05: Model calibration is questionable (reject good fit at the 5% level)
  • p ≤ 0.01: Model calibration is poor (strong evidence against good fit)

Test Statistic

  • Higher hl_statistic values indicate worse calibration
  • Statistic follows chi-square distribution with (bands - 2) degrees of freedom
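The degrees of freedom translate directly into a decision threshold. For the default 10 bands, a quick check with `scipy.stats.chi2` (assuming scipy is available in your environment):

```python
from scipy.stats import chi2

# With 10 bands the statistic is compared against a chi-square
# distribution with 10 - 2 = 8 degrees of freedom.
bands = 10
dof = bands - 2

# hl_statistic above this critical value implies p < 0.05.
critical_05 = chi2.ppf(0.95, dof)  # ~15.51
```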

Important Notes

  1. Sample Size: Test requires adequate sample size for reliable results
  2. Band Selection: More bands provide finer resolution but require larger samples
  3. Multiple Testing: When testing multiple segments, consider multiple testing corrections
  4. Data Quality: Remove observations with missing or invalid probability scores
  5. Model Assumptions: Test assumes the logistic regression model form is appropriate
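For note 4, a minimal pandas sketch of pre-test cleanup (column names are illustrative, matching the record-level examples above) that drops rows with missing or out-of-range probability scores:

```python
import pandas as pd

# Hypothetical raw input: one missing and one out-of-range probability.
df = pd.DataFrame({
    "predicted_probability": [0.20, None, 1.30, 0.70],
    "default_flag": [0, 1, 0, 1],
})

# Series.between() is inclusive at both ends and treats NaN as False,
# so both invalid rows are filtered out in one step.
clean = df[df["predicted_probability"].between(0.0, 1.0)]
```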