Jeffreys Test Metric

The jeffreys_test metric evaluates model calibration with a Bayesian approach: it uses the Jeffreys prior, Beta(0.5, 0.5), to assess whether predicted probabilities of default (PD) are consistent with observed default rates.

Metric Type: jeffreys_test

Calibration Assessment

The Jeffreys test computes a two-tailed p-value as follows:

  1. Start from the Jeffreys prior Beta(0.5, 0.5), which adds 0.5 to both the default and non-default counts
  2. Form the posterior distribution of the default rate: Beta(defaults + 0.5, non-defaults + 0.5), where non-defaults = volume - defaults
  3. Evaluate the posterior CDF at the mean PD: F(x), where x is the mean PD and F is the Beta CDF of the posterior
  4. Calculate the p-value: 2 × min(F(x), 1-F(x))
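
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the metric's actual implementation: `jeffreys_p_value`, `beta_cdf` and `_beta_cf` are hypothetical names, non-defaults are taken to be volume minus defaults, and the Beta CDF is computed with a standard-library continued-fraction routine so the block has no dependencies (in practice `scipy.stats.beta.cdf` would be the usual choice).

```python
import math


def _beta_cf(a: float, b: float, x: float, max_iter: int = 10_000, eps: float = 1e-14) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction at step m.
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for coef in (even, odd):
            d = 1.0 + coef * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coef / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x: float, a: float, b: float) -> float:
    """CDF of Beta(a, b) at x (the regularized incomplete beta function)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where the fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def jeffreys_p_value(mean_pd: float, defaults: float, volume: float) -> float:
    """Two-tailed Jeffreys test p-value: 2 * min(F(x), 1 - F(x))."""
    a = defaults + 0.5             # posterior alpha: defaults plus the prior's 0.5
    b = (volume - defaults) + 0.5  # posterior beta: non-defaults plus the prior's 0.5
    f = beta_cdf(mean_pd, a, b)    # F(x) with x = mean PD
    return 2.0 * min(f, 1.0 - f)
```

For example, `jeffreys_p_value(0.02, defaults=2, volume=100)` evaluates a mean PD of 2% against 2 observed defaults in 100 accounts.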

Configuration Fields

Record-Level Data Format

For individual loan/account records:

collections:
  calibration_test:
    metrics:
    - name:
      - model_calibration
      data_format: record
      prob_def: predicted_probability
      default: default_flag
      segment:
      - - product_type
      metric_type: jeffreys_test
    dataset: loan_portfolio

Summary-Level Data Format

For pre-aggregated data:

collections:
  summary_calibration:
    metrics:
    - name:
      - aggregated_calibration
      data_format: summary
      mean_pd: avg_probability
      defaults: default_count
      volume: total_count
      segment:
      - - risk_grade
      metric_type: jeffreys_test
    dataset: risk_summary

Required Fields by Format

Record-Level Required

  • name: Metric name(s)
  • data_format: Must be "record"
  • prob_def: Probability column name
  • default: Default indicator column name
  • dataset: Dataset reference

Summary-Level Required

  • name: Metric name(s)
  • data_format: Must be "summary"
  • mean_pd: Mean probability column name
  • defaults: Default count column name
  • volume: Volume count column name
  • dataset: Dataset reference

Optional Fields

  • segment: List of column names for grouping

Output Columns

The metric produces the following output columns:

  • group_key: Segmentation group identifier (struct of segment values)
  • volume: Total number of observations
  • defaults: Total number of defaults
  • pd: Mean predicted probability of default (PD)
  • p_value: Jeffreys test p-value (0.0 to 1.0)
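
To make the relationship between record-level input and these output columns concrete, here is a hedged sketch of the aggregation step. It assumes that record-level rows are reduced per segment group to a count, a default total and a mean PD. `summarize_records` is a hypothetical helper, and `group_key` is modelled as a plain dict rather than a struct. The p_value column is left out because it depends on the Beta CDF.

```python
from collections import defaultdict


def summarize_records(records, prob_col, default_col, segment_cols=()):
    """Reduce record-level rows to one (volume, defaults, pd) row per segment group."""
    totals = defaultdict(lambda: {"volume": 0, "defaults": 0, "pd_sum": 0.0})
    for row in records:
        # An empty segment list yields the key (), i.e. one overall group.
        key = tuple(row[col] for col in segment_cols)
        group = totals[key]
        group["volume"] += 1
        group["defaults"] += int(row[default_col])  # accepts 0/1 or booleans
        group["pd_sum"] += float(row[prob_col])
    return [
        {
            "group_key": dict(zip(segment_cols, key)),
            "volume": g["volume"],
            "defaults": g["defaults"],
            "pd": g["pd_sum"] / g["volume"],
        }
        for key, g in totals.items()
    ]
```

Calling `summarize_records(rows, "predicted_probability", "default_flag", ["product_type"])` would return one dict per product type, mirroring the record-level configuration shown earlier.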

Fan-out Examples

Multiple Calibration Tests

collections:
  model_calibration:
    metrics:
    - name:
      - overall_calibration
      - segment_calibration
      - product_calibration
      segment:
      - null
      - - customer_segment
      - - product_type
      data_format: record
      prob_def: model_score
      default: default_indicator
      metric_type: jeffreys_test
    dataset: validation_data

This creates three calibration tests:

  1. Overall portfolio calibration
  2. Calibration by customer segment
  3. Calibration by product type
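
One way to read the fan-out above is that the name and segment lists are paired by position, with every other field shared. The sketch below illustrates that reading. It is an assumption about the expansion rule, not the configuration parser itself, and `expand_fan_out` is a hypothetical name.

```python
def expand_fan_out(metric: dict) -> list:
    """Expand one fan-out metric entry into one plain dict per (name, segment) pair."""
    names = metric["name"]
    # With no segment list, every name is treated as an unsegmented (overall) test.
    segments = metric.get("segment") or [None] * len(names)
    if len(segments) != len(names):
        raise ValueError("name and segment lists must have the same length")
    shared = {k: v for k, v in metric.items() if k not in ("name", "segment")}
    return [{"name": n, "segment": s, **shared} for n, s in zip(names, segments)]


fan_out = {
    "name": ["overall_calibration", "segment_calibration", "product_calibration"],
    "segment": [None, ["customer_segment"], ["product_type"]],
    "data_format": "record",
    "prob_def": "model_score",
    "default": "default_indicator",
    "metric_type": "jeffreys_test",
}

for test in expand_fan_out(fan_out):
    print(test["name"], test["segment"])
```

Under this reading, a null segment entry corresponds to the overall portfolio test.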

Mixed Data Formats

collections:
  detailed_calibration:
    metrics:
    - name:
      - record_calibration
      data_format: record
      prob_def: probability
      default: default
      metric_type: jeffreys_test
    dataset: detailed_data
  summary_calibration:
    metrics:
    - name:
      - summary_calibration
      data_format: summary
      mean_pd: mean_prob
      defaults: def_count
      volume: vol_count
      metric_type: jeffreys_test
    dataset: summary_data

Interpretation

P-value Guidelines

  • High p-value (≥ 0.05): Good calibration - predicted probabilities consistent with observed rates
  • Low p-value (< 0.05): Poor calibration - significant difference between predicted and observed rates
  • Very low p-value (< 0.01): Very poor calibration - substantial miscalibration
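
The guideline bands above can be applied mechanically to the p_value output column. The helper below is a small illustrative sketch: `interpret_p_value` is a hypothetical name, and the labels simply restate the guideline wording.

```python
def interpret_p_value(p_value: float) -> str:
    """Map a Jeffreys test p-value to the guideline band it falls in."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0.0, 1.0]")
    if p_value < 0.01:
        return "very poor calibration"
    if p_value < 0.05:
        return "poor calibration"
    return "good calibration"
```

The stricter 0.01 band is checked first because any p-value below 0.01 is also below 0.05.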

Calibration Quality

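  • Well-calibrated models have p-values ≥ 0.05
  • Models requiring recalibration typically have p-values < 0.05
  • P-values near 1.0 indicate the closest agreement: the two-tailed p-value 2 × min(F(x), 1-F(x)) peaks at 1.0 when F(x) = 0.5, i.e. when the mean PD sits at the posterior median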

Data Requirements

Record-Level Data

  • One row per loan/account
  • Probability column: numeric values between 0.0 and 1.0
  • Default column: binary values (0/1 or boolean)

Summary-Level Data

  • One row per group/segment
  • Mean probability: numeric values between 0.0 and 1.0
  • Default counts: non-negative numbers or None (negative values not allowed)
  • Volume counts: non-negative numbers or None (negative values not allowed)
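
A pre-flight check for summary-level rows might look like the sketch below. It is illustrative only: `validate_summary_row` is a hypothetical helper, zero counts are assumed to be acceptable because only negative values are ruled out, and how the metric itself treats None counts is not covered here.

```python
def validate_summary_row(mean_pd, defaults, volume) -> list:
    """Return a list of problems found in one summary-level row (empty if none)."""
    problems = []
    # Mean probability must be numeric and lie in [0.0, 1.0].
    if mean_pd is None or not 0.0 <= mean_pd <= 1.0:
        problems.append("mean_pd must be a number between 0.0 and 1.0")
    # Counts may be None, but must not be negative when present.
    for label, count in (("defaults", defaults), ("volume", volume)):
        if count is not None and count < 0:
            problems.append(f"{label} must not be negative")
    return problems
```

An empty list means the row passed these checks. For example, `validate_summary_row(0.03, 12, 400)` reports no problems, while a negative default count or a mean PD above 1.0 each add one message.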