Jeffreys Test Metric

The jeffreys_test metric evaluates model calibration using a Bayesian approach with a Jeffreys prior (Beta(0.5, 0.5)) to assess whether predicted default probabilities are consistent with observed default rates.

Metric Type: jeffreys_test

Calibration Assessment

The Jeffreys test computes a two-tailed p-value by:

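Using a Jeffreys prior Beta(0.5, 0.5), which adds 0.5 to both the default and non-default counts
Creating a posterior distribution for the default rate: Beta(defaults + 0.5, non-defaults + 0.5), where non-defaults = volume - defaults
Evaluating the posterior CDF F at the mean predicted PD x, which locates the prediction within the posterior for the observed default rate
Calculating the p-value: 2 × min(F(x), 1-F(x)), where F is the Beta CDF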
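The steps above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the function names (beta_cdf, jeffreys_pvalue) are hypothetical, and the Beta CDF is evaluated with a standard continued-fraction expansion so the sketch needs only the standard library. In practice a statistics package's Beta CDF would normally be used instead.

```python
import math


def _beta_cf(a, b, x, max_iter=1000, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    delta = 1.0
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction at step m.
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where it converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def jeffreys_pvalue(mean_pd, defaults, volume):
    """Two-tailed p-value: 2 * min(F(x), 1 - F(x)) under the Jeffreys posterior."""
    a = defaults + 0.5               # prior adds 0.5 to defaults
    b = (volume - defaults) + 0.5    # prior adds 0.5 to non-defaults
    f = beta_cdf(mean_pd, a, b)
    return 2.0 * min(f, 1.0 - f)


# Illustrative inputs only: 12 defaults out of 1000 accounts, mean PD of 2%.
print(jeffreys_pvalue(mean_pd=0.02, defaults=12, volume=1000))
```

When the posterior is symmetric and the mean PD sits at its centre, for example 5 defaults out of 10 with a mean PD of 0.5, F(x) is 0.5 and the p-value reaches its maximum of 1.0 (up to floating-point error).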

Configuration Fields

Record-Level Data Format

For individual loan/account records:

metrics:
  calibration_test:
    metric_type: "jeffreys_test"
    config:
      name: ["model_calibration"]
      data_format: "record_level"
      prob_def: "predicted_probability" # Column with predicted probabilities (0.0-1.0)
      default: "default_flag" # Column with default indicators (0/1 or boolean)
      segment: [["product_type"]] # Optional: segmentation columns
      dataset: "loan_portfolio"
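To make the relationship between the two data formats concrete, the sketch below reduces record-level rows to the per-group quantities the test needs (volume, defaults, mean PD). It assumes record-level input is aggregated this way, which matches the output columns documented later but is not confirmed as the library's internal behaviour; the function name summarise_records and the in-memory list of dicts are hypothetical.

```python
from collections import defaultdict


def summarise_records(records, prob_def, default, segment=None):
    """Aggregate record-level rows into volume, defaults and mean PD per group."""
    groups = defaultdict(lambda: {"volume": 0, "defaults": 0, "pd_sum": 0.0})
    for row in records:
        # An empty key tuple means "no segmentation" (one overall group).
        key = tuple(row[col] for col in (segment or []))
        group = groups[key]
        group["volume"] += 1
        group["defaults"] += int(bool(row[default]))  # accepts 0/1 or boolean
        group["pd_sum"] += float(row[prob_def])
    return {
        key: {
            "volume": g["volume"],
            "defaults": g["defaults"],
            "pd": g["pd_sum"] / g["volume"],
        }
        for key, g in groups.items()
    }


# Illustrative rows using the column names from the configuration above.
rows = [
    {"predicted_probability": 0.1, "default_flag": 0, "product_type": "A"},
    {"predicted_probability": 0.3, "default_flag": 1, "product_type": "A"},
    {"predicted_probability": 0.5, "default_flag": True, "product_type": "B"},
]
by_product = summarise_records(
    rows, "predicted_probability", "default_flag", segment=["product_type"]
)
print(by_product)
```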

Summary-Level Data Format

For pre-aggregated data:

metrics:
  summary_calibration:
    metric_type: "jeffreys_test"
    config:
      name: ["aggregated_calibration"]
      data_format: "summary_level"
      mean_pd: "avg_probability" # Column with mean probabilities (0.0-1.0)
      defaults: "default_count" # Column with default counts
      volume: "total_count" # Column with total observation counts
      segment: [["risk_grade"]] # Optional: segmentation columns
      dataset: "risk_summary"

Required Fields by Format

Record-Level Required

name: Metric name(s)
data_format: Must be "record_level"
prob_def: Probability column name
default: Default indicator column name
dataset: Dataset reference

Summary-Level Required

name: Metric name(s)
data_format: Must be "summary_level"
mean_pd: Mean probability column name
defaults: Default count column name
volume: Volume count column name
dataset: Dataset reference

Optional Fields

segment: List of column-name lists used for grouping, e.g. [["product_type"]]; a null entry means no segmentation

Output Columns

The metric produces the following output columns:

group_key: Segmentation group identifier (struct of segment values)
volume: Total number of observations
defaults: Total number of defaults
pd: Mean predicted probability of default
pvalue: Jeffreys test p-value (0.0 to 1.0)

Fan-out Examples

Multiple Calibration Tests

metrics:
  model_calibration:
    metric_type: "jeffreys_test"
    config:
      name:
        ["overall_calibration", "segment_calibration", "product_calibration"]
      segment: [null, ["customer_segment"], ["product_type"]]
      data_format: "record_level"
      prob_def: "model_score"
      default: "default_indicator"
      dataset: "validation_data"

This creates three calibration tests:

Overall portfolio calibration
Calibration by customer segment
Calibration by product type
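One way to picture the fan-out is as a positional pairing of the name and segment lists, with every other field shared. The sketch below assumes that pairing, which is consistent with the three tests listed above but is not confirmed as the library's actual expansion logic; expand_fan_out is a hypothetical helper.

```python
def expand_fan_out(config):
    """Expand one fan-out config into one flat config per metric name."""
    names = config["name"]
    # Assumption: a missing segment list means "no segmentation" for every name.
    segments = config.get("segment") or [None] * len(names)
    if len(segments) != len(names):
        raise ValueError("name and segment must have the same number of entries")
    shared = {k: v for k, v in config.items() if k not in ("name", "segment")}
    return [dict(shared, name=n, segment=s) for n, s in zip(names, segments)]


# The configuration from the example above, written as a Python dict.
config = {
    "name": ["overall_calibration", "segment_calibration", "product_calibration"],
    "segment": [None, ["customer_segment"], ["product_type"]],
    "data_format": "record_level",
    "prob_def": "model_score",
    "default": "default_indicator",
    "dataset": "validation_data",
}
for test in expand_fan_out(config):
    print(test["name"], test["segment"])
```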

Mixed Data Formats

metrics:
  detailed_calibration:
    metric_type: "jeffreys_test"
    config:
      name: ["record_level_calibration"]
      data_format: "record_level"
      prob_def: "probability"
      default: "default"
      dataset: "detailed_data"

  summary_calibration:
    metric_type: "jeffreys_test"
    config:
      name: ["summary_calibration"]
      data_format: "summary_level"
      mean_pd: "mean_prob"
      defaults: "def_count"
      volume: "vol_count"
      dataset: "summary_data"

Interpretation

P-value Guidelines

High p-value (≥ 0.05): Good calibration - predicted probabilities consistent with observed rates
Low p-value (< 0.05): Poor calibration - significant difference between predicted and observed rates
Very low p-value (< 0.01): Very poor calibration - strong evidence of miscalibration
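These thresholds can be applied mechanically to the pvalue output column. The helper below is a hypothetical convenience, not part of the metric; it simply encodes the 0.05 and 0.01 cut-offs listed above.

```python
def interpret_pvalue(pvalue):
    """Map a Jeffreys test p-value to the guideline labels above."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError("p-value must lie between 0.0 and 1.0")
    if pvalue < 0.01:
        return "very poor calibration"
    if pvalue < 0.05:
        return "poor calibration"
    return "good calibration"


for p in (0.004, 0.03, 0.05, 0.62):
    print(p, interpret_pvalue(p))
```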

Calibration Quality

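Well-calibrated models have p-values ≥ 0.05
Models requiring recalibration typically have p-values < 0.05
P-values near 1.0 indicate the closest agreement: because the p-value is 2 × min(F(x), 1-F(x)), it peaks at 1.0 when the mean PD sits at the posterior median (F(x) = 0.5)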

Data Requirements

Record-Level Data

One row per loan/account
Probability column: numeric values between 0.0 and 1.0
Default column: binary values (0/1 or boolean)

Summary-Level Data

One row per group/segment
Mean probability: numeric values between 0.0 and 1.0
Default counts: non-negative numbers or None (negative values not allowed)
Volume counts: non-negative numbers or None (negative values not allowed)
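A pre-flight check along these lines can catch bad summary rows before the metric runs. This is a sketch with a hypothetical helper name; the final check (defaults not exceeding volume) is an added assumption that is not stated in the requirements above, included because the posterior needs a non-negative non-default count.

```python
def validate_summary_row(mean_pd, defaults, volume):
    """Return a list of problems found in one summary-level row (empty if none)."""
    errors = []
    if not 0.0 <= mean_pd <= 1.0:
        errors.append("mean_pd must be between 0.0 and 1.0")
    for label, value in (("defaults", defaults), ("volume", volume)):
        # None is permitted by the requirements; negative counts are not.
        if value is not None and value < 0:
            errors.append(f"{label} must not be negative")
    # Assumption (not from the requirements): defaults cannot exceed volume.
    if defaults is not None and volume is not None and defaults > volume:
        errors.append("defaults must not exceed volume")
    return errors


print(validate_summary_row(mean_pd=0.02, defaults=3, volume=100))
print(validate_summary_row(mean_pd=1.2, defaults=-1, volume=100))
```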