Jeffreys Test Metric¶
The jeffreys_test metric evaluates model calibration using a Bayesian approach with a Jeffreys prior (Beta(0.5, 0.5)), assessing whether predicted probabilities of default (PDs) are consistent with observed default rates.
Metric Type: jeffreys_test
Calibration Assessment¶
The Jeffreys test computes a two-tailed p-value by:
- Applying a Jeffreys prior Beta(0.5, 0.5), which adds 0.5 to both the default and non-default counts
- Forming the posterior distribution Beta(defaults + 0.5, non-defaults + 0.5)
- Evaluating the posterior CDF at the mean predicted PD
- Computing the p-value as 2 × min(F(x), 1 − F(x)), where F is the posterior Beta CDF and x is the mean predicted PD (see the sketch below)
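A minimal sketch of this computation using SciPy (the function name and signature here are illustrative, not part of this metric's API):

```python
from scipy.stats import beta

def jeffreys_p_value(mean_pd: float, defaults: int, volume: int) -> float:
    """Two-tailed Jeffreys test p-value for PD calibration (illustrative sketch)."""
    # Posterior under the Jeffreys prior Beta(0.5, 0.5):
    # Beta(defaults + 0.5, non-defaults + 0.5).
    f = beta.cdf(mean_pd, defaults + 0.5, volume - defaults + 0.5)
    return 2 * min(f, 1 - f)  # two-tailed: 2 * min(F(x), 1 - F(x))
```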
Configuration Fields¶
Record-Level Data Format¶
For individual loan/account records:
```yaml
collections:
  calibration_test:
    metrics:
      - name:
          - model_calibration
        data_format: record
        prob_def: predicted_probability
        default: default_flag
        segment:
          - - product_type
        metric_type: jeffreys_test
        dataset: loan_portfolio
```
Summary-Level Data Format¶
For pre-aggregated data:
```yaml
collections:
  summary_calibration:
    metrics:
      - name:
          - aggregated_calibration
        data_format: summary
        mean_pd: avg_probability
        defaults: default_count
        volume: total_count
        segment:
          - - risk_grade
        metric_type: jeffreys_test
        dataset: risk_summary
```
Required Fields by Format¶
Record-Level Required¶
- `name`: Metric name(s)
- `data_format`: Must be `"record"`
- `prob_def`: Probability column name
- `default`: Default indicator column name
- `dataset`: Dataset reference
Summary-Level Required¶
- `name`: Metric name(s)
- `data_format`: Must be `"summary"`
- `mean_pd`: Mean probability column name
- `defaults`: Default count column name
- `volume`: Volume count column name
- `dataset`: Dataset reference
Optional Fields¶
- `segment`: List of segmentation specs for fan-out; each entry is either `null` (no segmentation) or a list of column names to group by
Output Columns¶
The metric produces the following output columns:
- `group_key`: Segmentation group identifier (struct of segment values)
- `volume`: Total number of observations
- `defaults`: Total number of defaults
- `pd`: Mean predicted probability of default
- `p_value`: Jeffreys test p-value (0.0 to 1.0)
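To see how these columns fit together, here is a rough pandas sketch of the per-segment computation (illustrative only, not the library's actual implementation; it also leaves segment values as plain columns rather than packing them into a `group_key` struct):

```python
import numpy as np
import pandas as pd
from scipy.stats import beta

def jeffreys_by_segment(df, prob_col, default_col, segment_cols):
    """Compute volume, defaults, mean PD, and Jeffreys p-value per segment."""
    out = (
        df.groupby(segment_cols)
          .agg(volume=(default_col, "size"),
               defaults=(default_col, "sum"),
               pd=(prob_col, "mean"))  # mean predicted PD per segment
          .reset_index()
    )
    # Posterior Beta(defaults + 0.5, non-defaults + 0.5), evaluated at the mean PD.
    f = beta.cdf(out["pd"], out["defaults"] + 0.5, out["volume"] - out["defaults"] + 0.5)
    out["p_value"] = 2 * np.minimum(f, 1 - f)  # two-tailed p-value per segment
    return out
```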
Fan-out Examples¶
Multiple Calibration Tests¶
```yaml
collections:
  model_calibration:
    metrics:
      - name:
          - overall_calibration
          - segment_calibration
          - product_calibration
        segment:
          - null
          - - customer_segment
          - - product_type
        data_format: record
        prob_def: model_score
        default: default_indicator
        metric_type: jeffreys_test
        dataset: validation_data
```
This fans out into three calibration tests, pairing each name with the corresponding segment entry:
- Overall portfolio calibration
- Calibration by customer segment
- Calibration by product type
Mixed Data Formats¶
```yaml
collections:
  detailed_calibration:
    metrics:
      - name:
          - record_calibration
        data_format: record
        prob_def: probability
        default: default
        metric_type: jeffreys_test
        dataset: detailed_data
  summary_calibration:
    metrics:
      - name:
          - summary_calibration
        data_format: summary
        mean_pd: mean_prob
        defaults: def_count
        volume: vol_count
        metric_type: jeffreys_test
        dataset: summary_data
```
Interpretation¶
P-value Guidelines¶
- High p-value (≥ 0.05): Good calibration; predicted probabilities are consistent with observed default rates
- Low p-value (< 0.05): Poor calibration; a statistically significant gap between predicted and observed rates
- Very low p-value (< 0.01): Substantial miscalibration
Calibration Quality¶
- Well-calibrated models typically have p-values ≥ 0.05
- Models requiring recalibration typically have p-values < 0.05
- P-values near 1.0 (posterior CDF near 0.5) indicate the closest agreement between the mean predicted PD and the observed default rate
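For intuition, a hypothetical example reusing the earlier sketch (the portfolio numbers are made up):

```python
from scipy.stats import beta

def jeffreys_p_value(mean_pd, defaults, volume):
    # Same sketch as above: two-tailed Jeffreys test p-value.
    f = beta.cdf(mean_pd, defaults + 0.5, volume - defaults + 0.5)
    return 2 * min(f, 1 - f)

# Hypothetical portfolio: 10,000 loans with 200 observed defaults (2.0% rate).
print(jeffreys_p_value(mean_pd=0.020, defaults=200, volume=10_000))  # high p-value: consistent
print(jeffreys_p_value(mean_pd=0.012, defaults=200, volume=10_000))  # p << 0.01: PDs understate risk
```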
Data Requirements¶
Record-Level Data¶
- One row per loan/account
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0/1 or boolean)
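For illustration, a minimal record-level input (hypothetical data; column names match the record-level config above):

```python
import pandas as pd

# One row per loan/account (hypothetical values).
loan_portfolio = pd.DataFrame({
    "predicted_probability": [0.02, 0.05, 0.10, 0.03],          # PDs in [0.0, 1.0]
    "default_flag":          [0, 0, 1, 0],                      # binary 0/1 indicator
    "product_type":          ["auto", "auto", "card", "card"],  # optional segment column
})
```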
Summary-Level Data¶
- One row per group/segment
- Mean probability: numeric values between 0.0 and 1.0
- Default counts: non-negative numbers or None (negative values are not allowed)
- Volume counts: non-negative numbers or None (negative values are not allowed)
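And a minimal summary-level input (hypothetical data; column names match the summary-level config above):

```python
import pandas as pd

# One row per group/segment (hypothetical values).
risk_summary = pd.DataFrame({
    "risk_grade":      ["A", "B", "C"],
    "avg_probability": [0.01, 0.04, 0.12],    # mean PD per grade, in [0.0, 1.0]
    "default_count":   [12, 85, 240],         # non-negative defaults per grade
    "total_count":     [1500, 2200, 2100],    # non-negative volumes per grade
})
```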