# T-Test Metric

The ttest metric performs a one-sample t-test to determine whether the mean difference between observed and predicted values differs significantly from a specified null hypothesis mean. The test produces both the t-statistic and the p-value for hypothesis testing in model validation.

Metric Type: `ttest`
## T-Test Calculation

The t-test statistic is calculated as:

    t = (sample_mean - null_mean) / (sample_std / sqrt(n))

Where:

- sample_mean = Mean of the differences (observed - predicted)
- null_mean = Null hypothesis mean (default: 0.0)
- sample_std = Sample standard deviation of the differences (with Bessel's correction)
- n = Sample size

The p-value is calculated from the t-distribution with n - 1 degrees of freedom, giving the probability of observing a t-statistic at least as extreme as the one computed, under the null hypothesis.
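The calculation above can be sketched in Python. This is a minimal illustration under the stated formula, not the metric's actual implementation; the function name is hypothetical, and it assumes NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

def one_sample_ttest(observed, predicted, null_mean=0.0):
    """One-sample t-test on the differences (observed - predicted)."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    n = diff.size
    sample_mean = diff.mean()
    sample_std = diff.std(ddof=1)  # Bessel's correction: divide by n - 1
    if sample_std == 0.0:
        # Zero variance: t-statistic is undefined (the metric returns null here)
        return None, None, sample_mean
    t_stat = (sample_mean - null_mean) / (sample_std / np.sqrt(n))
    # Two-tailed p-value from the t-distribution with n - 1 degrees of freedom
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value, sample_mean

t_stat, p_value, mean_diff = one_sample_ttest(
    observed=[10.2, 9.8, 11.1, 10.5, 9.9],
    predicted=[10.0, 10.0, 10.0, 10.0, 10.0],
)
```

As a sanity check, `scipy.stats.ttest_1samp(observed - predicted, null_mean)` computes the same statistic directly.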
## Configuration Fields

### Record-Level Data Format

For individual observation records:

    metrics:
      model_ttest:
        metric_type: "ttest"
        config:
          name: ["prediction_test"]
          data_format: "record_level"
          observed: "observed_values"    # Column with observed/actual values
          predicted: "predicted_values"  # Column with predicted values
          null_hypothesis_mean: 0.0      # Optional: null hypothesis mean (default: 0.0)
          segment: [["model_version"]]   # Optional: segmentation columns
          dataset: "predictions"
### Summary-Level Data Format

For pre-aggregated difference statistics:

    metrics:
      summary_ttest:
        metric_type: "ttest"
        config:
          name: ["aggregated_ttest"]
          data_format: "summary_level"
          volume: "observation_count"             # Column with observation counts
          sum_differences: "sum_diff"             # Column with sum of differences
          sum_squared_differences: "sum_sq_diff"  # Column with sum of squared differences
          null_hypothesis_mean: 0.0               # Optional: null hypothesis mean (default: 0.0)
          segment: [["data_source"]]              # Optional: segmentation columns
          dataset: "difference_summary"
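For summary-level data, the sample mean and variance can be recovered from the raw sums alone. A hedged sketch of that algebra (the function name is illustrative, not part of the metric's API; SciPy supplies the t-distribution):

```python
import math
from scipy import stats

def ttest_from_summary(volume, sum_diff, sum_sq_diff, null_mean=0.0):
    """One-sample t-test recovered from pre-aggregated difference statistics."""
    n = volume
    mean_diff = sum_diff / n
    # Sample variance with Bessel's correction, from raw sums:
    #   var = (sum(d^2) - n * mean^2) / (n - 1)
    var = (sum_sq_diff - n * mean_diff ** 2) / (n - 1)
    if var <= 0.0:
        # Zero variance: t-statistic is undefined (the metric returns null here)
        return None, None, mean_diff
    t_stat = (mean_diff - null_mean) / math.sqrt(var / n)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value, mean_diff

# Aggregates equivalent to the differences [0.2, -0.2, 1.1, 0.5, -0.1]
t_stat, p_value, mean_diff = ttest_from_summary(
    volume=5, sum_diff=1.5, sum_sq_diff=1.55,
)
```

Because only `n`, `sum(d)`, and `sum(d^2)` are needed, summary rows can be aggregated per segment upstream without losing any information the test requires.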
## Required Fields by Format

### Record-Level Required

- `name`: Metric name(s)
- `data_format`: Must be "record_level"
- `observed`: Observed values column name
- `predicted`: Predicted values column name
- `dataset`: Dataset reference

### Summary-Level Required

- `name`: Metric name(s)
- `data_format`: Must be "summary_level"
- `volume`: Volume count column name
- `sum_differences`: Sum of differences column name
- `sum_squared_differences`: Sum of squared differences column name
- `dataset`: Dataset reference

## Optional Fields

- `segment`: List of column names for grouping
- `null_hypothesis_mean`: Null hypothesis mean value (default: 0.0)
## Output Columns

The metric produces the following output columns:

- `group_key`: Segmentation group identifier (struct of segment values)
- `volume`: Total number of observations
- `t_statistic`: Calculated t-statistic value
- `p_value`: Two-tailed p-value from the t-distribution
- `mean_difference`: Mean of the (observed - predicted) differences
## Fan-out Examples

### Single Configuration

    metrics:
      basic_ttest:
        metric_type: "ttest"
        config:
          name: ["model_ttest"]
          data_format: "record_level"
          observed: "actual_values"
          predicted: "predicted_values"
          dataset: "validation_data"
### Segmented Analysis

    metrics:
      segmented_ttest:
        metric_type: "ttest"
        config:
          name: ["regional_ttest", "product_ttest"]
          data_format: "record_level"
          observed: "observed_values"
          predicted: "predicted_values"
          segment: [["region"], ["product_type"]]
          dataset: "performance_data"
### Custom Null Hypothesis

    metrics:
      hypothesis_ttest:
        metric_type: "ttest"
        config:
          name: ["bias_test"]
          data_format: "record_level"
          observed: "actual_sales"
          predicted: "forecast_sales"
          null_hypothesis_mean: 1000.0  # Testing if mean difference equals 1000
          dataset: "sales_data"
### Mixed Data Formats

    metrics:
      detailed_ttest:
        metric_type: "ttest"
        config:
          name: ["record_level_ttest"]
          data_format: "record_level"
          observed: "actual"
          predicted: "predicted"
          dataset: "detailed_data"
      summary_ttest:
        metric_type: "ttest"
        config:
          name: ["summary_ttest"]
          data_format: "summary_level"
          volume: "count"
          sum_differences: "sum_diff"
          sum_squared_differences: "sum_sq_diff"
          dataset: "summary_data"
## Data Requirements

### Record-Level Data
- One row per observation
- Observed column: numeric values
- Predicted column: numeric values
- Minimum 2 observations required for statistical validity
### Summary-Level Data
- One row per group/segment
- Volume counts: positive integers ≥ 2
- Sum of differences: numeric values (can be positive, negative, or zero)
- Sum of squared differences: non-negative numbers
## T-Test Interpretation

### T-Statistic

- t > 0: Observed values tend to be higher than predicted
- t < 0: Observed values tend to be lower than predicted
- t = 0: No systematic difference between observed and predicted
- |t| > 2: Rough rule of thumb for statistical significance at α ≈ 0.05 with moderate-to-large samples
### P-Value Interpretation
- p < 0.001: Very strong evidence against null hypothesis
- p < 0.01: Strong evidence against null hypothesis
- p < 0.05: Moderate evidence against null hypothesis (common significance level)
- p < 0.10: Weak evidence against null hypothesis
- p ≥ 0.10: Insufficient evidence against null hypothesis
### Statistical Significance

Choose the significance level (α) before testing:

- α = 0.05: Standard significance level
- α = 0.01: Conservative significance level
- α = 0.10: Liberal significance level

If the p-value ≤ α, reject the null hypothesis.
## Use Cases

### Model Bias Detection

Test whether predictions are systematically biased:

    bias_test:
      metric_type: "ttest"
      config:
        name: ["bias_detection"]
        data_format: "record_level"
        observed: "actual_revenue"
        predicted: "model_prediction"
        null_hypothesis_mean: 0.0  # Test for zero bias
### Model Comparison

Test whether one model systematically outperforms another by comparing their per-record errors:

    model_comparison:
      metric_type: "ttest"
      config:
        name: ["model_difference_test"]
        data_format: "record_level"
        observed: "model_a_errors"
        predicted: "model_b_errors"
        null_hypothesis_mean: 0.0  # Test for equal performance
### Forecast Validation

Validate forecast accuracy against business targets:

    forecast_validation:
      metric_type: "ttest"
      config:
        name: ["forecast_test"]
        data_format: "record_level"
        observed: "actual_demand"
        predicted: "forecast_demand"
        null_hypothesis_mean: 100.0  # Test against target difference
## Important Notes
- Sample Size: Requires at least 2 observations; results are more reliable with larger samples
- Normality Assumption: T-test assumes differences are approximately normally distributed
- Independence: Observations should be independent of each other
- Two-Tailed Test: The implementation performs two-tailed tests by default
- Zero Variance: When all differences are identical, t-statistic is undefined (returns null)
- Effect Size: Consider practical significance alongside statistical significance
- Multiple Testing: When performing multiple t-tests, consider correction for multiple comparisons
- Outliers: T-tests can be sensitive to outliers; consider data cleaning
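As an illustration of the multiple-testing note above, a Bonferroni correction (one common choice among several; the function name is hypothetical) can be sketched as:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value clears the Bonferroni threshold.

    With m tests, each p-value is compared against alpha / m, which keeps
    the family-wise error rate at or below alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

# Four segment-level t-tests; only the first survives correction (0.05 / 4 = 0.0125)
decisions = bonferroni_reject([0.001, 0.02, 0.04, 0.30], alpha=0.05)
# decisions == [True, False, False, False]
```

Note that p-values between α/m and α (such as 0.02 and 0.04 above) would look significant in isolation but fail after correction.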