T-Test Metric¶
The ttest metric performs a one-sample t-test to determine if the mean difference between observed and predicted values is significantly different from a specified null hypothesis mean. This statistical test provides both the t-statistic and p-value for hypothesis testing in model validation.
Metric Type: ttest
T-Test Calculation¶
The t-test statistic is calculated as: t = (sample_mean - null_mean) / (sample_std / sqrt(n))
Where:
- sample_mean = Mean of the differences (observed - predicted)
- null_mean = Null hypothesis mean (default: 0.0)
- sample_std = Sample standard deviation of differences (with Bessel's correction)
- n = Sample size
The p-value is calculated from the t-distribution with n - 1 degrees of freedom and gives the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true.
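As a concrete illustration, the sketch below reproduces this calculation in Python for record-level data, using scipy for the two-tailed p-value. The function and variable names are illustrative and are not part of the metric's configuration or implementation.

import numpy as np
from scipy import stats

def one_sample_ttest(observed, predicted, null_mean=0.0):
    """Compute the t-statistic and two-tailed p-value for the mean of
    (observed - predicted) against a null hypothesis mean."""
    diffs = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    n = diffs.size
    sample_mean = diffs.mean()
    sample_std = diffs.std(ddof=1)                    # Bessel's correction
    t_stat = (sample_mean - null_mean) / (sample_std / np.sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-tailed
    return t_stat, p_value

# Example: test whether a set of predictions is unbiased (null mean = 0.0)
t_stat, p_value = one_sample_ttest(
    observed=[10.2, 9.8, 11.1, 10.5, 10.9],
    predicted=[10.0, 10.0, 10.0, 10.0, 10.0],
)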
Configuration Fields¶
Record-Level Data Format¶
For individual observation records:
collections:
  model_ttest:
    metrics:
      - name:
          - prediction_test
        data_format: record
        observed: observed_values
        predicted: predicted_values
        null_hypothesis_mean: 0.0
        segment:
          - - model_version
        metric_type: ttest
        dataset: predictions
Summary-Level Data Format¶
For pre-aggregated difference statistics:
collections:
  summary_ttest:
    metrics:
      - name:
          - aggregated_ttest
        data_format: summary
        volume: observation_count
        sum_differences: sum_diff
        sum_squared_differences: sum_sq_diff
        null_hypothesis_mean: 0.0
        segment:
          - - data_source
        metric_type: ttest
        dataset: difference_summary
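With summary-level data, the sample mean and standard deviation of the differences are recovered from the aggregates alone. A minimal sketch of that arithmetic, assuming the same t-test definition as above (the function name is illustrative):

import math
from scipy import stats

def ttest_from_summary(volume, sum_differences, sum_squared_differences,
                       null_hypothesis_mean=0.0):
    """Compute the t-statistic and two-tailed p-value from aggregated differences."""
    n = volume
    sample_mean = sum_differences / n
    # sum((d - mean)^2) = sum(d^2) - n * mean^2, with Bessel's correction
    sample_var = (sum_squared_differences - n * sample_mean ** 2) / (n - 1)
    sample_std = math.sqrt(sample_var)
    t_stat = (sample_mean - null_hypothesis_mean) / (sample_std / math.sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value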
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record"
- observed: Observed values column name
- predicted: Predicted values column name
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary"
- volume: Volume count column name
- sum_differences: Sum of differences column name
- sum_squared_differences: Sum of squared differences column name
- dataset: Dataset reference
Optional Fields¶
- segment: List of column names for grouping
- null_hypothesis_mean: Null hypothesis mean value (default: 0.0)
Output Columns¶
The metric produces the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- t_statistic: Calculated t-statistic value
- p_value: Two-tailed p-value from t-distribution
- mean_difference: Mean of (observed - predicted) differences
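For example, a single output row for one segment might look like the following (all values are purely illustrative):

group_key: {model_version: "v2"}
volume: 150
t_statistic: 2.31
p_value: 0.022
mean_difference: 4.7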
Fan-out Examples¶
Single Configuration¶
collections:
  basic_ttest:
    metrics:
      - name:
          - model_ttest
        data_format: record
        observed: actual_values
        predicted: predicted_values
        metric_type: ttest
        dataset: validation_data
Segmented Analysis¶
collections:
  segmented_ttest:
    metrics:
      - name:
          - regional_ttest
          - product_ttest
        data_format: record
        observed: observed_values
        predicted: predicted_values
        segment:
          - - region
          - - product_type
        metric_type: ttest
        dataset: performance_data
Custom Null Hypothesis¶
collections:
  hypothesis_ttest:
    metrics:
      - name:
          - bias_test
        data_format: record
        observed: actual_sales
        predicted: forecast_sales
        null_hypothesis_mean: 1000.0
        metric_type: ttest
        dataset: sales_data
Mixed Data Formats¶
collections:
  detailed_ttest:
    metrics:
      - name:
          - record_ttest
        data_format: record
        observed: actual
        predicted: predicted
        metric_type: ttest
        dataset: detailed_data
  summary_ttest:
    metrics:
      - name:
          - summary_ttest
        data_format: summary
        volume: count
        sum_differences: sum_diff
        sum_squared_differences: sum_sq_diff
        metric_type: ttest
        dataset: summary_data
Data Requirements¶
Record-Level Data¶
- One row per observation
- Observed column: numeric values
- Predicted column: numeric values
- Minimum 2 observations required for statistical validity
Summary-Level Data¶
- One row per group/segment
- Volume counts: positive integers ≥ 2
- Sum of differences: numeric values (can be positive, negative, or zero)
- Sum of squared differences: non-negative numbers
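A quick pre-flight check along these lines can catch unusable inputs before the metric runs. The sketch below is illustrative only and is not part of the metric implementation:

def validate_summary_row(volume, sum_differences, sum_squared_differences):
    """Basic sanity checks for a summary-level input row."""
    if volume < 2:
        raise ValueError("volume must be at least 2 for a t-test")
    if sum_squared_differences < 0:
        raise ValueError("sum of squared differences cannot be negative")
    # By the Cauchy-Schwarz inequality, sum(d^2) >= (sum d)^2 / n;
    # anything smaller indicates mutually inconsistent aggregates.
    if sum_squared_differences * volume < sum_differences ** 2:
        raise ValueError("aggregates are mutually inconsistent")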
T-Test Interpretation¶
T-Statistic¶
For the default null_hypothesis_mean of 0:
- t > 0: Observed values tend to be higher than predicted
- t < 0: Observed values tend to be lower than predicted
- t ≈ 0: Little or no systematic difference between observed and predicted
- |t| > 2: Roughly the threshold for statistical significance at α ≈ 0.05 with moderate-to-large samples (rule of thumb)
P-Value Interpretation¶
- p < 0.001: Very strong evidence against null hypothesis
- p < 0.01: Strong evidence against null hypothesis
- p < 0.05: Moderate evidence against null hypothesis (common significance level)
- p < 0.10: Weak evidence against null hypothesis
- p ≥ 0.10: Insufficient evidence against null hypothesis
Statistical Significance¶
Choose a significance level (α) before testing:
- α = 0.05: Standard significance level
- α = 0.01: Conservative significance level
- α = 0.10: Liberal significance level
If the p-value is less than or equal to α, reject the null hypothesis; otherwise, fail to reject it.
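In code, this decision is a single comparison of the reported p_value against the chosen α (illustrative Python, not part of the metric output):

def interpret(p_value, alpha=0.05):
    """Map a p-value to a decision at significance level alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

interpret(0.022)          # "reject the null hypothesis" at alpha = 0.05
interpret(0.022, 0.01)    # "fail to reject the null hypothesis" at alpha = 0.01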
Use Cases¶
Model Bias Detection¶
Test if predictions are systematically biased:
collections:
  bias_test:
    dataset: model_data
    metrics:
      - metric_type: ttest
        data_format: record
        name: bias_detection
        observed: actual_revenue
        predicted: model_prediction
        null_hypothesis_mean: 0.0  # Test for zero bias
Model Comparison¶
Compare if one model systematically outperforms another:
collections:
  model_comparison:
    dataset: model_comparison_data
    metrics:
      - metric_type: ttest
        data_format: record
        name: model_difference_test
        observed: model_a_errors
        predicted: model_b_errors
        null_hypothesis_mean: 0.0  # Test for equal performance
Forecast Validation¶
Validate forecast accuracy against business targets:
collections:
  forecast_validation:
    dataset: forecast_data
    metrics:
      - metric_type: ttest
        data_format: record
        name: forecast_test
        observed: actual_demand
        predicted: forecast_demand
        null_hypothesis_mean: 100.0  # Test against target difference
Important Notes¶
- Sample Size: Requires at least 2 observations; results are more reliable with larger samples
- Normality Assumption: T-test assumes differences are approximately normally distributed
- Independence: Observations should be independent of each other
- Two-Tailed Test: The implementation performs two-tailed tests by default
- Zero Variance: When all differences are identical, t-statistic is undefined (returns null)
- Effect Size: Consider practical significance alongside statistical significance
- Multiple Testing: When performing multiple t-tests (for example, across many segments), consider a correction for multiple comparisons, such as the Bonferroni adjustment sketched after this list
- Outliers: T-tests can be sensitive to outliers; consider data cleaning
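A Bonferroni adjustment is the simplest such correction: it divides the family-wise α by the number of tests. The sketch below is illustrative and would be applied downstream of the metric; it is not a feature of the metric itself.

def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests

# e.g. testing 10 segments at a family-wise alpha of 0.05
per_test_alpha = bonferroni_alpha(0.05, 10)  # 0.005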