T-Test Metric

The ttest metric performs a one-sample t-test to determine whether the mean difference between observed and predicted values differs significantly from a specified null hypothesis mean. The test returns both a t-statistic and a p-value for hypothesis testing in model validation.

Metric Type: ttest

T-Test Calculation

The t-test statistic is calculated as: t = (sample_mean - null_mean) / (sample_std / sqrt(n))

Where:

  • sample_mean = Mean of the differences (observed - predicted)
  • null_mean = Null hypothesis mean (default: 0.0)
  • sample_std = Sample standard deviation of differences (with Bessel's correction)
  • n = Sample size

The p-value is calculated using the t-distribution with degrees of freedom = n - 1, giving the two-tailed probability of observing a t-statistic at least as extreme as the one computed, under the null hypothesis.
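
To make the computation concrete, here is a minimal Python sketch of the same formula. It is an illustration under stated assumptions, not the metric's actual implementation; the function name and the use of scipy for the t-distribution are illustrative choices.

import math
from scipy import stats

def one_sample_ttest(differences, null_mean=0.0):
    # Illustrative sketch only; not the metric's actual implementation.
    n = len(differences)
    if n < 2:
        raise ValueError("at least 2 observations are required")
    sample_mean = sum(differences) / n
    # Sample standard deviation with Bessel's correction (n - 1 denominator)
    sample_std = math.sqrt(
        sum((d - sample_mean) ** 2 for d in differences) / (n - 1)
    )
    if sample_std == 0.0:
        return None, None  # zero variance: t-statistic is undefined
    t_statistic = (sample_mean - null_mean) / (sample_std / math.sqrt(n))
    # Two-tailed p-value from the t-distribution with n - 1 degrees of freedom
    p_value = 2.0 * stats.t.sf(abs(t_statistic), df=n - 1)
    return t_statistic, p_value

With null_mean=0.0, the result agrees with scipy.stats.ttest_1samp(differences, popmean=0.0).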

Configuration Fields

Record-Level Data Format

For individual observation records:

metrics:
  model_ttest:
    metric_type: "ttest"
    config:
      name: ["prediction_test"]
      data_format: "record_level"
      observed: "observed_values" # Column with observed/actual values
      predicted: "predicted_values" # Column with predicted values
      null_hypothesis_mean: 0.0 # Optional: null hypothesis mean (default: 0.0)
      segment: [["model_version"]] # Optional: segmentation columns
      dataset: "predictions"

Summary-Level Data Format

For pre-aggregated difference statistics:

metrics:
  summary_ttest:
    metric_type: "ttest"
    config:
      name: ["aggregated_ttest"]
      data_format: "summary_level"
      volume: "observation_count" # Column with observation counts
      sum_differences: "sum_diff" # Column with sum of differences
      sum_squared_differences: "sum_sq_diff" # Column with sum of squared differences
      null_hypothesis_mean: 0.0 # Optional: null hypothesis mean (default: 0.0)
      segment: [["data_source"]] # Optional: segmentation columns
      dataset: "difference_summary"
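
These summary-level inputs are sufficient to recover the same statistics: the mean difference is sum_diff / n, and the sum of squared deviations follows from the identity sum((d - mean)^2) = sum(d^2) - (sum(d))^2 / n. A sketch of that algebra (the function name and scipy usage are illustrative, not the metric's code):

import math
from scipy import stats

def ttest_from_summary(n, sum_diff, sum_sq_diff, null_mean=0.0):
    # Illustrative sketch: reconstruct the t-test from pre-aggregated sums.
    mean_diff = sum_diff / n
    # Shortcut formula: sum((d - mean)^2) = sum(d^2) - (sum(d))^2 / n
    ss_dev = sum_sq_diff - (sum_diff ** 2) / n
    sample_std = math.sqrt(max(ss_dev, 0.0) / (n - 1))  # Bessel's correction
    if sample_std == 0.0:
        return None, None  # zero variance: t-statistic is undefined
    t_statistic = (mean_diff - null_mean) / (sample_std / math.sqrt(n))
    p_value = 2.0 * stats.t.sf(abs(t_statistic), df=n - 1)
    return t_statistic, p_value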

Required Fields by Format

Record-Level Required

  • name: Metric name(s)
  • data_format: Must be "record_level"
  • observed: Observed values column name
  • predicted: Predicted values column name
  • dataset: Dataset reference

Summary-Level Required

  • name: Metric name(s)
  • data_format: Must be "summary_level"
  • volume: Volume count column name
  • sum_differences: Sum of differences column name
  • sum_squared_differences: Sum of squared differences column name
  • dataset: Dataset reference

Optional Fields

  • segment: List of column names for grouping
  • null_hypothesis_mean: Null hypothesis mean value (default: 0.0)

Output Columns

The metric produces the following output columns:

  • group_key: Segmentation group identifier (struct of segment values)
  • volume: Total number of observations
  • t_statistic: Calculated t-statistic value
  • p_value: Two-tailed p-value from t-distribution
  • mean_difference: Mean of (observed - predicted) differences

Fan-out Examples

Single Configuration

metrics:
  basic_ttest:
    metric_type: "ttest"
    config:
      name: ["model_ttest"]
      data_format: "record_level"
      observed: "actual_values"
      predicted: "predicted_values"
      dataset: "validation_data"

Segmented Analysis

metrics:
  segmented_ttest:
    metric_type: "ttest"
    config:
      name: ["regional_ttest", "product_ttest"]
      data_format: "record_level"
      observed: "observed_values"
      predicted: "predicted_values"
      segment: [["region"], ["product_type"]]
      dataset: "performance_data"
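
To picture what segmentation produces (one t-test per group, fanned out once per segment spec), here is a hedged pandas sketch. The DataFrame, its values, and the helper function are illustrative stand-ins that mirror the column names in the config above; they are not part of the metric's API.

import pandas as pd
from scipy import stats

performance_data = pd.DataFrame({
    "observed_values": [10.2, 9.8, 11.1, 10.5, 9.9, 10.7],
    "predicted_values": [10.0, 10.0, 10.5, 10.0, 10.2, 10.4],
    "region": ["east", "east", "east", "west", "west", "west"],
    "product_type": ["a", "b", "a", "b", "a", "b"],
})

def segmented_ttests(df, segment_cols, null_mean=0.0):
    # One one-sample t-test on (observed - predicted) per segment group
    diffs = df["observed_values"] - df["predicted_values"]
    rows = []
    for key, group in diffs.groupby([df[c] for c in segment_cols]):
        t_statistic, p_value = stats.ttest_1samp(group, popmean=null_mean)
        rows.append({
            "group_key": key,
            "volume": len(group),
            "t_statistic": t_statistic,
            "p_value": p_value,
            "mean_difference": group.mean(),
        })
    return pd.DataFrame(rows)

# segment: [["region"], ["product_type"]] fans out to one run per spec
regional_results = segmented_ttests(performance_data, ["region"])
product_results = segmented_ttests(performance_data, ["product_type"])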

Custom Null Hypothesis

metrics:
  hypothesis_ttest:
    metric_type: "ttest"
    config:
      name: ["bias_test"]
      data_format: "record_level"
      observed: "actual_sales"
      predicted: "forecast_sales"
      null_hypothesis_mean: 1000.0  # Testing if mean difference equals 1000
      dataset: "sales_data"

Mixed Data Formats

metrics:
  detailed_ttest:
    metric_type: "ttest"
    config:
      name: ["record_level_ttest"]
      data_format: "record_level"
      observed: "actual"
      predicted: "predicted"
      dataset: "detailed_data"

  summary_ttest:
    metric_type: "ttest"
    config:
      name: ["summary_ttest"]
      data_format: "summary_level"
      volume: "count"
      sum_differences: "sum_diff"
      sum_squared_differences: "sum_sq_diff"
      dataset: "summary_data"

Data Requirements

Record-Level Data

  • One row per observation
  • Observed column: numeric values
  • Predicted column: numeric values
  • At least 2 observations required (the sample standard deviation needs n - 1 ≥ 1 degrees of freedom)

Summary-Level Data

  • One row per group/segment
  • Volume counts: positive integers ≥ 2
  • Sum of differences: numeric values (can be positive, negative, or zero)
  • Sum of squared differences: non-negative numbers

T-Test Interpretation

T-Statistic

  • t > 0: Mean difference exceeds the null hypothesis mean (with the default null of 0, observed values tend to be higher than predicted)
  • t < 0: Mean difference falls below the null hypothesis mean (observed values tend to be lower than predicted)
  • t = 0: Sample mean difference equals the null hypothesis mean exactly
  • |t| > 2: Roughly corresponds to significance at α ≈ 0.05 in moderately large samples (rule of thumb)

P-Value Interpretation

  • p < 0.001: Very strong evidence against null hypothesis
  • p < 0.01: Strong evidence against null hypothesis
  • p < 0.05: Moderate evidence against null hypothesis (common significance level)
  • p < 0.10: Weak evidence against null hypothesis
  • p ≥ 0.10: Insufficient evidence against null hypothesis

Statistical Significance

Choose a significance level (α) before testing:

  • α = 0.05: Standard significance level
  • α = 0.01: Conservative significance level
  • α = 0.10: Liberal significance level

If the p-value ≤ α, reject the null hypothesis.
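
As a worked example of the decision rule (the α value and the sample differences below are illustrative):

from scipy import stats

alpha = 0.05  # chosen before testing
differences = [1.2, -0.4, 0.8, 1.5, 0.3, 0.9]  # observed - predicted (example)

t_statistic, p_value = stats.ttest_1samp(differences, popmean=0.0)
if p_value <= alpha:
    print(f"Reject H0 at alpha={alpha}: t={t_statistic:.3f}, p={p_value:.4f}")
else:
    print(f"Fail to reject H0 at alpha={alpha}: t={t_statistic:.3f}, p={p_value:.4f}")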

Use Cases

Model Bias Detection

Test whether predictions are systematically biased:

bias_test:
  metric_type: "ttest"
  config:
    name: ["bias_detection"]
    data_format: "record_level"
    observed: "actual_revenue"
    predicted: "model_prediction"
    null_hypothesis_mean: 0.0  # Test for zero bias

Model Comparison

Test whether one model systematically outperforms another by comparing their paired errors:

model_comparison:
  metric_type: "ttest"
  config:
    name: ["model_difference_test"]
    data_format: "record_level"
    observed: "model_a_errors"
    predicted: "model_b_errors"
    null_hypothesis_mean: 0.0  # Test for equal performance

Forecast Validation

Validate forecast accuracy against business targets:

forecast_validation:
  metric_type: "ttest"
  config:
    name: ["forecast_test"]
    data_format: "record_level"
    observed: "actual_demand"
    predicted: "forecast_demand"
    null_hypothesis_mean: 100.0  # Test against target difference

Important Notes

  1. Sample Size: Requires at least 2 observations; results are more reliable with larger samples
  2. Normality Assumption: T-test assumes differences are approximately normally distributed
  3. Independence: Observations should be independent of each other
  4. Two-Tailed Test: The implementation performs two-tailed tests by default
  5. Zero Variance: When all differences are identical, t-statistic is undefined (returns null)
  6. Effect Size: Consider practical significance alongside statistical significance
  7. Multiple Testing: When performing multiple t-tests, consider correcting for multiple comparisons (see the sketch after this list)
  8. Outliers: T-tests can be sensitive to outliers; consider data cleaning
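
For note 7, the simplest correction is Bonferroni: multiply each raw p-value by the number of tests (capped at 1) before comparing to α. A minimal sketch with illustrative numbers:

def bonferroni_adjust(p_values):
    # Bonferroni correction: adjusted p = min(1, p * number_of_tests)
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Raw p-values from several segment-level t-tests (illustrative)
raw_p = [0.012, 0.048, 0.300, 0.004]
alpha = 0.05
for p, p_adj in zip(raw_p, bonferroni_adjust(raw_p)):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  reject H0: {p_adj <= alpha}")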