F1 and F2 Score Metrics

The f1_score and f2_score metrics calculate F-beta scores, which are weighted harmonic means of precision and recall. Both summarize classification performance in a single number; they differ only in how much weight recall receives relative to precision.

Metric Types: f1_score, f2_score

F-Score Calculation

F-scores are calculated using the general F-beta formula: (1 + β²) * (precision * recall) / (β² * precision + recall)

Where:

  • precision = TP / (TP + FP) - fraction of predicted positives that are actually positive
  • recall = TP / (TP + FN) - fraction of actual positives that are correctly predicted
  • TP = True Positives, FP = False Positives, FN = False Negatives

Substituting β gives the two specific scores:

  • F1 Score (β=1): 2 * (precision * recall) / (precision + recall) - balanced measure
  • F2 Score (β=2): 5 * (precision * recall) / (4 * precision + recall) - recall-weighted measure

Both scores range from 0 to 1, where 1 indicates perfect classification performance.
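
A minimal sketch of this calculation in Python, starting from confusion counts (an illustrative helper, not part of the library):

def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    # F-beta from confusion counts; falls back to 0.0 when precision or recall is undefined
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 80 true positives, 20 false positives, 40 false negatives
print(f_beta(80, 20, 40, beta=1))  # ≈ 0.727 (F1)
print(f_beta(80, 20, 40, beta=2))  # ≈ 0.690 (F2)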

Key Differences

| Metric   | Beta (β) | Focus                            | Use Case                                              |
|----------|----------|----------------------------------|-------------------------------------------------------|
| F1 Score | 1.0      | Balanced precision and recall    | General classification, equal cost of FP and FN       |
| F2 Score | 2.0      | Emphasizes recall over precision | High cost of false negatives (medical, fraud, safety) |
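
For example, with precision = 0.5 and recall = 0.8 (a recall-heavy model):

  • F1 = 2 * (0.5 * 0.8) / (0.5 + 0.8) = 0.8 / 1.3 ≈ 0.62
  • F2 = 5 * (0.5 * 0.8) / (4 * 0.5 + 0.8) = 2.0 / 2.8 ≈ 0.71

The F2 score sits closer to the recall value, which is why it is preferred when missed positives are costly.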

Configuration Fields

Record-Level Data Format

For individual observation records with probabilities and binary outcomes:

metrics:
  model_f1:
    metric_type: "f1_score"
    config:
      name: ["classification_f1"]
      data_format: "record_level"
      prob_def: "probability" # Column with predicted probabilities (0-1)
      default: "actual_outcome" # Column with binary outcomes (0/1)
      threshold: 0.5 # Threshold for converting probabilities to predictions
      segment: [["model_version"]] # Optional: segmentation columns
      dataset: "predictions"

  model_f2:
    metric_type: "f2_score"
    config:
      name: ["classification_f2"]
      data_format: "record_level"
      prob_def: "probability"
      default: "actual_outcome"
      threshold: 0.3 # Often lower for F2 to maximize recall
      segment: [["model_version"]]
      dataset: "predictions"

Summary-Level Data Format

For pre-aggregated data grouped by risk buckets:

metrics:
  summary_f1:
    metric_type: "f1_score"
    config:
      name: ["aggregated_f1"]
      data_format: "summary_level"
      mean_pd: "mean_probability" # Column with mean probabilities per bucket
      defaults: "default_count" # Column with default counts per bucket
      volume: "observation_count" # Column with total observations per bucket
      threshold: 0.5 # Threshold for treating buckets as positive predictions
      segment: [["data_source"]] # Optional: segmentation columns
      dataset: "risk_buckets"

  summary_f2:
    metric_type: "f2_score"
    config:
      name: ["aggregated_f2"]
      data_format: "summary_level"
      mean_pd: "mean_probability"
      defaults: "default_count"
      volume: "observation_count"
      threshold: 0.4
      segment: [["data_source"]]
      dataset: "risk_buckets"

Required Fields by Format

Record-Level Required

  • name: Metric name(s)
  • data_format: Must be "record_level"
  • prob_def: Predicted probability column name (values between 0.0 and 1.0)
  • default: Binary outcome column name (values 0 or 1)
  • dataset: Dataset reference

Summary-Level Required

  • name: Metric name(s)
  • data_format: Must be "summary_level"
  • mean_pd: Mean probability column name (values between 0.0 and 1.0)
  • defaults: Default count column name (positive numbers)
  • volume: Volume count column name (positive numbers)
  • dataset: Dataset reference

Optional Fields

  • threshold: Classification threshold (default: 0.5, range: 0.0-1.0)
  • segment: List of column names for grouping

Output Columns

Both metrics produce the following output columns:

  • group_key: Segmentation group identifier (struct of segment values)
  • volume: Total number of observations
  • defaults: Total number of actual positives
  • odr: Observed Default Rate (defaults/volume)
  • pd: Mean predicted probability
  • precision: Precision score (TP/(TP+FP))
  • recall: Recall score (TP/(TP+FN))
  • f_score: F1 or F2 score value (depending on metric type)
  • tp: True Positives count
  • fp: False Positives count
  • fn: False Negatives count
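
For illustration, a single output row from an f2_score run segmented by model_version might look like this (hypothetical values):

group_key = {model_version: "v2"}, volume = 1000, defaults = 120, odr = 0.12, pd = 0.14, precision ≈ 0.61, recall = 0.75, f_score ≈ 0.72, tp = 90, fp = 58, fn = 30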

Fan-out Examples

Comparative Analysis

metrics:
  f_score_comparison:
    metric_type: "f1_score"
    config:
      name: ["balanced_f1", "regional_f1"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.5
      segment: [[], ["region"]]
      dataset: "validation_data"

  recall_focused:
    metric_type: "f2_score"
    config:
      name: ["recall_f2", "regional_f2"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.3 # Lower threshold for better recall
      segment: [[], ["region"]]
      dataset: "validation_data"

Mixed Data Formats

metrics:
  detailed_scores:
    metric_type: "f1_score"
    config:
      name: ["record_f1"]
      data_format: "record_level"
      prob_def: "probability"
      default: "outcome"
      threshold: 0.5
      dataset: "detailed_data"

  summary_scores:
    metric_type: "f2_score"
    config:
      name: ["summary_f2"]
      data_format: "summary_level"
      mean_pd: "avg_probability"
      defaults: "default_count"
      volume: "total_count"
      threshold: 0.3
      dataset: "summary_data"

Data Requirements

Record-Level Data

  • One row per observation
  • Probability column: numeric values between 0.0 and 1.0
  • Default column: binary values (0 or 1)
  • No missing values in key columns

Summary-Level Data

  • One row per risk bucket/group
  • Mean probability: numeric values between 0.0 and 1.0
  • Default counts: non-negative integers
  • Volume counts: positive integers
  • Defaults should not exceed volume for any bucket
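
These constraints are easy to check before a run. An illustrative pandas validation sketch, using the hypothetical column names from the summary-level example:

import pandas as pd

# Sanity checks for summary-level input (hypothetical column names).
buckets = pd.DataFrame({
    "mean_probability":  [0.80, 0.55, 0.20],
    "default_count":     [45, 20, 5],
    "observation_count": [50, 60, 200],
})

assert buckets["mean_probability"].between(0.0, 1.0).all(), "mean probability out of range"
assert (buckets["default_count"] >= 0).all(), "negative default counts"
assert (buckets["observation_count"] > 0).all(), "non-positive volumes"
assert (buckets["default_count"] <= buckets["observation_count"]).all(), "defaults exceed volume"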

Score Interpretation

Value Guidelines

  • 1.0: Perfect classification (both precision and recall = 1.0)
  • 0.8-1.0: Excellent classification performance
  • 0.6-0.8: Good classification performance
  • 0.4-0.6: Fair classification performance
  • 0.0-0.4: Poor classification performance

F1 vs F2 Behavior

When recall > precision:

  • F2 score will be higher than F1 score
  • F2 moves closer to the recall value
  • F1 provides a balanced middle ground

When precision > recall:

  • F1 score will be higher than F2 score
  • F2 penalizes low recall more heavily
  • F1 provides better balance for this scenario

Use Case Selection

Choose F1 Score When:

  • False positives and false negatives have similar costs
  • You need a balanced assessment of model performance
  • General classification tasks
  • Model comparison across different domains
  • Benchmark reporting where balance is preferred

Choose F2 Score When:

  • False negatives are more costly than false positives
  • Medical diagnosis: Missing disease is worse than false alarms
  • Fraud detection: Missing fraud costs more than false investigations
  • Safety systems: Failing to detect danger has severe consequences
  • Quality control: Missing defects is expensive
  • Customer churn: Missing at-risk customers costs more than false interventions

Threshold Optimization

F1 Score Thresholds

  • Often optimized in the 0.4-0.6 range
  • Balances precision and recall trade-offs
  • Consider ROC curve analysis for selection

F2 Score Thresholds

  • Often optimized at lower values (0.2-0.4)
  • Prioritizes higher recall
  • Accepts more false positives in exchange for better recall
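
One simple way to pick a threshold is a grid sweep that keeps the value maximizing the chosen F-score. A minimal sketch with hypothetical data (not a utility provided by the library):

import numpy as np

# Illustrative threshold sweep: keep the threshold that maximizes F2.
probs = np.array([0.9, 0.7, 0.4, 0.35, 0.2, 0.8, 0.15, 0.6])
actual = np.array([1, 1, 1, 0, 0, 0, 0, 1])

def f2(threshold):
    pred = probs >= threshold
    tp = np.sum(pred & (actual == 1))
    fp = np.sum(pred & (actual == 0))
    fn = np.sum(~pred & (actual == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 5 * precision * recall / (4 * precision + recall) if precision + recall else 0.0

thresholds = np.arange(0.05, 1.0, 0.05)
best = max(thresholds, key=f2)
print(f"best threshold for F2: {best:.2f} (F2 = {f2(best):.3f})")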

Important Notes

  1. Metric Selection: Choose F1 for balanced evaluation, F2 when recall is critical
  2. Threshold Impact: F2 often benefits from lower thresholds than F1
  3. Data Format: Record-level uses individual probabilities; summary-level treats entire buckets as positive/negative
  4. Summary Logic: In summary-level format, buckets with mean_pd ≥ threshold are treated as predicted positive
  5. Complementary Analysis: Use both metrics together for comprehensive evaluation
  6. Business Context: Let the cost of false negatives vs false positives guide metric choice
  7. Mathematical Relationship: F2 will always be between recall and F1 when recall > precision
  8. Edge Cases: When TP + FP = 0 or TP + FN = 0, the corresponding precision or recall is undefined and is treated as 0, which forces the F-score to 0