F1 and F2 Score Metrics¶
The f1_score and f2_score metrics calculate F-beta scores, which are weighted harmonic means of precision and recall. Each combines precision and recall into a single measure of classification performance; they differ in how much weight is placed on recall relative to precision.
Metric Types: f1_score, f2_score
F-Score Calculation¶
F-scores are calculated using the general F-beta formula: (1 + β²) * (precision * recall) / (β² * precision + recall)
Where:
- F1 Score (β=1): 2 * (precision * recall) / (precision + recall) - balanced measure
- F2 Score (β=2): 5 * (precision * recall) / (4 * precision + recall) - recall-weighted measure
- precision = TP / (TP + FP) - fraction of predicted positives that are actually positive
- recall = TP / (TP + FN) - fraction of actual positives that are correctly predicted
- TP = True Positives, FP = False Positives, FN = False Negatives
Both scores range from 0 to 1, where 1 indicates perfect classification performance.
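As a quick illustration of the formula above, the following standalone Python sketch computes both scores from confusion-matrix counts (the f_beta helper is purely illustrative and not part of the metric library):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta score from confusion-matrix counts, with the 0-denominator convention."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

# Example: 80 true positives, 20 false positives, 40 false negatives
print(f_beta(80, 20, 40, beta=1))  # F1 ≈ 0.727 (precision 0.80, recall 0.667)
print(f_beta(80, 20, 40, beta=2))  # F2 ≈ 0.690 (pulled toward the lower recall)
```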
Key Differences¶
| Metric | Beta (β) | Focus | Use Case |
|---|---|---|---|
| F1 Score | 1.0 | Balanced precision and recall | General classification, equal cost of FP and FN |
| F2 Score | 2.0 | Emphasizes recall over precision | High cost of false negatives (medical, fraud, safety) |
Configuration Fields¶
Record-Level Data Format¶
For individual observation records with probabilities and binary outcomes:
metrics:
  model_f1:
    metric_type: "f1_score"
    config:
      name: ["classification_f1"]
      data_format: "record_level"
      prob_def: "probability"        # Column with predicted probabilities (0-1)
      default: "actual_outcome"      # Column with binary outcomes (0/1)
      threshold: 0.5                 # Threshold for converting probabilities to predictions
      segment: [["model_version"]]   # Optional: segmentation columns
      dataset: "predictions"
  model_f2:
    metric_type: "f2_score"
    config:
      name: ["classification_f2"]
      data_format: "record_level"
      prob_def: "probability"
      default: "actual_outcome"
      threshold: 0.3                 # Often lower for F2 to maximize recall
      segment: [["model_version"]]
      dataset: "predictions"
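Conceptually, record-level scoring applies the configured threshold to each probability and then evaluates the resulting binary predictions. A minimal sketch of that logic using pandas and scikit-learn, with column names matching the config above (illustrative only, not the library's internal implementation):

```python
import pandas as pd
from sklearn.metrics import fbeta_score

predictions = pd.DataFrame({
    "probability":    [0.10, 0.35, 0.62, 0.80, 0.45],
    "actual_outcome": [0,    0,    1,    1,    1],
})

# threshold: 0.5 turns probabilities into 0/1 predictions
predicted = (predictions["probability"] >= 0.5).astype(int)

f1 = fbeta_score(predictions["actual_outcome"], predicted, beta=1)  # 0.80
f2 = fbeta_score(predictions["actual_outcome"], predicted, beta=2)  # ≈ 0.71
```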
Summary-Level Data Format¶
For pre-aggregated data grouped by risk buckets:
metrics:
  summary_f1:
    metric_type: "f1_score"
    config:
      name: ["aggregated_f1"]
      data_format: "summary_level"
      mean_pd: "mean_probability"    # Column with mean probabilities per bucket
      defaults: "default_count"      # Column with default counts per bucket
      volume: "observation_count"    # Column with total observations per bucket
      threshold: 0.5                 # Threshold for treating buckets as positive predictions
      segment: [["data_source"]]     # Optional: segmentation columns
      dataset: "risk_buckets"
  summary_f2:
    metric_type: "f2_score"
    config:
      name: ["aggregated_f2"]
      data_format: "summary_level"
      mean_pd: "mean_probability"
      defaults: "default_count"
      volume: "observation_count"
      threshold: 0.4
      segment: [["data_source"]]
      dataset: "risk_buckets"
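In summary-level mode, buckets whose mean probability meets the threshold are treated as predicted positive (see the notes at the end of this page). The sketch below shows one plausible way the confusion-matrix counts could then be derived from bucket counts; the exact aggregation is an assumption made for illustration, with column names from the config above:

```python
import pandas as pd

risk_buckets = pd.DataFrame({
    "mean_probability":  [0.05, 0.20, 0.45, 0.70],
    "default_count":     [2,    10,   30,   55],
    "observation_count": [400,  300,  100,  80],
})

threshold = 0.5
positive = risk_buckets["mean_probability"] >= threshold  # bucket predicted positive

# Assumed aggregation: defaults in positive buckets are TP, non-defaults in
# positive buckets are FP, and defaults in negative buckets are FN.
tp = risk_buckets.loc[positive, "default_count"].sum()
fp = (risk_buckets.loc[positive, "observation_count"]
      - risk_buckets.loc[positive, "default_count"]).sum()
fn = risk_buckets.loc[~positive, "default_count"].sum()

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```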
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record_level"
- prob_def: Predicted probability column name (values between 0.0 and 1.0)
- default: Binary outcome column name (values 0 or 1)
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary_level"
- mean_pd: Mean probability column name (values between 0.0 and 1.0)
- defaults: Default count column name (non-negative numbers)
- volume: Volume count column name (positive numbers)
- dataset: Dataset reference
Optional Fields¶
- threshold: Classification threshold (default: 0.5, range: 0.0-1.0)
- segment: List of column names for grouping
Output Columns¶
Both metrics produce the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of actual positives
- odr: Observed Default Rate (defaults/volume)
- pd: Mean predicted probability
- precision: Precision score (TP/(TP+FP))
- recall: Recall score (TP/(TP+FN))
- f_score: F1 or F2 score value (depending on metric type)
- tp: True Positives count
- fp: False Positives count
- fn: False Negatives count
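For illustration, a single output row for an f1_score metric segmented by model_version might look roughly like this (values are invented but internally consistent; the concrete representation depends on how results are stored):

```python
{
    "group_key": {"model_version": "v2"},
    "volume": 1000,
    "defaults": 120,
    "odr": 0.12,        # defaults / volume
    "pd": 0.11,         # mean predicted probability
    "precision": 0.62,  # 85 / (85 + 52)
    "recall": 0.71,     # 85 / (85 + 35)
    "f_score": 0.66,    # F1 here; F2 for f2_score metrics
    "tp": 85,
    "fp": 52,
    "fn": 35,
}
```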
Fan-out Examples¶
Comparative Analysis¶
metrics:
  f_score_comparison:
    metric_type: "f1_score"
    config:
      name: ["balanced_f1", "regional_f1"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.5
      segment: [[], ["region"]]
      dataset: "validation_data"
  recall_focused:
    metric_type: "f2_score"
    config:
      name: ["recall_f2", "regional_f2"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.3                 # Lower threshold for better recall
      segment: [[], ["region"]]
      dataset: "validation_data"
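The segment: [[], ["region"]] entry fans each metric out into one overall result plus one result per region. A rough pandas sketch of that grouping behaviour (illustrative only):

```python
import pandas as pd

validation_data = pd.DataFrame({
    "risk_score":   [0.2, 0.7, 0.6, 0.1],
    "default_flag": [0,   1,   1,   0],
    "region":       ["EU", "EU", "US", "US"],
})

# segment: [[], ["region"]] -> one overall group plus one group per region
overall = [("all", validation_data)]
per_region = list(validation_data.groupby("region"))

for group_key, frame in overall + per_region:
    predicted = (frame["risk_score"] >= 0.5).astype(int)
    # ...score `predicted` against frame["default_flag"] as in the earlier sketches
```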
Mixed Data Formats¶
metrics:
  detailed_scores:
    metric_type: "f1_score"
    config:
      name: ["record_f1"]
      data_format: "record_level"
      prob_def: "probability"
      default: "outcome"
      threshold: 0.5
      dataset: "detailed_data"
  summary_scores:
    metric_type: "f2_score"
    config:
      name: ["summary_f2"]
      data_format: "summary_level"
      mean_pd: "avg_probability"
      defaults: "default_count"
      volume: "total_count"
      threshold: 0.3
      dataset: "summary_data"
Data Requirements¶
Record-Level Data¶
- One row per observation
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0 or 1)
- No missing values in key columns
Summary-Level Data¶
- One row per risk bucket/group
- Mean probability: numeric values between 0.0 and 1.0
- Default counts: non-negative integers
- Volume counts: positive integers
- Defaults should not exceed volume for any bucket
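These requirements are easy to check up front. A small sketch of such validation for summary-level data, using the column names from the earlier example (not a built-in feature of the metrics):

```python
import pandas as pd

def validate_summary_level(df: pd.DataFrame) -> None:
    """Basic sanity checks for summary-level input data."""
    assert df["mean_probability"].between(0.0, 1.0).all(), "mean probabilities must be in [0, 1]"
    assert (df["default_count"] >= 0).all(), "default counts must be non-negative"
    assert (df["observation_count"] > 0).all(), "volumes must be positive"
    assert (df["default_count"] <= df["observation_count"]).all(), "defaults cannot exceed volume"
    assert not df[["mean_probability", "default_count", "observation_count"]].isna().any().any(), \
        "no missing values allowed in key columns"
```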
Score Interpretation¶
Value Guidelines¶
- 1.0: Perfect classification (both precision and recall = 1.0)
- 0.8-1.0: Excellent classification performance
- 0.6-0.8: Good classification performance
- 0.4-0.6: Fair classification performance
- 0.0-0.4: Poor classification performance
F1 vs F2 Behavior¶
When recall > precision:
- F2 score will be higher than F1 score
- F2 moves closer to the recall value
- F1 provides a balanced middle ground

When precision > recall:
- F1 score will be higher than F2 score
- F2 penalizes low recall more heavily
- F1 provides better balance for this scenario
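For example, with precision = 0.5 and recall = 0.9: F1 = 2 * (0.5 * 0.9) / (0.5 + 0.9) ≈ 0.64, while F2 = 5 * (0.5 * 0.9) / (4 * 0.5 + 0.9) ≈ 0.78, which sits much closer to the recall value.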
Use Case Selection¶
Choose F1 Score When:¶
- False positives and false negatives have similar costs
- You need a balanced assessment of model performance
- General classification tasks
- Model comparison across different domains
- Benchmark reporting where balance is preferred
Choose F2 Score When:¶
- False negatives are more costly than false positives
- Medical diagnosis: Missing disease is worse than false alarms
- Fraud detection: Missing fraud costs more than false investigations
- Safety systems: Failing to detect danger has severe consequences
- Quality control: Missing defects is expensive
- Customer churn: Missing at-risk customers costs more than false interventions
Threshold Optimization¶
F1 Score Thresholds¶
- Often optimized around 0.4-0.6 range
- Balances precision and recall trade-offs
- Consider ROC curve analysis for selection
F2 Score Thresholds¶
- Often optimized at lower values (0.2-0.4)
- Prioritizes higher recall
- Accept more false positives for better recall
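A common way to pick these thresholds is a simple sweep over candidate values on a validation set, keeping the value that maximises the chosen F-score. A hedged sketch using scikit-learn (grid spacing and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_threshold(y_true, y_prob, beta):
    """Return (threshold, score) maximising the F-beta score over a coarse grid."""
    best_t, best_score = 0.5, 0.0
    for t in np.arange(0.05, 0.96, 0.05):
        score = fbeta_score(y_true, (y_prob >= t).astype(int), beta=beta)
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score

# F1 thresholds typically land mid-range; F2 thresholds tend to come out lower:
# t1, _ = best_threshold(y_true, y_prob, beta=1)
# t2, _ = best_threshold(y_true, y_prob, beta=2)
```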
Important Notes¶
- Metric Selection: Choose F1 for balanced evaluation, F2 when recall is critical
- Threshold Impact: F2 often benefits from lower thresholds than F1
- Data Format: Record-level uses individual probabilities; summary-level treats entire buckets as positive/negative
- Summary Logic: In summary-level format, buckets with mean_pd ≥ threshold are treated as predicted positive
- Complementary Analysis: Use both metrics together for comprehensive evaluation
- Business Context: Let the cost of false negatives vs false positives guide metric choice
- Mathematical Relationship: F2 will always be between recall and F1 when recall > precision
- Edge Cases: When TP + FP = 0 or TP + FN = 0, the corresponding precision or recall is undefined and treated as 0, which makes the F-score 0