F1 and F2 Score Metrics¶
The f1_score and f2_score metrics calculate F-beta scores, which are weighted harmonic means of precision and recall. The two metrics summarize classification performance in a single value, each placing a different emphasis on precision versus recall.
Metric Types: f1_score, f2_score
F-Score Calculation¶
F-scores are calculated using the general F-beta formula: (1 + β²) * (precision * recall) / (β² * precision + recall)
Where:
- F1 Score (β=1): 2 * (precision * recall) / (precision + recall) - balanced measure
- F2 Score (β=2): 5 * (precision * recall) / (4 * precision + recall) - recall-weighted measure
- precision = TP / (TP + FP) - fraction of predicted positives that are actually positive
- recall = TP / (TP + FN) - fraction of actual positives that are correctly predicted
- TP = True Positives, FP = False Positives, FN = False Negatives
Both scores range from 0 to 1, where 1 indicates perfect classification performance.
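As a quick illustration of the formula, here is a minimal Python sketch (not part of the metrics library) that computes an F-beta score from precision and recall:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # edge case: both components zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 0.60
f1 = f_beta(precision, recall, beta=1.0)  # 2*P*R / (P + R)   -> ~0.667
f2 = f_beta(precision, recall, beta=2.0)  # 5*P*R / (4*P + R) -> ~0.625
```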
Key Differences¶
| Metric | Beta (β) | Focus | Use Case |
|---|---|---|---|
| F1 Score | 1.0 | Balanced precision and recall | General classification, equal cost of FP and FN |
| F2 Score | 2.0 | Emphasizes recall over precision | High cost of false negatives (medical, fraud, safety) |
Configuration Fields¶
Record-Level Data Format¶
For individual observation records with probabilities and binary outcomes:
```yaml
collections:
  model_f1:
    metrics:
      - name:
          - classification_f1
        data_format: record
        prob_def: probability
        default: actual_outcome
        threshold: 0.5
        segment:
          - - model_version
        metric_type: f1_score
        dataset: predictions
  model_f2:
    metrics:
      - name:
          - classification_f2
        data_format: record
        prob_def: probability
        default: actual_outcome
        threshold: 0.3
        segment:
          - - model_version
        metric_type: f2_score
        dataset: predictions
```
Summary-Level Data Format¶
For pre-aggregated data grouped by risk buckets:
```yaml
collections:
  summary_f1:
    metrics:
      - name:
          - aggregated_f1
        data_format: summary
        mean_pd: mean_probability
        defaults: default_count
        volume: observation_count
        threshold: 0.5
        segment:
          - - data_source
        metric_type: f1_score
        dataset: risk_buckets
  summary_f2:
    metrics:
      - name:
          - aggregated_f2
        data_format: summary
        mean_pd: mean_probability
        defaults: default_count
        volume: observation_count
        threshold: 0.4
        segment:
          - - data_source
        metric_type: f2_score
        dataset: risk_buckets
```
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record"
- prob_def: Predicted probability column name (values between 0.0 and 1.0)
- default: Binary outcome column name (values 0 or 1)
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary"
- mean_pd: Mean probability column name (values between 0.0 and 1.0)
- defaults: Default count column name (positive numbers)
- volume: Volume count column name (positive numbers)
- dataset: Dataset reference
Optional Fields¶
- threshold: Classification threshold (default: 0.5, range: 0.0-1.0)
- segment: List of column names for grouping
Output Columns¶
Both metrics produce the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of actual positives
- odr: Observed Default Rate (defaults/volume)
- pd: Mean predicted probability
- precision: Precision score (TP/(TP+FP))
- recall: Recall score (TP/(TP+FN))
- f_score: F1 or F2 score value (depending on metric type)
- tp: True Positives count
- fp: False Positives count
- fn: False Negatives count
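As a rough sketch of how these columns relate for record-level data, the snippet below derives them for a single segmentation group. It reuses the f_beta helper from the sketch above and assumes, as in the summary-level logic described later, that probabilities at or above the threshold count as predicted positive; it is illustrative only, not the library's implementation:

```python
def record_level_counts(probs: list[float], outcomes: list[int], threshold: float = 0.5) -> dict:
    """Derive the output columns for one segmentation group from record-level data."""
    preds = [1 if p >= threshold else 0 for p in probs]  # assumption: >= threshold is predicted positive
    tp = sum(1 for pred, y in zip(preds, outcomes) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, outcomes) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, outcomes) if pred == 0 and y == 1)
    volume = len(outcomes)
    defaults = sum(outcomes)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "volume": volume,
        "defaults": defaults,
        "odr": defaults / volume if volume else 0.0,   # observed default rate
        "pd": sum(probs) / volume if volume else 0.0,  # mean predicted probability
        "precision": precision,
        "recall": recall,
        "f_score": f_beta(precision, recall, beta=1.0),  # use beta=2.0 for f2_score
        "tp": tp,
        "fp": fp,
        "fn": fn,
    }
```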
Fan-out Examples¶
Comparative Analysis¶
```yaml
collections:
  f_score_comparison:
    metrics:
      - name:
          - balanced_f1
          - regional_f1
        data_format: record
        prob_def: risk_score
        default: default_flag
        threshold: 0.5
        segment:
          - []
          - - region
        metric_type: f1_score
        dataset: validation_data
  recall_focused:
    metrics:
      - name:
          - recall_f2
          - regional_f2
        data_format: record
        prob_def: risk_score
        default: default_flag
        threshold: 0.3
        segment:
          - []
          - - region
        metric_type: f2_score
        dataset: validation_data
```
Mixed Data Formats¶
```yaml
collections:
  detailed_scores:
    metrics:
      - name:
          - record_f1
        data_format: record
        prob_def: probability
        default: outcome
        threshold: 0.5
        metric_type: f1_score
        dataset: detailed_data
  summary_scores:
    metrics:
      - name:
          - summary_f2
        data_format: summary
        mean_pd: avg_probability
        defaults: default_count
        volume: total_count
        threshold: 0.3
        metric_type: f2_score
        dataset: summary_data
```
Data Requirements¶
Record-Level Data¶
- One row per observation
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0 or 1)
- No missing values in key columns
Summary-Level Data¶
- One row per risk bucket/group
- Mean probability: numeric values between 0.0 and 1.0
- Default counts: non-negative integers
- Volume counts: positive integers
- Defaults should not exceed volume for any bucket
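A lightweight sanity check along these lines can catch bad inputs before the metrics run. The snippet below is only a sketch of the checks listed above, not part of the library:

```python
def check_record(prob: float, default: int) -> None:
    """Validate one record-level row against the requirements above."""
    assert prob is not None and default is not None, "no missing values in key columns"
    assert 0.0 <= prob <= 1.0, "probability must be between 0.0 and 1.0"
    assert default in (0, 1), "default column must be binary (0 or 1)"

def check_summary_bucket(mean_pd: float, defaults: int, volume: int) -> None:
    """Validate one summary-level bucket against the requirements above."""
    assert 0.0 <= mean_pd <= 1.0, "mean probability must be between 0.0 and 1.0"
    assert isinstance(defaults, int) and defaults >= 0, "default count must be a non-negative integer"
    assert isinstance(volume, int) and volume > 0, "volume count must be a positive integer"
    assert defaults <= volume, "defaults should not exceed volume for any bucket"
```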
Score Interpretation¶
Value Guidelines¶
- 1.0: Perfect classification (both precision and recall = 1.0)
- 0.8-1.0: Excellent classification performance
- 0.6-0.8: Good classification performance
- 0.4-0.6: Fair classification performance
- 0.0-0.4: Poor classification performance
F1 vs F2 Behavior¶
When recall > precision:
- F2 score will be higher than F1 score
- F2 moves closer to the recall value
- F1 provides a balanced middle ground

When precision > recall:
- F1 score will be higher than F2 score
- F2 penalizes low recall more heavily
- F1 provides better balance for this scenario
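For example, with precision = 0.5 and recall = 0.8 (recall > precision):

- F1 = 2 * (0.5 * 0.8) / (0.5 + 0.8) = 0.8 / 1.3 ≈ 0.615
- F2 = 5 * (0.5 * 0.8) / (4 * 0.5 + 0.8) = 2.0 / 2.8 ≈ 0.714

F2 (≈ 0.714) sits closer to the recall value (0.8) than F1 (≈ 0.615) does, as described above.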
Use Case Selection¶
Choose F1 Score When:¶
- False positives and false negatives have similar costs
- You need a balanced assessment of model performance
- General classification tasks
- Model comparison across different domains
- Benchmark reporting where balance is preferred
Choose F2 Score When:¶
- False negatives are more costly than false positives
- Medical diagnosis: Missing disease is worse than false alarms
- Fraud detection: Missing fraud costs more than false investigations
- Safety systems: Failing to detect danger has severe consequences
- Quality control: Missing defects is expensive
- Customer churn: Missing at-risk customers costs more than false interventions
Threshold Optimization¶
F1 Score Thresholds¶
- Often optimized around 0.4-0.6 range
- Balances precision and recall trade-offs
- Consider ROC curve analysis for selection
F2 Score Thresholds¶
- Often optimized at lower values (0.2-0.4)
- Prioritizes higher recall
- Accept more false positives for better recall
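A simple way to explore these ranges is a grid sweep over candidate thresholds, scoring each one on held-out data. The sketch below reuses the f_beta and record_level_counts helpers sketched earlier and is illustrative only, not a library feature:

```python
def best_threshold(probs: list[float], outcomes: list[int], beta: float) -> tuple[float, float]:
    """Sweep candidate thresholds and return (best_threshold, best_f_score)."""
    best_t, best_f = 0.5, 0.0
    for step in range(1, 100):
        t = step / 100  # candidate thresholds 0.01 .. 0.99
        cols = record_level_counts(probs, outcomes, threshold=t)
        f = f_beta(cols["precision"], cols["recall"], beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# On the same data, the F2-optimal threshold (beta=2.0) typically lands below the
# F1-optimal threshold (beta=1.0), because lower thresholds trade precision for recall.
```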
Important Notes¶
- Metric Selection: Choose F1 for balanced evaluation, F2 when recall is critical
- Threshold Impact: F2 often benefits from lower thresholds than F1
- Data Format: Record-level uses individual probabilities; summary-level treats entire buckets as positive/negative
- Summary Logic: In summary-level format, buckets with mean_pd ≥ threshold are treated as predicted positive (see the sketch after this list)
- Complementary Analysis: Use both metrics together for comprehensive evaluation
- Business Context: Let the cost of false negatives vs false positives guide metric choice
- Mathematical Relationship: F2 will always be between recall and F1 when recall > precision
- Edge Cases: When TP+FP=0 or TP+FN=0, the respective precision or recall becomes 0, making F-scores = 0
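To make the summary-level logic concrete, here is a minimal sketch of how pre-aggregated buckets could be reduced to confusion-matrix counts, assuming (as noted above) that every observation in a bucket inherits the bucket-level prediction. This is an assumed illustration, not the library's code:

```python
def summary_level_counts(buckets: list[dict], threshold: float = 0.5) -> dict:
    """Reduce summary-level buckets (mean_pd, defaults, volume) to TP/FP/FN counts."""
    tp = fp = fn = 0
    for b in buckets:
        if b["mean_pd"] >= threshold:          # whole bucket treated as predicted positive
            tp += b["defaults"]                # actual positives in a positive bucket
            fp += b["volume"] - b["defaults"]  # actual negatives in a positive bucket
        else:                                  # whole bucket treated as predicted negative
            fn += b["defaults"]                # actual positives that are missed
    return {"tp": tp, "fp": fp, "fn": fn}

# Hypothetical buckets for illustration only
buckets = [
    {"mean_pd": 0.72, "defaults": 40, "volume": 100},
    {"mean_pd": 0.31, "defaults": 10, "volume": 200},
]
counts = summary_level_counts(buckets, threshold=0.5)  # {'tp': 40, 'fp': 60, 'fn': 10}
```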