F1 and F2 Score Metrics¶
The f1_score and f2_score metrics calculate F-beta scores, which are weighted harmonic means of precision and recall. The two metrics summarize classification performance in a single value, each placing a different emphasis on precision versus recall.
Metric Types: f1_score, f2_score
F-Score Calculation¶
F-scores are calculated using the general F-beta formula: (1 + β²) * (precision * recall) / (β² * precision + recall)
Where:
- F1 Score (β=1): 2 * (precision * recall) / (precision + recall) - balanced measure
- F2 Score (β=2): 5 * (precision * recall) / (4 * precision + recall) - recall-weighted measure
- precision = TP / (TP + FP) - fraction of predicted positives that are actually positive
- recall = TP / (TP + FN) - fraction of actual positives that are correctly predicted
- TP = True Positives, FP = False Positives, FN = False Negatives
Both scores range from 0 to 1, where 1 indicates perfect classification performance.
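As a quick illustration of the formula, here is a minimal Python sketch (not part of the metrics library) that computes an F-beta score from precision and recall:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # edge case: both components zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.75, 0.60
f1 = f_beta(precision, recall, beta=1.0)  # 2*P*R / (P + R)   -> ~0.667
f2 = f_beta(precision, recall, beta=2.0)  # 5*P*R / (4*P + R) -> ~0.625
```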
Key Differences¶
| Metric | Beta (β) | Focus | Use Case |
|---|---|---|---|
| F1 Score | 1.0 | Balanced precision and recall | General classification, equal cost of FP and FN |
| F2 Score | 2.0 | Emphasizes recall over precision | High cost of false negatives (medical, fraud, safety) |
Configuration Fields¶
Record-Level Data Format¶
For individual observation records with probabilities and binary outcomes:
```yaml
collections:
  model_f1:
    metrics:
      - name:
          - classification_f1
        data_format: record
        prob_def: probability
        default: actual_outcome
        threshold: 0.5
        segment:
          - - model_version
        metric_type: f1_score
        dataset: predictions
  model_f2:
    metrics:
      - name:
          - classification_f2
        data_format: record
        prob_def: probability
        default: actual_outcome
        threshold: 0.3
        segment:
          - - model_version
        metric_type: f2_score
        dataset: predictions
```
Summary-Level Data Format¶
For pre-aggregated data grouped by risk buckets:
```yaml
collections:
  summary_f1:
    metrics:
      - name:
          - aggregated_f1
        data_format: summary
        mean_pd: mean_probability
        defaults: default_count
        volume: observation_count
        threshold: 0.5
        segment:
          - - data_source
        metric_type: f1_score
        dataset: risk_buckets
  summary_f2:
    metrics:
      - name:
          - aggregated_f2
        data_format: summary
        mean_pd: mean_probability
        defaults: default_count
        volume: observation_count
        threshold: 0.4
        segment:
          - - data_source
        metric_type: f2_score
        dataset: risk_buckets
```
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record"
- prob_def: Predicted probability column name (values between 0.0 and 1.0)
- default: Binary outcome column name (values 0 or 1)
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary"
- mean_pd: Mean probability column name (values between 0.0 and 1.0)
- defaults: Default count column name (positive numbers)
- volume: Volume count column name (positive numbers)
- dataset: Dataset reference
Optional Fields¶
- threshold: Classification threshold (default: 0.5, range: 0.0-1.0)
- segment: List of column names for grouping
Output Columns¶
Both metrics produce the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of actual positives
- odr: Observed Default Rate (defaults/volume)
- pd: Mean predicted probability
- precision: Precision score (TP/(TP+FP))
- recall: Recall score (TP/(TP+FN))
- f_score: F1 or F2 score value (depending on metric type)
- tp: True Positives count
- fp: False Positives count
- fn: False Negatives count
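As a rough sketch of how these columns relate for record-level data, the snippet below derives them for a single segmentation group. It reuses the f_beta helper from the sketch above and assumes, as in the summary-level logic described later, that probabilities at or above the threshold count as predicted positive; it is illustrative only, not the library's implementation:

```python
def record_level_counts(probs: list[float], outcomes: list[int], threshold: float = 0.5) -> dict:
    """Derive the output columns for one segmentation group from record-level data."""
    preds = [1 if p >= threshold else 0 for p in probs]  # assumption: >= threshold is predicted positive
    tp = sum(1 for pred, y in zip(preds, outcomes) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, outcomes) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, outcomes) if pred == 0 and y == 1)
    volume = len(outcomes)
    defaults = sum(outcomes)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "volume": volume,
        "defaults": defaults,
        "odr": defaults / volume if volume else 0.0,   # observed default rate
        "pd": sum(probs) / volume if volume else 0.0,  # mean predicted probability
        "precision": precision,
        "recall": recall,
        "f_score": f_beta(precision, recall, beta=1.0),  # use beta=2.0 for f2_score
        "tp": tp,
        "fp": fp,
        "fn": fn,
    }
```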
Fan-out Examples¶
Comparative Analysis¶
```yaml
collections:
  f_score_comparison:
    metrics:
      - name:
          - balanced_f1
          - regional_f1
        data_format: record
        prob_def: risk_score
        default: default_flag
        threshold: 0.5
        segment:
          - []
          - - region
        metric_type: f1_score
        dataset: validation_data
  recall_focused:
    metrics:
      - name:
          - recall_f2
          - regional_f2
        data_format: record
        prob_def: risk_score
        default: default_flag
        threshold: 0.3
        segment:
          - []
          - - region
        metric_type: f2_score
        dataset: validation_data
```
Mixed Data Formats¶
```yaml
collections:
  detailed_scores:
    metrics:
      - name:
          - record_f1
        data_format: record
        prob_def: probability
        default: outcome
        threshold: 0.5
        metric_type: f1_score
        dataset: detailed_data
  summary_scores:
    metrics:
      - name:
          - summary_f2
        data_format: summary
        mean_pd: avg_probability
        defaults: default_count
        volume: total_count
        threshold: 0.3
        metric_type: f2_score
        dataset: summary_data
```
Data Requirements¶
Record-Level Data¶
- One row per observation
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0 or 1)
- No missing values in key columns
Summary-Level Data¶
- One row per risk bucket/group
- Mean probability: numeric values between 0.0 and 1.0
- Default counts: non-negative integers
- Volume counts: positive integers
- Defaults should not exceed volume for any bucket
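A lightweight sanity check along these lines can catch bad inputs before the metrics run. The snippet below is only a sketch of the checks listed above, not part of the library:

```python
def check_record(prob: float, default: int) -> None:
    """Validate one record-level row against the requirements above."""
    assert prob is not None and default is not None, "no missing values in key columns"
    assert 0.0 <= prob <= 1.0, "probability must be between 0.0 and 1.0"
    assert default in (0, 1), "default column must be binary (0 or 1)"

def check_summary_bucket(mean_pd: float, defaults: int, volume: int) -> None:
    """Validate one summary-level bucket against the requirements above."""
    assert 0.0 <= mean_pd <= 1.0, "mean probability must be between 0.0 and 1.0"
    assert isinstance(defaults, int) and defaults >= 0, "default count must be a non-negative integer"
    assert isinstance(volume, int) and volume > 0, "volume count must be a positive integer"
    assert defaults <= volume, "defaults should not exceed volume for any bucket"
```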
Score Interpretation¶
Value Guidelines¶
- 1.0: Perfect classification (both precision and recall = 1.0)
- 0.8-1.0: Excellent classification performance
- 0.6-0.8: Good classification performance
- 0.4-0.6: Fair classification performance
- 0.0-0.4: Poor classification performance
F1 vs F2 Behavior¶
When recall > precision:
- F2 score will be higher than F1 score
- F2 moves closer to the recall value
- F1 provides a balanced middle ground

When precision > recall:
- F1 score will be higher than F2 score
- F2 penalizes low recall more heavily
- F1 provides better balance for this scenario
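For example, with precision = 0.5 and recall = 0.8 (recall > precision):

- F1 = 2 * (0.5 * 0.8) / (0.5 + 0.8) = 0.8 / 1.3 ≈ 0.615
- F2 = 5 * (0.5 * 0.8) / (4 * 0.5 + 0.8) = 2.0 / 2.8 ≈ 0.714

F2 (≈ 0.714) sits closer to the recall value (0.8) than F1 (≈ 0.615) does, as described above.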
Use Case Selection¶
Choose F1 Score When:¶
- False positives and false negatives have similar costs
- You need a balanced assessment of model performance
- General classification tasks
- Model comparison across different domains
- Benchmark reporting where balance is preferred
Choose F2 Score When:¶
- False negatives are more costly than false positives
- Medical diagnosis: Missing disease is worse than false alarms
- Fraud detection: Missing fraud costs more than false investigations
- Safety systems: Failing to detect danger has severe consequences
- Quality control: Missing defects is expensive
- Customer churn: Missing at-risk customers costs more than false interventions
Threshold Optimization¶
F1 Score Thresholds¶
- Often optimized around 0.4-0.6 range
- Balances precision and recall trade-offs
- Consider ROC curve analysis for selection
F2 Score Thresholds¶
- Often optimized at lower values (0.2-0.4)
- Prioritizes higher recall
- Accept more false positives for better recall
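A simple way to explore these ranges is a grid sweep over candidate thresholds, scoring each one on held-out data. The sketch below reuses the f_beta and record_level_counts helpers sketched earlier and is illustrative only, not a library feature:

```python
def best_threshold(probs: list[float], outcomes: list[int], beta: float) -> tuple[float, float]:
    """Sweep candidate thresholds and return (best_threshold, best_f_score)."""
    best_t, best_f = 0.5, 0.0
    for step in range(1, 100):
        t = step / 100  # candidate thresholds 0.01 .. 0.99
        cols = record_level_counts(probs, outcomes, threshold=t)
        f = f_beta(cols["precision"], cols["recall"], beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# On the same data, the F2-optimal threshold (beta=2.0) typically lands below the
# F1-optimal threshold (beta=1.0), because lower thresholds trade precision for recall.
```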
Important Notes¶
- Metric Selection: Choose F1 for balanced evaluation, F2 when recall is critical
- Threshold Impact: F2 often benefits from lower thresholds than F1
- Data Format: Record-level uses individual probabilities; summary-level treats entire buckets as positive/negative
- Summary Logic: In summary-level format, buckets with mean_pd ≥ threshold are treated as predicted positive (see the sketch after this list)
- Complementary Analysis: Use both metrics together for comprehensive evaluation
- Business Context: Let the cost of false negatives vs false positives guide metric choice
- Mathematical Relationship: F2 will always be between recall and F1 when recall > precision
- Edge Cases: When TP+FP=0 or TP+FN=0, the respective precision or recall becomes 0, making F-scores = 0
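To make the summary-level logic concrete, here is a minimal sketch of how pre-aggregated buckets could be reduced to confusion-matrix counts, assuming (as noted above) that every observation in a bucket inherits the bucket-level prediction. This is an assumed illustration, not the library's code:

```python
def summary_level_counts(buckets: list[dict], threshold: float = 0.5) -> dict:
    """Reduce summary-level buckets (mean_pd, defaults, volume) to TP/FP/FN counts."""
    tp = fp = fn = 0
    for b in buckets:
        if b["mean_pd"] >= threshold:          # whole bucket treated as predicted positive
            tp += b["defaults"]                # actual positives in a positive bucket
            fp += b["volume"] - b["defaults"]  # actual negatives in a positive bucket
        else:                                  # whole bucket treated as predicted negative
            fn += b["defaults"]                # actual positives that are missed
    return {"tp": tp, "fp": fp, "fn": fn}

# Hypothetical buckets for illustration only
buckets = [
    {"mean_pd": 0.72, "defaults": 40, "volume": 100},
    {"mean_pd": 0.31, "defaults": 10, "volume": 200},
]
counts = summary_level_counts(buckets, threshold=0.5)  # {'tp': 40, 'fp': 60, 'fn': 10}
```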