F1 and F2 Score Metrics¶
The f1_score and f2_score metrics calculate F-beta scores, which are weighted harmonic means of precision and recall. Each combines precision and recall into a single measure of classification performance; they differ in how much weight is placed on recall relative to precision.
Metric Types: f1_score, f2_score
F-Score Calculation¶
F-scores are calculated using the general F-beta formula: (1 + β²) * (precision * recall) / (β² * precision + recall)
Where:
- F1 Score (β=1): 2 * (precision * recall) / (precision + recall) - balanced measure
- F2 Score (β=2): 5 * (precision * recall) / (4 * precision + recall) - recall-weighted measure
- precision = TP / (TP + FP) - fraction of predicted positives that are actually positive
- recall = TP / (TP + FN) - fraction of actual positives that are correctly predicted
- TP = True Positives, FP = False Positives, FN = False Negatives
Both scores range from 0 to 1, where 1 indicates perfect classification performance.
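As a quick illustration of the formula above, the following standalone Python sketch computes both scores from confusion-matrix counts (the f_beta helper is purely illustrative and not part of the metric library):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta score from confusion-matrix counts, with the 0-denominator convention."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

# Example: 80 true positives, 20 false positives, 40 false negatives
print(f_beta(80, 20, 40, beta=1))  # F1 ≈ 0.727 (precision 0.80, recall 0.667)
print(f_beta(80, 20, 40, beta=2))  # F2 ≈ 0.690 (pulled toward the lower recall)
```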
Key Differences¶
| Metric | Beta (β) | Focus | Use Case |
|---|---|---|---|
| F1 Score | 1.0 | Balanced precision and recall | General classification, equal cost of FP and FN |
| F2 Score | 2.0 | Emphasizes recall over precision | High cost of false negatives (medical, fraud, safety) |
Configuration Fields¶
Record-Level Data Format¶
For individual observation records with probabilities and binary outcomes:
metrics:
  model_f1:
    metric_type: "f1_score"
    config:
      name: ["classification_f1"]
      data_format: "record_level"
      prob_def: "probability"        # Column with predicted probabilities (0-1)
      default: "actual_outcome"      # Column with binary outcomes (0/1)
      threshold: 0.5                 # Threshold for converting probabilities to predictions
      segment: [["model_version"]]   # Optional: segmentation columns
      dataset: "predictions"
  model_f2:
    metric_type: "f2_score"
    config:
      name: ["classification_f2"]
      data_format: "record_level"
      prob_def: "probability"
      default: "actual_outcome"
      threshold: 0.3                 # Often lower for F2 to maximize recall
      segment: [["model_version"]]
      dataset: "predictions"
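Conceptually, record-level scoring applies the configured threshold to each probability and then evaluates the resulting binary predictions. A minimal sketch of that logic using pandas and scikit-learn, with column names matching the config above (illustrative only, not the library's internal implementation):

```python
import pandas as pd
from sklearn.metrics import fbeta_score

predictions = pd.DataFrame({
    "probability":    [0.10, 0.35, 0.62, 0.80, 0.45],
    "actual_outcome": [0,    0,    1,    1,    1],
})

# threshold: 0.5 turns probabilities into 0/1 predictions
predicted = (predictions["probability"] >= 0.5).astype(int)

f1 = fbeta_score(predictions["actual_outcome"], predicted, beta=1)  # 0.80
f2 = fbeta_score(predictions["actual_outcome"], predicted, beta=2)  # ≈ 0.71
```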
Summary-Level Data Format¶
For pre-aggregated data grouped by risk buckets:
metrics:
  summary_f1:
    metric_type: "f1_score"
    config:
      name: ["aggregated_f1"]
      data_format: "summary_level"
      mean_pd: "mean_probability"    # Column with mean probabilities per bucket
      defaults: "default_count"      # Column with default counts per bucket
      volume: "observation_count"    # Column with total observations per bucket
      threshold: 0.5                 # Threshold for treating buckets as positive predictions
      segment: [["data_source"]]     # Optional: segmentation columns
      dataset: "risk_buckets"
  summary_f2:
    metric_type: "f2_score"
    config:
      name: ["aggregated_f2"]
      data_format: "summary_level"
      mean_pd: "mean_probability"
      defaults: "default_count"
      volume: "observation_count"
      threshold: 0.4
      segment: [["data_source"]]
      dataset: "risk_buckets"
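In summary-level mode, buckets whose mean probability meets the threshold are treated as predicted positive (see the notes at the end of this page). The sketch below shows one plausible way the confusion-matrix counts could then be derived from bucket counts; the exact aggregation is an assumption made for illustration, with column names from the config above:

```python
import pandas as pd

risk_buckets = pd.DataFrame({
    "mean_probability":  [0.05, 0.20, 0.45, 0.70],
    "default_count":     [2,    10,   30,   55],
    "observation_count": [400,  300,  100,  80],
})

threshold = 0.5
positive = risk_buckets["mean_probability"] >= threshold  # bucket predicted positive

# Assumed aggregation: defaults in positive buckets are TP, non-defaults in
# positive buckets are FP, and defaults in negative buckets are FN.
tp = risk_buckets.loc[positive, "default_count"].sum()
fp = (risk_buckets.loc[positive, "observation_count"]
      - risk_buckets.loc[positive, "default_count"]).sum()
fn = risk_buckets.loc[~positive, "default_count"].sum()

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```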
Required Fields by Format¶
Record-Level Required¶
- name: Metric name(s)
- data_format: Must be "record_level"
- prob_def: Predicted probability column name (values between 0.0 and 1.0)
- default: Binary outcome column name (values 0 or 1)
- dataset: Dataset reference
Summary-Level Required¶
- name: Metric name(s)
- data_format: Must be "summary_level"
- mean_pd: Mean probability column name (values between 0.0 and 1.0)
- defaults: Default count column name (non-negative numbers)
- volume: Volume count column name (positive numbers)
- dataset: Dataset reference
Optional Fields¶
- threshold: Classification threshold (default: 0.5, range: 0.0-1.0)
- segment: List of column names for grouping
Output Columns¶
Both metrics produce the following output columns:
- group_key: Segmentation group identifier (struct of segment values)
- volume: Total number of observations
- defaults: Total number of actual positives
- odr: Observed Default Rate (defaults/volume)
- pd: Mean predicted probability
- precision: Precision score (TP/(TP+FP))
- recall: Recall score (TP/(TP+FN))
- f_score: F1 or F2 score value (depending on metric type)
- tp: True Positives count
- fp: False Positives count
- fn: False Negatives count
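For illustration, a single output row for an f1_score metric segmented by model_version might look roughly like this (values are invented but internally consistent; the concrete representation depends on how results are stored):

```python
{
    "group_key": {"model_version": "v2"},
    "volume": 1000,
    "defaults": 120,
    "odr": 0.12,        # defaults / volume
    "pd": 0.11,         # mean predicted probability
    "precision": 0.62,  # 85 / (85 + 52)
    "recall": 0.71,     # 85 / (85 + 35)
    "f_score": 0.66,    # F1 here; F2 for f2_score metrics
    "tp": 85,
    "fp": 52,
    "fn": 35,
}
```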
Fan-out Examples¶
Comparative Analysis¶
metrics:
  f_score_comparison:
    metric_type: "f1_score"
    config:
      name: ["balanced_f1", "regional_f1"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.5
      segment: [[], ["region"]]
      dataset: "validation_data"
  recall_focused:
    metric_type: "f2_score"
    config:
      name: ["recall_f2", "regional_f2"]
      data_format: "record_level"
      prob_def: "risk_score"
      default: "default_flag"
      threshold: 0.3                 # Lower threshold for better recall
      segment: [[], ["region"]]
      dataset: "validation_data"
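The segment: [[], ["region"]] entry fans each metric out into one overall result plus one result per region. A rough pandas sketch of that grouping behaviour (illustrative only):

```python
import pandas as pd

validation_data = pd.DataFrame({
    "risk_score":   [0.2, 0.7, 0.6, 0.1],
    "default_flag": [0,   1,   1,   0],
    "region":       ["EU", "EU", "US", "US"],
})

# segment: [[], ["region"]] -> one overall group plus one group per region
overall = [("all", validation_data)]
per_region = list(validation_data.groupby("region"))

for group_key, frame in overall + per_region:
    predicted = (frame["risk_score"] >= 0.5).astype(int)
    # ...score `predicted` against frame["default_flag"] as in the earlier sketches
```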
Mixed Data Formats¶
metrics:
  detailed_scores:
    metric_type: "f1_score"
    config:
      name: ["record_f1"]
      data_format: "record_level"
      prob_def: "probability"
      default: "outcome"
      threshold: 0.5
      dataset: "detailed_data"
  summary_scores:
    metric_type: "f2_score"
    config:
      name: ["summary_f2"]
      data_format: "summary_level"
      mean_pd: "avg_probability"
      defaults: "default_count"
      volume: "total_count"
      threshold: 0.3
      dataset: "summary_data"
Data Requirements¶
Record-Level Data¶
- One row per observation
- Probability column: numeric values between 0.0 and 1.0
- Default column: binary values (0 or 1)
- No missing values in key columns
Summary-Level Data¶
- One row per risk bucket/group
- Mean probability: numeric values between 0.0 and 1.0
- Default counts: non-negative integers
- Volume counts: positive integers
- Defaults should not exceed volume for any bucket
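These requirements are easy to check up front. A small sketch of such validation for summary-level data, using the column names from the earlier example (not a built-in feature of the metrics):

```python
import pandas as pd

def validate_summary_level(df: pd.DataFrame) -> None:
    """Basic sanity checks for summary-level input data."""
    assert df["mean_probability"].between(0.0, 1.0).all(), "mean probabilities must be in [0, 1]"
    assert (df["default_count"] >= 0).all(), "default counts must be non-negative"
    assert (df["observation_count"] > 0).all(), "volumes must be positive"
    assert (df["default_count"] <= df["observation_count"]).all(), "defaults cannot exceed volume"
    assert not df[["mean_probability", "default_count", "observation_count"]].isna().any().any(), \
        "no missing values allowed in key columns"
```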
Score Interpretation¶
Value Guidelines¶
- 1.0: Perfect classification (both precision and recall = 1.0)
- 0.8-1.0: Excellent classification performance
- 0.6-0.8: Good classification performance
- 0.4-0.6: Fair classification performance
- 0.0-0.4: Poor classification performance
F1 vs F2 Behavior¶
When recall > precision:
- F2 score will be higher than F1 score
- F2 moves closer to the recall value
- F1 provides a balanced middle ground

When precision > recall:
- F1 score will be higher than F2 score
- F2 penalizes low recall more heavily
- F1 provides better balance for this scenario
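For example, with precision = 0.5 and recall = 0.9: F1 = 2 * (0.5 * 0.9) / (0.5 + 0.9) ≈ 0.64, while F2 = 5 * (0.5 * 0.9) / (4 * 0.5 + 0.9) ≈ 0.78, which sits much closer to the recall value.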
Use Case Selection¶
Choose F1 Score When:¶
- False positives and false negatives have similar costs
- You need a balanced assessment of model performance
- General classification tasks
- Model comparison across different domains
- Benchmark reporting where balance is preferred
Choose F2 Score When:¶
- False negatives are more costly than false positives
- Medical diagnosis: Missing disease is worse than false alarms
- Fraud detection: Missing fraud costs more than false investigations
- Safety systems: Failing to detect danger has severe consequences
- Quality control: Missing defects is expensive
- Customer churn: Missing at-risk customers costs more than false interventions
Threshold Optimization¶
F1 Score Thresholds¶
- Often optimized around 0.4-0.6 range
- Balances precision and recall trade-offs
- Consider ROC curve analysis for selection
F2 Score Thresholds¶
- Often optimized at lower values (0.2-0.4)
- Prioritizes higher recall
- Accept more false positives for better recall
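A common way to pick these thresholds is a simple sweep over candidate values on a validation set, keeping the value that maximises the chosen F-score. A hedged sketch using scikit-learn (grid spacing and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_threshold(y_true, y_prob, beta):
    """Return (threshold, score) maximising the F-beta score over a coarse grid."""
    best_t, best_score = 0.5, 0.0
    for t in np.arange(0.05, 0.96, 0.05):
        score = fbeta_score(y_true, (y_prob >= t).astype(int), beta=beta)
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score

# F1 thresholds typically land mid-range; F2 thresholds tend to come out lower:
# t1, _ = best_threshold(y_true, y_prob, beta=1)
# t2, _ = best_threshold(y_true, y_prob, beta=2)
```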
Important Notes¶
- Metric Selection: Choose F1 for balanced evaluation, F2 when recall is critical
- Threshold Impact: F2 often benefits from lower thresholds than F1
- Data Format: Record-level uses individual probabilities; summary-level treats entire buckets as positive/negative
- Summary Logic: In summary-level format, buckets with mean_pd ≥ threshold are treated as predicted positive
- Complementary Analysis: Use both metrics together for comprehensive evaluation
- Business Context: Let the cost of false negatives vs false positives guide metric choice
- Mathematical Relationship: F2 will always be between recall and F1 when recall > precision
- Edge Cases: When TP + FP = 0 or TP + FN = 0, the corresponding precision or recall is undefined and treated as 0, which makes the F-score 0