Edit: this was not a negative domain shift; it turned out to be an artifact caused by my poor prompting of the LLM I used.
Hi everyone,
I’ve been working on a multi-modal ECG foundation model (Diagnostic + Segmentation) and just finished the final benchmark phase. The results are hitting numbers that feel like SOTA, but I’d love some "sanity check" feedback from people who specialize in medical AI or signal processing.
The Setup:
To ensure these weren't "hallucinated" benchmarks, I used a Tri-Vault validation strategy:
- CinC (Diagnostic): held-out test set for the primary multi-label diagnostic metrics.
- MIMIC (Generalization): separate holdout used only to measure domain shift.
- LUDB (Structural): used strictly for U-Net waveform segmentation precision.
The Results:
======================================================================
V11 DEFINITIVE CLINICAL IMPACT REPORT
======================================================================
[1] PRIMARY DIAGNOSTIC (CinC Test Set)
Macro AUC : 0.8896 | Micro AUC : 0.9267
Macro F1 : 0.3687 | Micro F1 : 0.5723
[2] DOMAIN GENERALIZATION (MIMIC Holdout Set)
Macro AUC : 0.9195 | Micro AUC : 0.9629
Macro F1 : 0.4364 | Micro F1 : 0.6733
[3] STRUCTURAL PRECISION (LUDB Test Set)
Foreground Dice : 0.9531
======================================================================
precision recall f1-score support
NORM 0.69 0.80 0.74 17963
AFIB 0.87 0.59 0.70 5246
AFLT 0.79 0.14 0.24 904
PAC 0.70 0.59 0.64 2607
PVC 0.86 0.72 0.78 3176
LBBB 0.90 0.75 0.81 2263
RBBB 0.91 0.78 0.84 4266
1AVB 0.66 0.72 0.69 4373
2AVB 0.06 0.33 0.10 12
3AVB 0.00 0.00 0.00 0
AMI 0.73 0.47 0.57 6645
ISCH 0.74 0.42 0.54 6553
IRBBB 0.00 0.00 0.00 1530
LAnFB 0.90 0.72 0.80 5774
BRADY 0.91 0.86 0.89 8432
TACHY 0.92 0.90 0.91 5685
LPR 0.19 0.10 0.13 882
QAB 0.00 0.00 0.00 34
TAB 0.48 0.36 0.41 9925
TINV 0.00 0.00 0.00 11
STE 0.53 0.28 0.37 624
STD 0.00 0.00 0.00 0
WPW 0.43 0.39 0.41 59
LVH 0.78 0.37 0.51 4468
RVH 0.30 0.22 0.25 400
VFLT 0.00 0.00 0.00 0
LQRSV 0.55 0.36 0.43 4643
* Classes with 0 representatives in the test fold/dataset (3 in total: 3AVB, STD, VFLT) were still included when computing the macro-averaged metrics above.
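Including zero-support classes in a macro average silently drags the mean toward zero. A minimal sketch (toy random data, not the real labels) showing the effect and how to restrict the average to classes actually present in the fold:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label setup: 5 classes, class 4 has zero positives in the test fold
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 5))
y_true[:, 4] = 0                       # zero-support class
y_pred = rng.integers(0, 2, size=(1000, 5))

# Macro F1 over ALL classes: the empty class contributes a hard 0.0
macro_all = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Restrict the macro average to classes with at least one positive
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
present = np.flatnonzero(y_true.sum(axis=0) > 0)
macro_present = per_class[present].mean()

print(f"macro over all classes:     {macro_all:.3f}")
print(f"macro over present classes: {macro_present:.3f}")
```

Whether you exclude them is a reporting choice; the key is to state it explicitly, since with 3 empty classes out of 27 the two numbers differ noticeably.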
The "Metric Gap" Observation:
While the Micro AUC (0.9629) suggests high ranking power, the Macro F1 (0.4364) reveals the model is struggling significantly with minority class recall. For example, 2nd-Degree AV Block (2AVB) sits at 0.10 F1, while Tachycardia is at 0.91.
The model shows a clear "Home-Domain Bias"—it performs better on the noisy ICU data (MIMIC) than on the curated clinical set (CinC), likely because the training distribution was heavily weighted toward MIMIC.
The Disclosure:
I’m not from this field, so I’m trying to distinguish between "strong baseline results" and "over-optimistic artifacts."
————————————
Everything above this line was written by an LLM.
Questions:
1. How can the model achieve such a high AUC-ROC yet such a low macro F1?
2. How would you tackle the extremely low-F1 classes with very little representation in the dataset? Should they be excluded entirely?
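On question 2, one common alternative to excluding rare classes is up-weighting their positive term in the loss. A numpy sketch of inverse-frequency positive weighting for a multi-label BCE; the `supports` and `n_samples` below are illustrative, not the real training counts:

```python
import numpy as np

# Illustrative per-class positive counts (e.g. NORM, AFIB, 2AVB, STE)
supports = np.array([17963, 5246, 12, 624])
n_samples = 80000                          # hypothetical training-set size

neg = n_samples - supports
pos_weight = neg / supports                # ~= negatives per positive
pos_weight = np.clip(pos_weight, 1.0, 100.0)   # cap to avoid instability

def weighted_bce(logits, targets, pos_weight):
    """Multi-label BCE with per-class positive weighting (numpy sketch)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    loss = -(pos_weight * targets * np.log(p + eps)
             + (1 - targets) * np.log(1 - p + eps))
    return loss.mean()

# A missed rare-class positive now costs far more than a common-class one
logits = np.zeros((1, 4))
rare_miss = weighted_bce(logits, np.array([[0., 0., 1., 0.]]), pos_weight)
common_miss = weighted_bce(logits, np.array([[1., 0., 0., 0.]]), pos_weight)
print(rare_miss, common_miss)
```

The same idea maps onto the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss` if that's the training stack. Whether to also exclude classes like 3AVB/STD/VFLT from *evaluation* is separate: with literally zero test positives they can't be scored at all, so reporting them as excluded is standard.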
I’m not sure whether these values are truly competitive or just overhyped by the LLM, so any clarity/feedback would be appreciated.
*Micro AUC is most likely inflated by the majority classes.
Please help, because I don’t want to suffer from AI-induced delirium. Thank you for your time!