Edit: this was not a negative domain shift; it turned out to be an artifact caused by my poor prompting of the LLM I used.
Hi everyone,
I’ve been working on a multi-modal ECG foundation model (Diagnostic + Segmentation) and just finished the final benchmark phase. The results are hitting numbers that feel like SOTA, but I’d love some "sanity check" feedback from people who specialize in medical AI or signal processing.
The Setup:
To ensure these weren't "hallucinated" benchmarks, I used a Tri-Vault validation strategy:
- CinC (Diagnostic): held-out test set for the primary multi-label diagnostic metrics.
- MIMIC (Generalization): separate holdout used only to measure domain shift.
- LUDB (Structural): used strictly for U-Net waveform segmentation precision.
The Results:
======================================================================
V11 DEFINITIVE CLINICAL IMPACT REPORT
======================================================================
[1] PRIMARY DIAGNOSTIC (CinC Test Set)
Macro AUC : 0.8896 | Micro AUC : 0.9267
Macro F1 : 0.3687 | Micro F1 : 0.5723
[2] DOMAIN GENERALIZATION (MIMIC Holdout Set)
Macro AUC : 0.9195 | Micro AUC : 0.9629
Macro F1 : 0.4364 | Micro F1 : 0.6733
[3] STRUCTURAL PRECISION (LUDB Test Set)
Foreground Dice : 0.9531
======================================================================
precision recall f1-score support
NORM 0.69 0.80 0.74 17963
AFIB 0.87 0.59 0.70 5246
AFLT 0.79 0.14 0.24 904
PAC 0.70 0.59 0.64 2607
PVC 0.86 0.72 0.78 3176
LBBB 0.90 0.75 0.81 2263
RBBB 0.91 0.78 0.84 4266
1AVB 0.66 0.72 0.69 4373
2AVB 0.06 0.33 0.10 12
3AVB 0.00 0.00 0.00 0
AMI 0.73 0.47 0.57 6645
ISCH 0.74 0.42 0.54 6553
IRBBB 0.00 0.00 0.00 1530
LAnFB 0.90 0.72 0.80 5774
BRADY 0.91 0.86 0.89 8432
TACHY 0.92 0.90 0.91 5685
LPR 0.19 0.10 0.13 882
QAB 0.00 0.00 0.00 34
TAB 0.48 0.36 0.41 9925
TINV 0.00 0.00 0.00 11
STE 0.53 0.28 0.37 624
STD 0.00 0.00 0.00 0
WPW 0.43 0.39 0.41 59
LVH 0.78 0.37 0.51 4468
RVH 0.30 0.22 0.25 400
VFLT 0.00 0.00 0.00 0
LQRSV 0.55 0.36 0.43 4643
* Classes with 0 representatives in the test fold/dataset (3 in total: 3AVB, STD, VFLT) were still included when computing the macro-averaged metrics above.
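Including zero-support classes in a macro average silently drags the mean toward zero. A minimal sketch (toy random data, not the real labels) showing the effect and how to restrict the average to classes actually present in the fold:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label setup: 5 classes, class 4 has zero positives in the test fold
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 5))
y_true[:, 4] = 0                       # zero-support class
y_pred = rng.integers(0, 2, size=(1000, 5))

# Macro F1 over ALL classes: the empty class contributes a hard 0.0
macro_all = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Restrict the macro average to classes with at least one positive
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
present = np.flatnonzero(y_true.sum(axis=0) > 0)
macro_present = per_class[present].mean()

print(f"macro over all classes:     {macro_all:.3f}")
print(f"macro over present classes: {macro_present:.3f}")
```

Whether you exclude them is a reporting choice; the key is to state it explicitly, since with 3 empty classes out of 27 the two numbers differ noticeably.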
The "Metric Gap" Observation:
While the Micro AUC (0.9629) suggests high ranking power, the Macro F1 (0.4364) reveals the model is struggling significantly with minority class recall. For example, 2nd-Degree AV Block (2AVB) sits at 0.10 F1, while Tachycardia is at 0.91.
The model shows a clear "Home-Domain Bias"—it performs better on the noisy ICU data (MIMIC) than on the curated clinical set (CinC), likely because the training distribution was heavily weighted toward MIMIC.
The Disclosure:
I’m not from this field, so I’m trying to distinguish between "strong baseline results" and "over-optimistic artifacts."
————————————
Everything above this line was written by an LLM.
Questions:
1. How can the model achieve such a high AUC-ROC yet such a low macro F1?
2. How would you tackle the extremely low-F1 classes with very little representation in the dataset? Should they be excluded entirely?
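On question 2, one common alternative to excluding rare classes is up-weighting their positive term in the loss. A numpy sketch of inverse-frequency positive weighting for a multi-label BCE; the `supports` and `n_samples` below are illustrative, not the real training counts:

```python
import numpy as np

# Illustrative per-class positive counts (e.g. NORM, AFIB, 2AVB, STE)
supports = np.array([17963, 5246, 12, 624])
n_samples = 80000                          # hypothetical training-set size

neg = n_samples - supports
pos_weight = neg / supports                # ~= negatives per positive
pos_weight = np.clip(pos_weight, 1.0, 100.0)   # cap to avoid instability

def weighted_bce(logits, targets, pos_weight):
    """Multi-label BCE with per-class positive weighting (numpy sketch)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    loss = -(pos_weight * targets * np.log(p + eps)
             + (1 - targets) * np.log(1 - p + eps))
    return loss.mean()

# A missed rare-class positive now costs far more than a common-class one
logits = np.zeros((1, 4))
rare_miss = weighted_bce(logits, np.array([[0., 0., 1., 0.]]), pos_weight)
common_miss = weighted_bce(logits, np.array([[1., 0., 0., 0.]]), pos_weight)
print(rare_miss, common_miss)
```

The same idea maps onto the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss` if that's the training stack. Whether to also exclude classes like 3AVB/STD/VFLT from *evaluation* is separate: with literally zero test positives they can't be scored at all, so reporting them as excluded is standard.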
I’m not sure whether these values are truly competitive or just overhyped by the LLM, so any clarity/feedback would be appreciated.
*Micro AUC is most likely inflated by the majority classes.
Please help, because I don’t want to suffer from AI-induced delirium. Thank you for your time!