u/ManyAggravating7778

Hi, I’m working on fine-tuning a CLIP-style VLM for an image-text retrieval task in the medical domain (chest radiology), using French reports. I’m running into consistently poor results and would really appreciate some guidance.

Some context:
- Task: multimodal image-text retrieval
- Domain: chest X-rays (radiology)
- Language: French (translated from English datasets, mainly ReXGradient plus some CheXpert)
- 18k image-text pairs, balanced across categories
- Models tested: MedCLIP, BiomedCLIP
- Hardware: 4×2080 Ti (planning to scale batch size later on stronger hardware)

What I tried:
- Full fine-tuning → strong overfitting, very poor recall
- Swapping the text encoder for a French one
- Basic preprocessing, though the translations are likely noisy/inconsistent
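For reference, I'm optimizing the standard symmetric CLIP contrastive (InfoNCE) objective over image/text embedding pairs. A minimal numpy sketch of the loss (simplified: batch-level only, no encoders, temperature fixed at 0.07, and the function name is just mine):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize both modalities so logits are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); matching pairs sit on the diagonal
    labels = np.arange(logits.shape[0])

    def xent(l):
        # row-wise softmax cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average of the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

One thing I suspect matters here: with 4×2080 Ti my per-GPU batch is small, and this loss only sees in-batch negatives, so the effective number of negatives is tiny compared to the batches CLIP-style models were pretrained with.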

Current issues:
- Recall is very low
- The model overfits quickly

Suspected bottlenecks:
- Translation noise (especially medical terminology + negations)
- Limited dataset size
- Fine-tuning strategy not optimal

Questions:

  1. Is translation quality the biggest bottleneck here? How important is a near-perfect translation for getting good retrieval results?
  2. Is it better to:
    - focus on cleaning/filtering the existing data,
    - or scale up with more translated data (even if noisy)?
  3. What fine-tuning strategy would you recommend here? (freezing, partial FT, adapters, LoRA, etc.)
  4. Are there better starting points for multilingual/medical VLMs than MedCLIP/BiomedCLIP?

Any advice or recommendations would be super helpful. Thanks!
