Hi, I’m working on fine-tuning a CLIP-style VLM for an image-text retrieval task in the medical domain (chest radiology), using French reports. I’m running into consistently poor results and would really appreciate some guidance.
Some context:
- Task: multimodal retrieval (image-text)
- chest X-rays (radiology)
- French (translated from English datasets, mainly Rexgradient plus some CheXpert)
- 18k image-text pairs, balanced across categories
- Models tested: MedCLIP, BiomedCLIP
- Hardware: 4×2080 Ti (planning to scale batch size later on stronger hardware)
What I tried:
- Full fine-tuning → strong overfitting, very poor Recall
- Swapping the text encoder with a French one
- Basic preprocessing, but translations are likely noisy/inconsistent
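For concreteness, here is a simplified NumPy sketch of the symmetric contrastive (InfoNCE) objective I'm optimizing during fine-tuning — this is an illustration of the loss, not my actual training code, which runs in PyTorch on the real encoders:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss on a batch of paired embeddings.

    Matched image/text pairs sit on the diagonal of the similarity
    matrix; each row/column is a softmax classification over the batch.
    """
    # L2-normalize so logits are cosine similarities scaled by 1/T
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent(l):
        # numerically stable log-softmax, cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

One relevant detail: the in-batch negatives make this loss very sensitive to effective batch size, which is part of why I mention scaling hardware later.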
Current issues:
- Recall@K is very low
- The model overfits quickly
Suspected bottlenecks:
- Translation noise (especially medical terminology + negations)
- Limited dataset size
- Fine-tuning strategy not optimal
Questions:
- Is translation the biggest bottleneck here? How important is translation quality for getting good retrieval results?
- Is it better to focus on cleaning/filtering the existing data, or to scale up with more translated data (even if noisy)?
- What fine-tuning strategy would you recommend here (full FT, partial freezing, adapters, LoRA, etc.)?
- Are there better starting points for multilingual/medical VLMs than MedCLIP/BiomedCLIP?
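To clarify what I mean by LoRA in the question above, here is a toy NumPy sketch of the idea — the class name, shapes, and hyperparameters are purely illustrative, not from any real library:

```python
import numpy as np

class LoRALinear:
    """Toy LoRA layer: y = x @ (W + (alpha/r) * B @ A).T

    The pretrained weight W stays frozen; only the small low-rank
    factors A and B are trained, cutting trainable parameters from
    out*in down to r*(in + out).
    """
    def __init__(self, w, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = w.shape
        self.w = w                                       # frozen pretrained weight
        self.a = rng.normal(0, 0.01, size=(r, in_dim))   # trainable, small random init
        self.b = np.zeros((out_dim, r))                  # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # zero-initialized B means the layer starts out exactly
        # equal to the frozen pretrained layer
        return x @ (self.w + self.scale * self.b @ self.a).T
```

In practice I'd apply this via a library like PEFT to the attention projections of the text encoder rather than hand-rolling it; the sketch is just to pin down which variant of "partial fine-tuning" I'm asking about.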
Any advice or recommendations would be super helpful. Thanks!