MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning

Chang Shu, Yi Zhu, Xiaochu Tang, Jing Xiao, Youxin Chen, Xiu Li, Qian Zhang, Zheng Lu

Research output: Journal Publication › Article › peer-review

1 Citation (Scopus)


Multimodal medical pretrained models have recently come to play a significant role in automatic medical image and text analysis, which has wide social and economic impact in healthcare. Although they can be quickly transferred to downstream tasks, these models are greatly limited because they can only be pretrained on professional medical image–text datasets, which usually contain very few samples. In this work we propose MITER (Medical Image–Text Joint Adaptive Pretraining), a joint adaptive pretraining framework based on multi-level contrastive learning. It overcomes this limitation by pretraining image and text models for the medical domain while utilizing existing models pretrained on generic data, which contain an enormous number of samples. MITER features two types of objectives. The first type consists of uni-modal objectives that pretrain the models on medical images and text separately, on uni-modal tasks. The second type is a cross-modal objective that pretrains the models jointly, allowing them to influence each other on cross-modal tasks. We also introduce a strategy that dynamically selects hard negative samples during training for better performance. Experimental results on four medical tasks (image-report retrieval, multi-label image classification, visual question answering, and report generation) show that MITER overcomes this limitation, greatly outperforming existing benchmark models on all of the tasks. The source code of our framework is available online.
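The abstract does not spell out the loss formulation, so the following is only a minimal sketch of the general idea behind a cross-modal contrastive objective with dynamically selected hard negatives. It assumes an InfoNCE-style image-to-text loss in which each image is contrasted with its paired report and with the few non-matching reports that are currently most similar to it. The function names, the `temperature` and `num_hard` parameters, and the toy embeddings are hypothetical and are not taken from the MITER source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_info_nce(image_embs, text_embs, temperature=0.1, num_hard=2):
    """Image-to-text InfoNCE loss restricted to the hardest negatives.

    Assumption (illustrative only): image_embs[i] and text_embs[i] form a
    matched pair, and every other text in the batch is a negative for image i.
    Negatives are re-ranked on every call, so the selection is "dynamic".
    """
    n = len(image_embs)
    losses = []
    for i in range(n):
        # Temperature-scaled similarity of image i to every text in the batch.
        sims = [cosine(image_embs[i], text_embs[j]) / temperature for j in range(n)]
        positive = sims[i]
        # Keep only the num_hard most similar non-matching texts.
        negatives = sorted(
            (sims[j] for j in range(n) if j != i), reverse=True
        )[:num_hard]
        logits = [positive] + negatives
        # Numerically stable log-sum-exp over the positive and hard negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - positive)
    return sum(losses) / n


if __name__ == "__main__":
    images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print("aligned loss: ", hard_negative_info_nce(images, aligned_texts))
    print("shuffled loss:", hard_negative_info_nce(images, shuffled_texts))
```

In this toy setup the loss is close to zero when every image is most similar to its own report and grows large when the pairs are shuffled. A symmetric text-to-image term could be added by calling the same function with the arguments swapped.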

Original language: English
Article number: 121526
Journal: Expert Systems with Applications
Publication status: Published - 15 Mar 2024


Keywords

  • Adaptive pretraining
  • Contrastive learning
  • Cross-modal pretraining
  • Medical pretrained model
  • Self-supervised learning

ASJC Scopus subject areas

  • General Engineering
  • Computer Science Applications
  • Artificial Intelligence

