MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning

Chang Shu; Yi Zhu; Xiaochu Tang; Jing Xiao; Youxin Chen; Xiu Li; Qian Zhang; Zheng Lu

doi:10.1016/j.eswa.2023.121526

School of Computer Science

Research output: Journal Publication › Article › peer-review

13 Citations (Scopus)

Abstract

Multimodal medical pretraining models have recently come to play a significant role in automatic medical image and text analysis, which has wide social and economic impact in healthcare. Although these models transfer quickly to downstream tasks, they are greatly limited because they can only be pretrained on professional medical image–text datasets, which usually contain very few samples. In this work we propose MITER (Medical Image–Text Joint adaptive Pretraining), a joint adaptive pretraining framework based on multi-level contrastive learning. It overcomes this limitation by pretraining image and text models for the medical domain while reusing existing models pretrained on generic data, which contains an enormous number of samples. MITER features two types of objectives. The first type consists of uni-modal objectives that pretrain the models on medical images and text separately, on uni-modal tasks. The second is a cross-modal objective that pretrains the models jointly, allowing them to influence each other on cross-modal tasks. We also introduce a strategy that dynamically selects hard negative samples during training for better performance. Experimental results on four medical tasks (image-report retrieval, multi-label image classification, visual question answering, and report generation) show that MITER addresses this limitation, greatly outperforming existing benchmark models on all four tasks. The source code of our framework is available online.
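The abstract does not give the exact form of the cross-modal objective or of the hard negative selection strategy. As a rough illustration of the general idea only, the following sketch computes a standard image-to-text InfoNCE-style contrastive loss in which each image is contrasted against its paired report and the k most similar non-matching reports in the batch. The function names, the temperature value, the choice of k, and the toy embeddings are all assumptions made for this example; none of them come from the MITER paper or its source code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_negatives(sim_row, positive_index, k):
    """Indices of the k non-matching items most similar to the anchor."""
    candidates = [j for j in range(len(sim_row)) if j != positive_index]
    candidates.sort(key=lambda j: sim_row[j], reverse=True)
    return candidates[:k]


def contrastive_loss(image_embs, text_embs, temperature=0.1, k=2):
    """Image-to-text InfoNCE loss using only the k hardest in-batch negatives.

    Assumes image_embs[i] and text_embs[i] form a matching pair.
    This is a generic sketch, not the objective defined in the paper.
    """
    total = 0.0
    for i, image in enumerate(image_embs):
        # Temperature-scaled similarity of this image to every text in the batch.
        sims = [cosine_similarity(image, text) / temperature for text in text_embs]
        negatives = hardest_negatives(sims, i, k)
        logits = [sims[i]] + [sims[j] for j in negatives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - sims[i]
    return total / len(image_embs)


# Toy 2-D embeddings (hypothetical data, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
misaligned_texts = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

aligned_loss = contrastive_loss(images, aligned_texts)
misaligned_loss = contrastive_loss(images, misaligned_texts)
```

In this toy setup the loss is lower when each image embedding points in the same direction as its paired text embedding than when the first two pairs are swapped. A symmetric variant would add the text-to-image direction and average the two terms.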

Original language	English
Article number	121526
Journal	Expert Systems with Applications
Volume	238
DOIs	https://doi.org/10.1016/j.eswa.2023.121526
Publication status	Published - 15 Mar 2024

Keywords

Adaptive pretraining
Contrastive learning
Cross-modal pretraining
Medical pretrained model
Self-supervised learning

ASJC Scopus subject areas

General Engineering
Computer Science Applications
Artificial Intelligence

Cite this

@article{ccd6ae58c8a5488ca409cb5db92152bf,
  title = "MITER: Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning",
  abstract = "Recently multimodal medical pretraining models play a significant role in automatic medical image and text analysis that has wide social and economical impact in healthcare. Despite being able to be quickly transferred to downstream tasks, the models are greatly limited due to the fact that these models can only be pretrained with professional medical image–text datasets, which usually contain a very small number of samples. In this work we propose MITER (Medical Image–Text Joint adaptive Pretraining), a joint adaptive pretraining framework via multi-level contrastive learning to overcome this limitation by pretraining image and text models for medical domain and utilizing existing models pretrained on generic data, which contain enormous number of samples. MITER features two types of objectives to solve the problem. The first type is uni-modal objectives that pretrain the models with medical images and text separately on uni-modal tasks. The other type is a cross-modal objective that pretrains jointly, allowing the models to influence each other on cross-modal tasks. We also introduce a strategy to dynamically select hard negative samples during the training process for better performance. Experimental results over four medical tasks, image-report retrieval, multi-label image classification, visual question answering, and report generation, show that our MITER framework solves the limitation problem by greatly outperforming existing benchmark models on all the tasks. The source code of our framework is available online.",
  keywords = "Adaptive pretraining, Contrastive learning, Cross-modal pretraining, Medical pretrained model, Self-supervised learning",
  author = "Chang Shu and Yi Zhu and Xiaochu Tang and Jing Xiao and Youxin Chen and Xiu Li and Qian Zhang and Zheng Lu",
  note = "Publisher Copyright: {\textcopyright} 2023",
  year = "2024",
  month = mar,
  day = "15",
  doi = "10.1016/j.eswa.2023.121526",
  language = "English",
  volume = "238",
  journal = "Expert Systems with Applications",
  issn = "0957-4174",
  publisher = "Elsevier Ltd.",
}

TY - JOUR
T1 - MITER
T2 - Medical Image–TExt joint adaptive pretRaining with multi-level contrastive learning
AU - Shu, Chang
AU - Zhu, Yi
AU - Tang, Xiaochu
AU - Xiao, Jing
AU - Chen, Youxin
AU - Li, Xiu
AU - Zhang, Qian
AU - Lu, Zheng
PY - 2024/3/15
Y1 - 2024/3/15
AB - Recently multimodal medical pretraining models play a significant role in automatic medical image and text analysis that has wide social and economical impact in healthcare. Despite being able to be quickly transferred to downstream tasks, the models are greatly limited due to the fact that these models can only be pretrained with professional medical image–text datasets, which usually contain a very small number of samples. In this work we propose MITER (Medical Image–Text Joint adaptive Pretraining), a joint adaptive pretraining framework via multi-level contrastive learning to overcome this limitation by pretraining image and text models for medical domain and utilizing existing models pretrained on generic data, which contain enormous number of samples. MITER features two types of objectives to solve the problem. The first type is uni-modal objectives that pretrain the models with medical images and text separately on uni-modal tasks. The other type is a cross-modal objective that pretrains jointly, allowing the models to influence each other on cross-modal tasks. We also introduce a strategy to dynamically select hard negative samples during the training process for better performance. Experimental results over four medical tasks, image-report retrieval, multi-label image classification, visual question answering, and report generation, show that our MITER framework solves the limitation problem by greatly outperforming existing benchmark models on all the tasks. The source code of our framework is available online.

KW - Adaptive pretraining
KW - Contrastive learning
KW - Cross-modal pretraining
KW - Medical pretrained model
KW - Self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85175098202&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2023.121526
DO - 10.1016/j.eswa.2023.121526
M3 - Article
AN - SCOPUS:85175098202
SN - 0957-4174
VL - 238
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 121526
ER -
