TY - GEN
T1 - Distribution-Based Masked Medical Vision-Language Model Using Structured Reports
AU - Gowda, Shreyank N.
AU - Zhang, Ruichi
AU - Gu, Xiao
AU - Weng, Ying
AU - Yang, Lu
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Medical image-language pre-training aims to align medical images with clinically relevant text to improve model performance on various downstream tasks. However, existing models often struggle with the variability and ambiguity inherent in medical data, limiting their ability to capture nuanced clinical information and uncertainty. This work introduces an uncertainty-aware medical image-text pre-training model that enhances generalization capabilities in medical image analysis. Building on previous methods and focusing on Chest X-Rays, our approach utilizes structured text reports generated by a large language model (LLM) to augment image data with clinically relevant context. These reports begin with a definition of the disease, followed by the ‘appearance’ section to highlight critical regions of interest, and finally ‘observations’ and ‘verdicts’ that ground model predictions in clinical semantics. By modeling both inter- and intra-modal uncertainty, our framework captures the inherent ambiguity in medical images and text, yielding improved representations and performance on downstream tasks. Our model demonstrates significant advances in medical image-text pre-training, obtaining state-of-the-art performance on multiple downstream tasks.
KW - Chest X-Ray
KW - Uncertainty
KW - Vision-Language
UR - https://www.scopus.com/pages/publications/105031299414
U2 - 10.1007/978-3-032-17611-0_1
DO - 10.1007/978-3-032-17611-0_1
M3 - Conference contribution
AN - SCOPUS:105031299414
SN - 9783032176103
T3 - Lecture Notes in Computer Science
SP - 1
EP - 11
BT - Interpretability of Machine Intelligence in Medical Image Computing - 8th International Workshop, iMIMIC 2025, Held in Conjunction with MICCAI 2025, Proceedings
A2 - Reyes, Mauricio
A2 - Henriques Abreu, Pedro
A2 - Cardoso, Jaime
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th International Workshop on Interpretability of Machine Intelligence in Medical Image Computing, iMIMIC 2025, held in conjunction with the 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
Y2 - 27 September 2025
ER -