TY - JOUR
T1 - CLIMS++
T2 - Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation
AU - Xie, Jinheng
AU - Deng, Songhe
AU - Hou, Xianxu
AU - Luo, Zhaochuan
AU - Shen, Linlin
AU - Huang, Yawen
AU - Zheng, Yefeng
AU - Shou, Mike Zheng
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/8
Y1 - 2025/8
N2 - While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes lots of class-related backgrounds. Without pixel-level supervisions, it could be very difficult to enlarge the foreground activation and suppress those false activation of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 datasets show that our CLIMS++ significantly outperforms the previous state-of-the-art methods.
AB - While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes lots of class-related backgrounds. Without pixel-level supervisions, it could be very difficult to enlarge the foreground activation and suppress those false activation of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 datasets show that our CLIMS++ significantly outperforms the previous state-of-the-art methods.
KW - Multi-modal learning
KW - Semantic segmentation
KW - Weakly-supervised learning
UR - https://www.scopus.com/pages/publications/105004450811
U2 - 10.1007/s11263-025-02442-2
DO - 10.1007/s11263-025-02442-2
M3 - Article
AN - SCOPUS:105004450811
SN - 0920-5691
VL - 133
SP - 5569
EP - 5588
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 8
ER -