Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

Zijian Zhang; Chang Shu; Ya Xiao; Yuan Shen; Di Zhu; Jing Xiao; Youxin Chen; Jey Han Lau; Qian Zhang; Zheng Lu

School of Computer Science

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.
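The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes that "a combination of simple pooling methods" means a softmax-weighted blend of mean and max pooling, and that "a group of negative samples" means the k highest-scoring negatives. The function names, the `logits` parameter, and the fixed `k` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_pool(features, logits):
    """Blend mean pooling and max pooling of equal-length feature vectors.

    `logits` holds two scalars that a model would learn (an assumption);
    their softmax gives the weight of each pooling method.
    """
    w_mean, w_max = softmax(logits)
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    peak = [max(f[d] for f in features) for d in range(dim)]
    return [w_mean * a + w_max * b for a, b in zip(mean, peak)]


def topk_triplet_loss(pos_sim, neg_sims, margin=0.2, k=1):
    """Hinge loss averaged over the k most similar negatives.

    With k=1 this reduces to the hard triplet loss; a larger k considers
    a group of negatives. `neg_sims` must be non-empty and k >= 1.
    """
    hardest = sorted(neg_sims, reverse=True)[:k]
    return sum(max(0.0, margin + s - pos_sim) for s in hardest) / len(hardest)


# Two region features of dimension 2, equal weight on both pooling methods.
pooled = adaptive_pool([[1.0, 4.0], [3.0, 2.0]], [0.0, 0.0])
# Similarity of the matched pair versus three mismatched candidates.
loss_hard = topk_triplet_loss(0.8, [0.7, 0.3, 0.9], margin=0.2, k=1)
loss_group = topk_triplet_loss(0.8, [0.7, 0.3, 0.9], margin=0.2, k=2)
```

In this sketch `k` is fixed by the caller, whereas the abstract describes selecting the group of negatives dynamically during training; how that selection is made is not specified in this record.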

Original language	English
Title of host publication	EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Pages	1209-1221
Number of pages	13
ISBN (Electronic)	9781959429449
Publication status	Published - 2023
Event	17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Dubrovnik, Croatia
Duration	2 May 2023 → 6 May 2023

Publication series

Name	EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference	17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory	Croatia
City	Dubrovnik
Period	2/05/23 → 6/05/23

ASJC Scopus subject areas

Computational Theory and Mathematics
Software
Linguistics and Language

Cite this

Zhang, Z., Shu, C., Xiao, Y., Shen, Y., Zhu, D., Xiao, J., Chen, Y., Lau, J. H., Zhang, Q., & Lu, Z. (2023). Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 1209-1221). (EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference). Association for Computational Linguistics (ACL).

Zhang, Zijian ; Shu, Chang ; Xiao, Ya et al. / Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2023. pp. 1209-1221 (EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference).

@inproceedings{fe1f5daf38e544d8a5449545cc5b0440,

title = "Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective",

abstract = "Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.",

author = "Zijian Zhang and Chang Shu and Ya Xiao and Yuan Shen and Di Zhu and Jing Xiao and Youxin Chen and Lau, {Jey Han} and Qian Zhang and Zheng Lu",

note = "Publisher Copyright: {\textcopyright} 2023 Association for Computational Linguistics.; 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 ; Conference date: 02-05-2023 Through 06-05-2023",

year = "2023",

language = "English",

series = "EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference",

publisher = "Association for Computational Linguistics (ACL)",

pages = "1209--1221",

booktitle = "EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference",

address = "United States",

}

Zhang, Z, Shu, C, Xiao, Y, Shen, Y, Zhu, D, Xiao, J, Chen, Y, Lau, JH, Zhang, Q & Lu, Z 2023, Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. in EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, Association for Computational Linguistics (ACL), pp. 1209-1221, 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, 2/05/23.

Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. / Zhang, Zijian; Shu, Chang; Xiao, Ya et al.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2023. p. 1209-1221 (EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference).

TY - GEN

T1 - Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

AU - Zhang, Zijian

AU - Shu, Chang

AU - Xiao, Ya

AU - Shen, Yuan

AU - Zhu, Di

AU - Xiao, Jing

AU - Chen, Youxin

AU - Lau, Jey Han

AU - Zhang, Qian

AU - Lu, Zheng

PY - 2023

Y1 - 2023

N2 - Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.

AB - Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.

UR - http://www.scopus.com/inward/record.url?scp=85159861686&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85159861686

T3 - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

SP - 1209

EP - 1221

BT - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

T2 - 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023

Y2 - 2 May 2023 through 6 May 2023

ER -

Zhang Z, Shu C, Xiao Y, Shen Y, Zhu D, Xiao J et al. Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL). 2023. p. 1209-1221. (EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference).
