ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization

Zijian Zhang; Chang Shu; Youxin Chen; Jing Xiao; Qian Zhang; Lu Zheng

doi:10.1109/IJCNN55064.2022.9892884

School of Computer Science

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

7 Citations (Scopus)

Abstract

Integrating multimodal knowledge into abstractive summarization is an active research area, and present techniques follow a fusion-then-generation paradigm. Because of the semantic gap between computer vision and natural language processing, current methods often treat multiple data points as separate objects and rely on attention mechanisms to find the connections needed to fuse them. In addition, many frameworks lack awareness of cross-modal matching, which reduces performance. To address these two drawbacks, we propose an Iterative Contrastive Alignment Framework (ICAF) that uses recurrent alignment and contrast to capture the coherence between images and texts. Specifically, we design a recurrent alignment (RA) layer to gradually investigate fine-grained semantic relationships between image patches and text tokens. At each step of the encoding process, cross-modal contrastive losses are applied to directly optimize the embedding space. According to ROUGE, relevance scores, and human evaluation, our model outperforms the state-of-the-art baselines on the MSMO dataset. Experiments on the applicability of the proposed framework and on hyperparameter settings have also been conducted.
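The abstract does not give the exact form of the cross-modal contrastive loss, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. The function name `cross_modal_contrastive_loss`, the `temperature` value, and the use of pooled per-sample embeddings are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed, generic formulation).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same image-text pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i and text j, scaled by temperature.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])

    # Image-to-text: each image should pick out its own text (row-wise).
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each text should pick out its own image (column-wise).
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Matched pairs should score a lower loss than mismatched pairs.
    print(cross_modal_contrastive_loss(emb, emb))
    print(cross_modal_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Under this assumed formulation, applying the loss "at each step" would amount to evaluating it on the intermediate image and text representations after every alignment step and adding the results to the summarization objective; how ICAF weights or combines these terms is not stated in the abstract.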

Original language	English
Title of host publication	2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781728186719
DOIs	https://doi.org/10.1109/IJCNN55064.2022.9892884
Publication status	Published - 2022
Event	2022 International Joint Conference on Neural Networks, IJCNN 2022 - Padua, Italy; Duration: 18 Jul 2022 → 23 Jul 2022

Publication series

Name	Proceedings of the International Joint Conference on Neural Networks
Volume	2022-July

Conference

Conference	2022 International Joint Conference on Neural Networks, IJCNN 2022
Country/Territory	Italy
City	Padua
Period	18/07/22 → 23/07/22

Keywords

contrastive learning
multimodal abstractive summarization
recurrent alignment

ASJC Scopus subject areas

Software
Artificial Intelligence

Cite this

Zhang, Z., Shu, C., Chen, Y., Xiao, J., Zhang, Q., & Zheng, L. (2022). ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization. In 2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings (Proceedings of the International Joint Conference on Neural Networks; Vol. 2022-July). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IJCNN55064.2022.9892884

@inproceedings{e2034511257047d384e35f7644c728e0,
  title = "ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization",
  abstract = "Integrating multimodal knowledge for abstractive summarization task is a work-in-progress research area, with present techniques inheriting fusion-then-generation paradigm. Due to semantic gaps between computer vision and natural language processing, current methods often treat multiple data points as separate objects and rely on attention mechanisms to search for connection in order to fuse together. In addition, missing awareness of cross-modal matching from many frameworks leads to performance reduction. To solve these two drawbacks, we propose an Iterative Contrastive Alignment Framework (ICAF) that uses recurrent alignment and contrast to capture the coherences between images and texts. Specifically, we design a recurrent alignment (RA) layer to gradually investigate fine-grained semantical relationships between image patches and text tokens. At each step during the encoding process, crossmodal contrastive losses are applied to directly optimize the embedding space. According to ROUGE, relevance scores, and human evaluation, our model outperforms the state-of-the-art baselines on MSMO dataset. Experiments on the applicability of our proposed framework and hyperparameters settings have been also conducted.",
  keywords = "contrastive learning, multimodal abstractive summarization, recurrent alignment",
  author = "Zijian Zhang and Chang Shu and Youxin Chen and Jing Xiao and Qian Zhang and Lu Zheng",
  note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 International Joint Conference on Neural Networks, IJCNN 2022 ; Conference date: 18-07-2022 Through 23-07-2022",
  year = "2022",
  doi = "10.1109/IJCNN55064.2022.9892884",
  language = "English",
  series = "Proceedings of the International Joint Conference on Neural Networks",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  booktitle = "2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings",
  address = "United States",
}

Zhang, Z, Shu, C, Chen, Y, Xiao, J, Zhang, Q & Zheng, L 2022, ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization. in 2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings. Proceedings of the International Joint Conference on Neural Networks, vol. 2022-July, Institute of Electrical and Electronics Engineers Inc., 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 18/07/22. https://doi.org/10.1109/IJCNN55064.2022.9892884

ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization. / Zhang, Zijian; Shu, Chang; Chen, Youxin et al.
2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. (Proceedings of the International Joint Conference on Neural Networks; Vol. 2022-July).


TY - GEN
T1 - ICAF
T2 - 2022 International Joint Conference on Neural Networks, IJCNN 2022
AU - Zhang, Zijian
AU - Shu, Chang
AU - Chen, Youxin
AU - Xiao, Jing
AU - Zhang, Qian
AU - Zheng, Lu
PY - 2022
Y1 - 2022
AB - Integrating multimodal knowledge for abstractive summarization task is a work-in-progress research area, with present techniques inheriting fusion-then-generation paradigm. Due to semantic gaps between computer vision and natural language processing, current methods often treat multiple data points as separate objects and rely on attention mechanisms to search for connection in order to fuse together. In addition, missing awareness of cross-modal matching from many frameworks leads to performance reduction. To solve these two drawbacks, we propose an Iterative Contrastive Alignment Framework (ICAF) that uses recurrent alignment and contrast to capture the coherences between images and texts. Specifically, we design a recurrent alignment (RA) layer to gradually investigate fine-grained semantical relationships between image patches and text tokens. At each step during the encoding process, crossmodal contrastive losses are applied to directly optimize the embedding space. According to ROUGE, relevance scores, and human evaluation, our model outperforms the state-of-the-art baselines on MSMO dataset. Experiments on the applicability of our proposed framework and hyperparameters settings have been also conducted.
KW - contrastive learning
KW - multimodal abstractive summarization
KW - recurrent alignment
UR - http://www.scopus.com/inward/record.url?scp=85140731806&partnerID=8YFLogxK
U2 - 10.1109/IJCNN55064.2022.9892884
DO - 10.1109/IJCNN55064.2022.9892884
M3 - Conference contribution
AN - SCOPUS:85140731806
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 July 2022 through 23 July 2022
ER -

Zhang Z, Shu C, Chen Y, Xiao J, Zhang Q, Zheng L. ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization. In 2022 International Joint Conference on Neural Networks, IJCNN 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2022. (Proceedings of the International Joint Conference on Neural Networks). doi: 10.1109/IJCNN55064.2022.9892884
