An attention based dual learning approach for video captioning

Wanting Ji, Ruili Wang, Yan Tian, Xun Wang

Research output: Journal Publication › Article › peer-review

24 Citations (Scopus)

Abstract

Video captioning aims to generate sentences/captions that describe video content. It is one of the key tasks in the field of multimedia processing. However, most current video captioning approaches use only the visual information of a video to generate captions. Recently, a new encoder–decoder–reconstructor architecture was developed for video captioning, which can capture the information in both raw videos and the generated captions through dual learning. Based on this architecture, this paper proposes a novel attention-based dual learning approach (ADL) for video captioning. Specifically, ADL is composed of a caption generation module and a video reconstruction module. The caption generation module builds a translatable mapping between raw video frames and the generated video captions, i.e., it uses the visual features extracted from videos by an Inception-V4 network to produce video captions. The video reconstruction module then reproduces the raw video frames from the generated captions, i.e., it uses the hidden states of the decoder in the caption generation module to reproduce/synthesize the raw visual features. A multi-head attention mechanism helps the two modules focus on the most informative parts of the videos and captions, and a dual learning mechanism jointly fine-tunes the two modules to generate the final video captions. ADL thus narrows the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw videos, thereby improving the quality of the generated captions. Experimental results demonstrate that ADL outperforms state-of-the-art video captioning approaches on benchmark datasets.
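The abstract names two core ingredients: an attention mechanism that weights the most informative features, and a dual-learning objective that adds a video-reconstruction penalty to the captioning loss. The toy sketch below illustrates both ideas in isolation; it is not the paper's implementation — the function names, dimensions, and the trade-off weight `lam` are all illustrative assumptions, and the real ADL model uses deep multi-head attention over Inception-V4 features.

```python
# Illustrative sketch only: single-head scaled dot-product attention and
# a joint "dual" objective (captioning loss + weighted reconstruction
# error), mirroring the two mechanisms described in the abstract.
# All names and values are hypothetical, not taken from the paper.
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query: list[float] of size d
    keys, values: one vector per time step (e.g., per video frame)
    Returns the attention-weighted sum of the value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def dual_loss(caption_nll, raw_feats, recon_feats, lam=0.2):
    """Joint objective: captioning loss plus a weighted mean-squared
    error between raw and reproduced visual features. Penalizing
    reconstruction error is how the architecture shrinks the semantic
    gap between video and caption; `lam` is a made-up trade-off weight.
    """
    mse = sum((r - p) ** 2
              for r, p in zip(raw_feats, recon_feats)) / len(raw_feats)
    return caption_nll + lam * mse
```

For example, a query aligned with the first of two keys puts more attention weight on the first value vector, and a perfect reconstruction (`raw_feats == recon_feats`) reduces `dual_loss` to the captioning loss alone.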

Original language: English
Article number: 108332
Journal: Applied Soft Computing Journal
Volume: 117
DOIs
Publication status: Published - Mar 2022
Externally published: Yes

Keywords

  • Attention mechanism
  • Deep neural network
  • Dual learning
  • Encoder–decoder
  • Video captioning

ASJC Scopus subject areas

  • Software
