A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Wanting Ji; Ruili Wang

doi:10.1145/3446792

Research output: Journal Publication › Article › peer-review

26 Citations (Scopus)

Abstract

Video captioning is a challenging task in the field of multimedia processing, which aims to generate informative natural language descriptions/captions to describe video contents. Previous video captioning approaches mainly focused on capturing visual information in videos using an encoder-decoder structure to generate video captions. Recently, a new encoder-decoder-reconstructor structure was proposed for video captioning, which captured the information in both videos and captions. Based on this, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) to generate video captions based on the encoder-decoder-reconstructor structure. Specifically, MIMLDL contains two modules: caption generation and video reconstruction modules. The caption generation module utilizes a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels to generate video captions. Then the video reconstruction module synthesizes visual sequences to reproduce raw videos using the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Thus, our approach can minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
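
The abstract does not state the loss functions, so the following Python sketch is only an illustration of how a weakly supervised multi-instance multi-label caption term and a video reconstruction term could be combined in a dual learning objective. The noisy-OR pooling over regions, the mean-squared reconstruction error, the `weight` parameter, and all function names are assumptions made for this example, not details taken from the article.

```python
import math


def noisy_or(region_probs):
    """Bag-level probability that a word applies to the video.

    Assumption: a word is present if at least one region (instance)
    supports it, so region probabilities are pooled with noisy-OR.
    """
    prod = 1.0
    for p in region_probs:
        prod *= 1.0 - p
    return 1.0 - prod


def miml_loss(region_probs_per_word, video_labels, eps=1e-7):
    """Mean binary cross-entropy between pooled word probabilities
    and video-level lexical labels (the only supervision available)."""
    total = 0.0
    for word, region_probs in region_probs_per_word.items():
        p = min(max(noisy_or(region_probs), eps), 1.0 - eps)
        y = 1.0 if word in video_labels else 0.0
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(region_probs_per_word)


def reconstruction_loss(raw_features, rebuilt_features):
    """Mean squared difference between raw and reproduced frame features."""
    total = 0.0
    count = 0
    for raw, rebuilt in zip(raw_features, rebuilt_features):
        for a, b in zip(raw, rebuilt):
            total += (a - b) ** 2
            count += 1
    return total / count


def dual_objective(caption_loss, recon_loss, weight=0.5):
    """Hypothetical joint objective: caption term plus weighted
    reconstruction term, so both modules are tuned together."""
    return caption_loss + weight * recon_loss


if __name__ == "__main__":
    # Toy per-region probabilities for two candidate words.
    probs = {"dog": [0.9, 0.2], "car": [0.1, 0.05]}
    cap = miml_loss(probs, video_labels={"dog"})
    # Toy frame features: raw video vs. video rebuilt from the caption.
    rec = reconstruction_loss([[1.0, 2.0]], [[1.0, 4.0]])
    print(dual_objective(cap, rec))
```

In this sketch a smaller reconstruction error lowers the joint objective, which mirrors the abstract's idea that narrowing the gap between raw and reproduced videos also narrows the semantic gap between videos and captions.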

Original language	English
Article number	72
Journal	ACM Transactions on Multimedia Computing, Communications and Applications
Volume	17
Issue number	2s
DOIs	https://doi.org/10.1145/3446792
Publication status	Published - Jun 2021
Externally published	Yes

Keywords

Deep neural networks
Dual learning
Multimedia processing
Multiple instance learning
Video captioning

ASJC Scopus subject areas

Hardware and Architecture
Computer Networks and Communications

Cite this

@article{e781f80b4ac740339fc6fa14934d26ac,
  title = "A Multi-instance Multi-label Dual Learning Approach for Video Captioning",
  abstract = "Video captioning is a challenging task in the field of multimedia processing, which aims to generate informative natural language descriptions/captions to describe video contents. Previous video captioning approaches mainly focused on capturing visual information in videos using an encoder-decoder structure to generate video captions. Recently, a new encoder-decoder-reconstructor structure was proposed for video captioning, which captured the information in both videos and captions. Based on this, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) to generate video captions based on the encoder-decoder-reconstructor structure. Specifically, MIMLDL contains two modules: caption generation and video reconstruction modules. The caption generation module utilizes a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels to generate video captions. Then the video reconstruction module synthesizes visual sequences to reproduce raw videos using the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Thus, our approach can minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.",
  keywords = "Deep neural networks, Dual learning, Multimedia processing, Multiple instance learning, Video captioning",
  author = "Wanting Ji and Ruili Wang",
  note = "Publisher Copyright: {\textcopyright} 2021 Association for Computing Machinery.",
  year = "2021",
  month = jun,
  doi = "10.1145/3446792",
  language = "English",
  volume = "17",
  journal = "ACM Transactions on Multimedia Computing, Communications and Applications",
  issn = "1551-6857",
  publisher = "Association for Computing Machinery (ACM)",
  number = "2s",
}

TY  - JOUR
T1  - A Multi-instance Multi-label Dual Learning Approach for Video Captioning
AU  - Ji, Wanting
AU  - Wang, Ruili
PY  - 2021/6
Y1  - 2021/6
AB  - Video captioning is a challenging task in the field of multimedia processing, which aims to generate informative natural language descriptions/captions to describe video contents. Previous video captioning approaches mainly focused on capturing visual information in videos using an encoder-decoder structure to generate video captions. Recently, a new encoder-decoder-reconstructor structure was proposed for video captioning, which captured the information in both videos and captions. Based on this, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) to generate video captions based on the encoder-decoder-reconstructor structure. Specifically, MIMLDL contains two modules: caption generation and video reconstruction modules. The caption generation module utilizes a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels to generate video captions. Then the video reconstruction module synthesizes visual sequences to reproduce raw videos using the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Thus, our approach can minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
KW  - Deep neural networks
KW  - Dual learning
KW  - Multimedia processing
KW  - Multiple instance learning
KW  - Video captioning
UR  - http://www.scopus.com/inward/record.url?scp=85108535626&partnerID=8YFLogxK
U2  - 10.1145/3446792
DO  - 10.1145/3446792
M3  - Article
AN  - SCOPUS:85108535626
SN  - 1551-6857
VL  - 17
JO  - ACM Transactions on Multimedia Computing, Communications and Applications
JF  - ACM Transactions on Multimedia Computing, Communications and Applications
IS  - 2s
M1  - 72
ER  - 
