A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Wanting Ji, Ruili Wang

Research output: Journal Publication › Article › peer-review

20 Citations (Scopus)

Abstract

Video captioning is a challenging task in multimedia processing that aims to generate informative natural language descriptions (captions) of video content. Previous video captioning approaches mainly focused on capturing visual information in videos, using an encoder-decoder structure to generate captions. Recently, an encoder-decoder-reconstructor structure was proposed for video captioning, which captures the information in both videos and captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) for generating video captions. Specifically, MIMLDL contains two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels, from which it generates video captions. The video reconstruction module then synthesizes visual sequences to reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the reproduced videos. Thus, our approach reduces the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
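
The two training signals described in the abstract can be illustrated with a minimal sketch. The abstract does not state how region-level predictions are aggregated or how the reconstruction gap is measured, so everything below is an assumption made for illustration: noisy-OR pooling stands in for the multi-instance multi-label aggregation, mean squared error stands in for the gap between raw and reproduced visual sequences, and the function names and the `weight` parameter are hypothetical rather than taken from the article.

```python
import math


def noisy_or(region_probs):
    """Video-level probability that a lexical label is present.

    Assumption: noisy-OR pooling over per-region probabilities, a common
    choice in multi-instance learning (not confirmed by the abstract).
    """
    prob_absent_everywhere = 1.0
    for p in region_probs:
        prob_absent_everywhere *= 1.0 - p
    return 1.0 - prob_absent_everywhere


def miml_loss(region_probs_per_label, caption_labels, eps=1e-12):
    """Binary cross-entropy over video-level label predictions.

    region_probs_per_label: dict mapping a label to its per-region probabilities.
    caption_labels: set of labels that occur in the reference caption; this is
    the only (weak) supervision, since no region is annotated individually.
    """
    total = 0.0
    for label, probs in region_probs_per_label.items():
        p = noisy_or(probs)
        y = 1.0 if label in caption_labels else 0.0
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(region_probs_per_label)


def reconstruction_loss(raw_frames, reproduced_frames):
    """Mean squared difference between raw and reproduced frame features."""
    squared_sum, count = 0.0, 0
    for raw, reproduced in zip(raw_frames, reproduced_frames):
        for a, b in zip(raw, reproduced):
            squared_sum += (a - b) ** 2
            count += 1
    return squared_sum / count


def dual_objective(caption_loss, recon_loss, weight=0.5):
    """Hypothetical joint objective: caption loss plus weighted reconstruction gap."""
    return caption_loss + weight * recon_loss
```

In this sketch, a label that appears in the caption is rewarded as soon as any single region predicts it confidently, which is what lets region-to-word mappings be learned from caption-level supervision alone. The reconstruction term then penalizes captions from which the visual sequence cannot be recovered.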

Original language: English
Article number: 72
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 17
Issue number: 2s
Publication status: Published - Jun 2021
Externally published: Yes

Keywords

  • Deep neural networks
  • Dual learning
  • Multimedia processing
  • Multiple instance learning
  • Video captioning

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
