Knowledge enhancement and disentanglement learning for video captioning

Mingyue Wang, Yujun Ma, Biao Cai, Dongfen Li, Xiangjian He, Ruili Wang

Research output: Journal Publication › Article › peer-review

Abstract

Video captioning, bridging computer vision and natural language, is crucial for various knowledge-based systems in the age of video streaming. Recent video captioning approaches have shown promise by integrating additional text-related knowledge to enhance understanding of video content and generate more informative captions. However, methods relying heavily on knowledge graphs face several limitations, including (i) a restricted capacity to reason about complex relations among object words due to static logic rules, (ii) a lack of context awareness for spatio-temporal relation analysis in videos, and (iii) the complexity of manually constructing a knowledge graph. These limitations lead to insufficient semantic information and obstruct effective alignment between the visual and textual modalities. To tackle these issues, we propose a novel knowledge enhancement and disentanglement learning method for video captioning. Our approach introduces a comprehensive and adaptable knowledge source to enrich text-related knowledge, thus directly improving caption generation. Specifically, we leverage a large language model to infer enriched semantic relations between object words and speech transcripts within video frames. By integrating visual, auditory, and textual information into universal tokens with task-specific prompts, our approach enhances semantic understanding and captures more diverse relations. Furthermore, we propose a novel modality-shared disentanglement learning strategy to better align the modalities, enabling more precise linking of visual cues to their corresponding textual descriptions. In particular, we disentangle the two modalities into shared and specific features, leveraging the shared features to ensure alignment while mitigating uncorrelated information. Extensive experiments demonstrate that our proposed method outperforms existing methods in both quantitative and qualitative evaluations.
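The abstract's disentanglement idea can be illustrated with a minimal sketch: each modality is split into a shared part (aligned across modalities) and a specific part (kept separate). The paper's actual encoders and losses are not given here, so the projection matrices, dimensions, and loss terms below are all hypothetical stand-ins, using random projections in place of learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): input feature size,
# shared/specific subspace sizes, and a small batch of paired clips.
d_in, d_shared, d_specific, batch = 16, 8, 8, 4

# Random projections standing in for learned per-modality encoders.
W_vis_shared = rng.normal(size=(d_in, d_shared))
W_vis_spec = rng.normal(size=(d_in, d_specific))
W_txt_shared = rng.normal(size=(d_in, d_shared))
W_txt_spec = rng.normal(size=(d_in, d_specific))

def disentangle(x, W_shared, W_spec):
    """Split one modality's features into shared and specific parts."""
    return x @ W_shared, x @ W_spec

def alignment_loss(a, b):
    """Mean cosine distance between paired shared features
    (small when the two modalities' shared parts agree)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=1)))

def orthogonality_loss(shared, specific):
    """Penalise overlap between the shared and specific subspaces,
    discouraging uncorrelated information from leaking into the
    shared part."""
    return float(np.mean((shared.T @ specific) ** 2))

vis = rng.normal(size=(batch, d_in))  # stand-in visual features
txt = rng.normal(size=(batch, d_in))  # stand-in textual features

vis_sh, vis_sp = disentangle(vis, W_vis_shared, W_vis_spec)
txt_sh, txt_sp = disentangle(txt, W_txt_shared, W_txt_spec)

# Combined objective: align shared features across modalities while
# keeping each modality's shared and specific parts disjoint.
loss = (alignment_loss(vis_sh, txt_sh)
        + orthogonality_loss(vis_sh, vis_sp)
        + orthogonality_loss(txt_sh, txt_sp))
```

In a trained model the projections would be optimised so that the alignment term pulls paired visual and textual shared features together, while the orthogonality terms keep modality-specific detail out of the shared space.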

Original language: English
Article number: 114003
Journal: Knowledge-Based Systems
Volume: 326
DOIs
Publication status: Published - 27 Sept 2025

Keywords

  • Disentanglement learning
  • Knowledge enhancement
  • Large language model
  • Video captioning

ASJC Scopus subject areas

  • Management Information Systems
  • Software
  • Information Systems and Management
  • Artificial Intelligence
