Abstract
Video captioning, bridging computer vision and natural language, is crucial for various knowledge-based systems in the age of video streaming. Recent video captioning approaches have shown promise by integrating additional text-related knowledge to enhance understanding of video content and generate more informative captions. However, methods relying heavily on knowledge graphs face several limitations, including (i) a restricted capacity to reason about complex relations among object words due to static logic rules, (ii) a lack of context awareness for spatio-temporal relation analysis in videos, and (iii) the complexity of manually constructing a knowledge graph. These limitations lead to insufficient semantic information and obstruct effective alignment between visual and textual modalities. To tackle these issues, we propose a novel knowledge enhancement and disentanglement learning method for video captioning. Our approach introduces a comprehensive and adaptable knowledge source to enhance text-related knowledge, thus directly improving caption generation. Specifically, we leverage a large language model to infer enriched semantic relations between object words and speech transcripts within video frames. By integrating visual, auditory, and textual information into universal tokens with task-specific prompts, our approach enhances semantic understanding and captures more diverse relations. Furthermore, we propose a novel modality-shared disentanglement learning strategy to better align modalities, enabling visual cues to be linked more precisely to their corresponding textual descriptions. Specifically, we disentangle the two modalities into shared and specific features, leveraging shared features to ensure alignment while mitigating uncorrelated information. Extensive experiments demonstrate that our proposed method outperforms existing methods in both quantitative and qualitative evaluations.
| Original language | English |
|---|---|
| Article number | 114003 |
| Journal | Knowledge-Based Systems |
| Volume | 326 |
| Publication status | Published - 27 Sept 2025 |
Keywords
- Disentanglement learning
- Knowledge enhancement
- Large language model
- Video captioning
ASJC Scopus subject areas
- Management Information Systems
- Software
- Information Systems and Management
- Artificial Intelligence