Hierarchical Context Transformer for Multi-Level Semantic Scene Understanding

Luoying Hao, Yan Hu, Yang Yue, Li Wu, Huazhu Fu, Jinming Duan, Jiang Liu

Research output: Journal Publication, Article, peer-review

Abstract

A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide a systematic analysis that enables hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition → step recognition → action and instrument detection] as multi-level semantic scene understanding (MSSU). To this end, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the tasks at different levels. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features by absorbing complementary information from other tasks. Furthermore, considering the computational cost of the transformer, we propose HCT+, which integrates spatial and temporal adapters to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and the publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
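
As a rough illustration of the contrastive idea mentioned in the abstract, the Python sketch below computes an InfoNCE-style loss between feature vectors produced for two tasks (for example, phase and step recognition). It is not taken from the authors' repository. The function name `info_nce`, the temperature value, and the assumption that row i of both inputs comes from the same video clip are all hypothetical, and the paper's actual ICL formulation may differ.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor_feats, other_feats, temperature=0.1):
    """Mean InfoNCE loss between two lists of feature vectors.

    Assumption (hypothetical, not from the paper): anchor_feats[i] and
    other_feats[i] describe the same clip and form the positive pair;
    every other row of other_feats acts as a negative.
    """
    a = [l2_normalize(v) for v in anchor_feats]
    b = [l2_normalize(v) for v in other_feats]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [dot(a_i, b_j) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denom - logits[i]
    return total / len(a)


# Toy features: two clips, two-dimensional embeddings per task.
phase_feats = [[1.0, 0.0], [0.0, 1.0]]
step_aligned = [[0.9, 0.1], [0.1, 0.9]]
step_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(phase_feats, step_aligned)
loss_swapped = info_nce(phase_feats, step_swapped)
```

In this toy setup the loss is small when features of the two tasks agree for the same clip and large when they are swapped, which is the behaviour a contrastive objective of this kind is meant to encourage.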

Original language: English
Pages (from-to): 3134-3145
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 4
Publication status: Published - 2025
Externally published: Yes

Keywords

  • inter-task contrastive learning
  • multi-level semantic
  • spatial-temporal adapter
  • surgical scene understanding
  • transformer

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering
