Abstract
A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide a systematic analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition → step recognition → action and instrument detection] as multi-level semantic scene understanding (MSSU). To this end, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across tasks at different levels. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features by absorbing complementary information from other tasks. Furthermore, considering the computational cost of the transformer, we propose HCT+, which integrates spatial and temporal adapters to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and the publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, which consistently exceeds state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
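The HCT+ variant reduces tunable parameters by freezing the transformer backbone and training only lightweight adapters. As a rough illustration of that idea, below is a minimal residual bottleneck-adapter sketch in PyTorch; the class name, dimensions, and placement are assumptions made for illustration, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch of a bottleneck adapter for parameter-efficient tuning,
# in the spirit of the spatial/temporal adapters in HCT+. All names and
# shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # small number of tunable weights
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)          # start as an identity mapping,
        nn.init.zeros_(self.up.bias)            # so the frozen model is unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: insert after a frozen transformer block's output; only the adapters
# (and task heads) are trained, keeping the tunable parameter count small.
tokens = torch.randn(2, 196, 768)  # (batch, tokens, dim), hypothetical shape
adapter = Adapter(dim=768)
print(adapter(tokens).shape)  # torch.Size([2, 196, 768])
```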
| Original language | English |
| --- | --- |
| Pages (from-to) | 3134-3145 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 2025 |
| Externally published | Yes |
Keywords
- inter-task contrastive learning
- multi-level semantic
- spatial-temporal adapter
- surgical scene understanding
- transformer
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering