Open-vocabulary object detection with regionwise prompt selection and patch-based category-aware maximal activation

Zhaocheng Xu, Ruili Wang, Yan Tian, Tao Yang

Research output: Journal Publication › Article › peer-review

Abstract

Object detectors perform unseen-class detection by fine-tuning frozen vision-language models with visual prompts and natural language supervision. However, current visual prompt tuning methods struggle to learn categorywise shared knowledge when using only a single visual prompt, because unseen classes (referred to as novel classes) are inaccessible during training. This isolates the unknown novel classes from the base classes. Inspired by recently developed RPN-based open-vocabulary object detection (OVOD) methods, we propose a region-aware visual prompt selection (RVPS) module that adaptively combines region features with their best-matched visual prompts based on decoupled proxy embeddings. Additionally, we introduce a category-aware patchwise maximal aggregation (CPMA) module that explores the relationships among visual patches with respect to the category-specific maximum-activation patches within the target region. We evaluate the proposed approach on two open-vocabulary benchmarks, COCO and LVIS. Compared with other state-of-the-art approaches, our method achieves a 1.2% AP improvement on COCO for novel classes and a 0.5% mask AP improvement on LVIS for rare categories.
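The abstract describes two modules: per-region selection of a best-matched visual prompt via proxy embeddings (RVPS), and aggregation of patch features around category-specific maximum-activation patches (CPMA). The PyTorch sketch below illustrates one plausible reading of those two ideas; all class names, shapes, and the specific matching and pooling rules are assumptions for illustration, not the authors' released implementation.

```python
# A minimal PyTorch sketch of the two ideas described in the abstract.
# All names, shapes, and the selection/aggregation rules are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAwarePromptSelection(nn.Module):
    """Hypothetical RVPS: pick, per region, the visual prompt whose proxy
    embedding best matches the region feature, then fuse the two."""

    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))
        # Decoupled proxy embeddings: used only for matching, not fusion.
        self.proxies = nn.Parameter(torch.randn(num_prompts, dim))

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (R, D) pooled RPN region features.
        sim = F.normalize(region_feats, dim=-1) @ F.normalize(self.proxies, dim=-1).T
        idx = sim.argmax(dim=-1)                 # best-matched prompt per region
        return region_feats + self.prompts[idx]  # simple additive fusion


def category_aware_patch_max_aggregation(
    patch_feats: torch.Tensor, text_embeds: torch.Tensor
) -> torch.Tensor:
    """Hypothetical CPMA: for each category, locate the maximally activated
    patch inside the region, then aggregate all patches weighted by their
    affinity to that anchor patch.

    patch_feats: (P, D) patch tokens within one target region.
    text_embeds: (C, D) category text embeddings.
    Returns: (C, D), one aggregated region embedding per category.
    """
    # Patch-to-category activations, cosine similarity: (P, C).
    act = F.normalize(patch_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    max_idx = act.argmax(dim=0)      # max-activation patch index per category
    anchors = patch_feats[max_idx]   # (C, D) anchor patch per category
    # Relate every patch to each category's anchor, then softmax-pool: (P, C).
    w = F.softmax(
        F.normalize(patch_feats, dim=-1) @ F.normalize(anchors, dim=-1).T, dim=0
    )
    return w.T @ patch_feats         # (C, D)


if __name__ == "__main__":
    rvps = RegionAwarePromptSelection(num_prompts=8, dim=512)
    regions = torch.randn(4, 512)    # 4 candidate regions
    fused = rvps(regions)
    patches = torch.randn(49, 512)   # 7x7 patch grid of one region
    texts = torch.randn(20, 512)     # 20 category embeddings
    out = category_aware_patch_max_aggregation(patches, texts)
    print(fused.shape, out.shape)    # torch.Size([4, 512]) torch.Size([20, 512])
```

A usage note: keeping the matching proxies separate from the fused prompts (the "decoupled" part) lets the selection criterion be trained independently of what is actually injected into the region feature, which is one plausible motivation for the design the abstract names.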

Original language: English
Article number: 880
Journal: Applied Intelligence
Volume: 55
Issue number: 12
DOIs
Publication status: Published - Aug 2025

Keywords

  • Computer vision
  • Deep learning
  • Open-vocabulary object detection
  • Prompt learning

ASJC Scopus subject areas

  • Artificial Intelligence
