Abstract
Object detectors can detect unseen classes by fine-tuning frozen vision-language models with visual prompts and natural language supervision. However, current visual prompt tuning methods struggle to learn category-wise shared knowledge when using only a single visual prompt, because unseen classes (referred to as novel classes) are inaccessible during training. This isolates the unknown novel classes from the base classes. Inspired by recently developed RPN-based open-vocabulary object detection (OVOD) methods, we propose a region-aware visual prompt selection (RVPS) module that adaptively combines region features with their best-matched visual prompts based on decoupled proxy embeddings. Additionally, we introduce a category-aware patch-wise maximal aggregation (CPMA) module that explores the relationships among visual patches with respect to the category-specific maximum-activation patches within the target region. We evaluate the proposed approach on two open-vocabulary benchmarks, COCO and LVIS. Compared with other state-of-the-art approaches, our method achieves a 1.2% AP improvement on the novel classes of COCO and a 0.5% mask AP improvement on the rare categories of LVIS.
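The abstract does not include implementation details, so the following is only a rough illustrative sketch of the two ideas it describes: matching region features against a pool of visual prompts via proxy embeddings (RVPS), and aggregating per-category maximum-activation patches (CPMA). All function names, shapes, and the similarity/softmax choices here are assumptions, not the paper's actual method.

```python
import numpy as np

def rvps_select(region_feats, prompt_pool, proxy_embeds):
    """Illustrative prompt selection: each region is softly combined with
    prompts weighted by cosine similarity to decoupled proxy embeddings.

    region_feats:  (num_regions, dim) RoI features
    prompt_pool:   (num_prompts, dim) learnable visual prompts (hypothetical)
    proxy_embeds:  (num_prompts, dim) proxy embeddings, one per prompt
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = proxy_embeds / np.linalg.norm(proxy_embeds, axis=1, keepdims=True)
    sim = r @ p.T                                   # (num_regions, num_prompts)
    best = sim.argmax(axis=1)                       # index of best-matched prompt
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # softmax weights
    combined = region_feats + w @ prompt_pool       # prompt-augmented features
    return combined, best

def cpma_aggregate(patch_feats, class_embeds):
    """Illustrative patch-wise maximal aggregation: for each category, find
    the maximum-activation patch inside the target region.

    patch_feats:  (num_patches, dim) patch tokens within one region
    class_embeds: (num_classes, dim) text/category embeddings
    """
    act = patch_feats @ class_embeds.T              # (num_patches, num_classes)
    max_idx = act.argmax(axis=0)                    # max-activation patch per class
    return patch_feats[max_idx], act.max(axis=0)    # (num_classes, dim), (num_classes,)
```

These sketches only convey the data flow (region-to-prompt matching, then per-category max pooling over patches); the actual modules are trained end-to-end on top of a frozen vision-language backbone.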
| Original language | English |
|---|---|
| Article number | 880 |
| Journal | Applied Intelligence |
| Volume | 55 |
| Issue number | 12 |
| DOIs | |
| Publication status | Published - Aug 2025 |
Keywords
- Computer vision
- Deep learning
- Open-vocabulary object detection
- Prompt learning
ASJC Scopus subject areas
- Artificial Intelligence