MDNet: Multi-Modal Cooperative Perception via Spatial Alignment of Modal Decision-Making

Junyang He; Xiaoheng Deng; Jinsong Gui; Tao Zhang; Xiangjian He

doi:10.1109/JIOT.2025.3531145

School of Computer Science

Research output: Journal Publication › Article › peer-review

1 Citation (Scopus)

Abstract

Through Internet of Things (IoT) communication technology, collaborative perception enhances a vehicle's capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. As single-modality cooperative perception techniques have advanced, recent years have seen a growing trend toward integrating multi-modal data from heterogeneous sensors. However, due to the data heterogeneity inherent in diverse sensors, Bird's Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in noisy real-world environments. This uncertainty exacerbates the local inconsistency, causing detected targets to be misaligned during BEV alignment and fusion and thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we improve object detection accuracy by spatially aligning the BEV representations generated from two distinct sensors, both globally and locally. In addition, we employ a cascaded centralized pyramid strategy during the message fusion stage, which allows flexible sampling across horizontal and vertical spatial dimensions and promotes deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, where the proposed MDNet exhibits superior performance and stronger robustness in the 3D object detection task, providing more precise target detection results.

Original language	English
Journal	IEEE Internet of Things Journal
DOIs	https://doi.org/10.1109/JIOT.2025.3531145
Publication status	Accepted/In press - 2025

Keywords

3D Object Detection
Cooperative Perception
Internet of Things (IOT)
Multi-Agent Perception
Multi-Modal Fusion
Vehicle-to-Vehicle Application

ASJC Scopus subject areas

Signal Processing
Information Systems
Hardware and Architecture
Computer Science Applications
Computer Networks and Communications

Cite this

@article{97f9987e94d94aa299dcce931a07888f,

title = "MDNet: Multi-Modal Cooperative Perception via Spatial Alignment of Modal Decision-Making",

abstract = "Through internet of things (IOT) communication technology, collaborative perception enhances a vehicle's capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. With the advancement of cooperative perception techniques in single-modality methods, there has been a growing trend toward integrating multi-modal data from heterogeneous sensors in recent years. However, due to the data heterogeneity inherent in diverse sensors, Bird's Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in real noisy environments. The influence of this indeterminacy exacerbates the issue of local inconsistency, leading to misalignment of the detected target during BEV alignment and fusion, thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we achieve enhanced accuracy in object detection by performing spatial alignment of BEV representations generated from two distinct sensors, both globally and locally within the spatial domain. Besides, we employ a cascaded centralized pyramid strategy during the message fusion stage, facilitating flexible sampling across horizontal and vertical spatial dimensions, promoting deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, and our proposed MDNet exhibits superior performance and stronger robustness in the 3D object detection task, providing more precise target detection results.",

keywords = "3D Object Detection, Cooperative Perception, Internet of Things (IOT), Multi-Agent Perception, Multi-Modal Fusion, Vehicle-to-Vehicle Application",

author = "Junyang He and Xiaoheng Deng and Jinsong Gui and Tao Zhang and Xiangjian He",

note = "Publisher Copyright: {\textcopyright} 2014 IEEE.",

year = "2025",

doi = "10.1109/JIOT.2025.3531145",

language = "English",

journal = "IEEE Internet of Things Journal",

issn = "2327-4662",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - MDNet

T2 - Multi-Modal Cooperative Perception via Spatial Alignment of Modal Decision-Making

AU - He, Junyang

AU - Deng, Xiaoheng

AU - Gui, Jinsong

AU - Zhang, Tao

AU - He, Xiangjian

PY - 2025

Y1 - 2025

N2 - Through internet of things (IOT) communication technology, collaborative perception enhances a vehicle's capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. With the advancement of cooperative perception techniques in single-modality methods, there has been a growing trend toward integrating multi-modal data from heterogeneous sensors in recent years. However, due to the data heterogeneity inherent in diverse sensors, Bird's Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in real noisy environments. The influence of this indeterminacy exacerbates the issue of local inconsistency, leading to misalignment of the detected target during BEV alignment and fusion, thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we achieve enhanced accuracy in object detection by performing spatial alignment of BEV representations generated from two distinct sensors, both globally and locally within the spatial domain. Besides, we employ a cascaded centralized pyramid strategy during the message fusion stage, facilitating flexible sampling across horizontal and vertical spatial dimensions, promoting deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, and our proposed MDNet exhibits superior performance and stronger robustness in the 3D object detection task, providing more precise target detection results.

AB - Through internet of things (IOT) communication technology, collaborative perception enhances a vehicle's capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. With the advancement of cooperative perception techniques in single-modality methods, there has been a growing trend toward integrating multi-modal data from heterogeneous sensors in recent years. However, due to the data heterogeneity inherent in diverse sensors, Bird's Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in real noisy environments. The influence of this indeterminacy exacerbates the issue of local inconsistency, leading to misalignment of the detected target during BEV alignment and fusion, thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we achieve enhanced accuracy in object detection by performing spatial alignment of BEV representations generated from two distinct sensors, both globally and locally within the spatial domain. Besides, we employ a cascaded centralized pyramid strategy during the message fusion stage, facilitating flexible sampling across horizontal and vertical spatial dimensions, promoting deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, and our proposed MDNet exhibits superior performance and stronger robustness in the 3D object detection task, providing more precise target detection results.

KW - 3D Object Detection

KW - Cooperative Perception

KW - Internet of Things (IOT)

KW - Multi-Agent Perception

KW - Multi-Modal Fusion

KW - Vehicle-to-Vehicle Application

UR - http://www.scopus.com/inward/record.url?scp=85215586441&partnerID=8YFLogxK

U2 - 10.1109/JIOT.2025.3531145

DO - 10.1109/JIOT.2025.3531145

M3 - Article

AN - SCOPUS:85215586441

SN - 2327-4662

JO - IEEE Internet of Things Journal

JF - IEEE Internet of Things Journal

ER -
