From Discriminative to Generative: A Diffusion-Based Paradigm for Multi-Agent Collaborative Perception
DOI:
https://doi.org/10.1609/aaai.v40i6.42423
Abstract
Collaborative perception leveraging intermediate feature fusion has emerged as a leading paradigm to significantly enhance the environmental perception capabilities of autonomous driving systems. However, existing methods typically rely on discriminative supervision guided by downstream tasks. This paradigm compels models to learn minimal, task-specific representations, which conflicts with the goal of cooperative perception to capture comprehensive information, thereby limiting generalization. To address this issue, we propose DiGS-CP, a novel two-stage generative supervised collaborative perception framework. Specifically, we introduce a diffusion-based generative task that conditions on fused object-level features to generate representations of object-level point clouds. The proposed generative supervision provides fine-grained, task-agnostic signals that encourage the fusion module to learn comprehensive representations beyond task-specific requirements. By preserving and integrating complementary information from collaborative agents, our approach overcomes the limitations of task-specific learning and enhances the generalizability of the learned features. Furthermore, our two-stage architecture requires agents to transmit only object-level features, significantly reducing communication overhead. Extensive experiments on three benchmark datasets demonstrate that DiGS-CP achieves state-of-the-art performance in 3D object detection, while maintaining low bandwidth requirements and exhibiting excellent generalization ability.
Published
2026-03-14
How to Cite
Gong, K., Yao, P., Luo, G., Yuan, Q., Fu, T., Zhang, H., & Li, J. (2026). From Discriminative to Generative: A Diffusion-Based Paradigm for Multi-Agent Collaborative Perception. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4266–4274. https://doi.org/10.1609/aaai.v40i6.42423
Issue
Section
AAAI Technical Track on Computer Vision III