MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection
DOI:
https://doi.org/10.1609/aaai.v40i10.37798
Abstract
Video object detection is a fundamental yet challenging task in computer vision. Recently, DETR-based methods have gained prominence in this domain owing to their powerful global modeling capabilities. However, these methods still face two key limitations: frame-agnostic initialization of object queries and scale-agnostic attention mechanisms, which hinder their ability to capture the appearance variations of dynamic objects and to model temporal consistency across frames. To alleviate these limitations, we propose a multiscale-aware transformer diffusion network (MSTDiff), a novel framework for video object detection that introduces two technical improvements over existing methods. First, we design a diffusion-driven adaptive query module, which models the object query distribution through a diffusion process conditioned on input frames, enabling adaptive, content-aware initialization of object queries. Second, we develop a multiscale-aware transformer encoder module, which combines multi-head convolutional units with attention mechanisms to enhance multi-scale feature representations while preserving global dependency modeling. We conduct extensive experiments on the public ImageNet VID dataset, and the results demonstrate that MSTDiff achieves 87.7% mAP with ResNet-101, outperforming most previous state-of-the-art video object detection methods.
Published
2026-03-14
How to Cite
Qi, Q., Shang, W., Wang, X., Liang, Y., & Lin, S. (2026). MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8475–8483. https://doi.org/10.1609/aaai.v40i10.37798
Section
AAAI Technical Track on Computer Vision VII