MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection

Authors

  • Qiang Qi, Qingdao University of Science and Technology
  • Wenqi Shang, Qingdao University of Science and Technology
  • Xiao Wang, Qingdao University of Science and Technology
  • Yanjie Liang, Pengcheng Laboratory
  • Shuyuan Lin, Jinan University

DOI:

https://doi.org/10.1609/aaai.v40i10.37798

Abstract

Video object detection is a fundamental yet challenging task in computer vision. Recently, DETR-based methods have gained prominence in this domain owing to their powerful global modeling capabilities. However, these methods are still confronted with two key limitations: frame-agnostic initialization of object queries and scale-agnostic attention mechanisms, which hinder their capability to capture the appearance variations of dynamic objects and model the temporal consistency across frames. To alleviate these limitations, we propose a multiscale-aware transformer diffusion network (MSTDiff), a novel framework designed for the video object detection task, including two technical improvements over existing methods. First, we design a diffusion-driven adaptive query module, which models the object query distribution through a diffusion process conditioned on input frames, enabling an adaptive and content-aware initialization of object queries. Second, we develop a multiscale-aware transformer encoder module, which combines multi-head convolutional units with attention mechanisms to enhance multi-scale feature representations while preserving global dependence modeling. We conduct extensive experiments on the public ImageNet VID dataset, and the results demonstrate that our MSTDiff achieves 87.7% mAP with ResNet-101, outperforming most previous state-of-the-art video object detection methods.
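
As a rough illustration of what a frame-conditioned diffusion process over object queries could look like, the sketch below starts a query vector from Gaussian noise and refines it with deterministic DDIM-style updates. It is not the authors' implementation: the cosine noise schedule, the function names (cosine_alpha_bar, toy_denoiser, sample_query), and the flat list used as a "frame feature" are all assumptions made for this example, and toy_denoiser is a hypothetical stand-in for a learned network.

```python
import math
import random


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) under a cosine noise schedule (assumed here)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def toy_denoiser(x_t, frame_feat):
    """Hypothetical stand-in for a learned denoiser that predicts the clean query.

    A real model would look at both x_t and the frame; this toy version ignores
    x_t and returns the frame feature, which makes the conditioning easy to see.
    """
    return list(frame_feat)


def sample_query(frame_feat, T=10, seed=0):
    """Draw one object query by deterministic DDIM-style sampling (eta = 0)."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise with the same size as the frame feature.
    x = [rng.gauss(0.0, 1.0) for _ in frame_feat]
    for t in range(T, 0, -1):
        ab_t = cosine_alpha_bar(t, T)
        ab_prev = cosine_alpha_bar(t - 1, T)
        # Estimate the clean query, conditioned on the frame.
        x0_hat = toy_denoiser(x, frame_feat)
        # Noise implied by the current sample and the clean estimate.
        eps = [
            (xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
            for xi, x0i in zip(x, x0_hat)
        ]
        # Move to the previous, less noisy step.
        x = [
            math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * e
            for x0i, e in zip(x0_hat, eps)
        ]
    return x


if __name__ == "__main__":
    frame_feat = [0.5, -1.0, 2.0]  # made-up frame descriptor
    print(sample_query(frame_feat))
```

Because the toy denoiser always returns the frame feature, the sampled query ends at that feature regardless of the initial noise. With a learned denoiser, the same loop would instead yield queries that vary with both the noise seed and the frame content, which is the sense in which the initialization becomes adaptive rather than frame-agnostic.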

Published

2026-03-14

How to Cite

Qi, Q., Shang, W., Wang, X., Liang, Y., & Lin, S. (2026). MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8475–8483. https://doi.org/10.1609/aaai.v40i10.37798

Section

AAAI Technical Track on Computer Vision VII