MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection

Qiang Qi; Wenqi Shang; Xiao Wang; Yanjie Liang; Shuyuan Lin

doi:10.1609/aaai.v40i10.37798

Authors

Qiang Qi Qingdao University of Science and Technology
Wenqi Shang Qingdao University of Science and Technology
Xiao Wang Qingdao University of Science and Technology
Yanjie Liang Pengcheng Laboratory
Shuyuan Lin Jinan University


Abstract

Video object detection is a fundamental yet challenging task in computer vision. Recently, DETR-based methods have gained prominence in this domain owing to their powerful global modeling capabilities. However, these methods still face two key limitations: frame-agnostic initialization of object queries and scale-agnostic attention mechanisms. Together, these hinder their ability to capture the appearance variations of dynamic objects and to model temporal consistency across frames. To alleviate these limitations, we propose a multiscale-aware transformer diffusion network (MSTDiff), a novel framework for video object detection that introduces two technical improvements over existing methods. First, we design a diffusion-driven adaptive query module, which models the object query distribution through a diffusion process conditioned on input frames, enabling adaptive, content-aware initialization of object queries. Second, we develop a multiscale-aware transformer encoder module, which combines multi-head convolutional units with attention mechanisms to enhance multi-scale feature representations while preserving global dependency modeling. We conduct extensive experiments on the public ImageNet VID dataset, and the results demonstrate that MSTDiff achieves 87.7% mAP with a ResNet-101 backbone, outperforming most previous state-of-the-art video object detection methods.
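The abstract does not specify how the diffusion-driven adaptive query module is built, so the following is only a rough illustration of the general idea of initializing object queries through a frame-conditioned diffusion process. It shows a generic DDPM forward noising step and a DDIM-style deterministic reverse sampler that turns Gaussian noise into query vectors. The linear beta schedule, the step counts, the use of plain Python lists as vectors, and every function name (`init_queries`, `toy_denoiser`, and so on) are assumptions. In particular, `toy_denoiser` is a hypothetical oracle standing in for a learned, frame-conditioned network, and it is not MSTDiff's actual design.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default; ASSUMED, not taken from MSTDiff). Needs num_steps >= 2."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Forward (training-time) noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = abar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def toy_denoiser(x_t, t, frame_feat):
    """HYPOTHETICAL stand-in for a learned, frame-conditioned network that predicts the clean query x0.
    This oracle ignores x_t and t and simply returns the frame descriptor."""
    return list(frame_feat)


def init_queries(frame_feat, num_queries, num_steps=50, sample_steps=5, seed=0):
    """DDIM-style deterministic sampling of object queries conditioned on a frame descriptor.
    Needs sample_steps >= 2."""
    rng = random.Random(seed)
    abar = alpha_bars(linear_beta_schedule(num_steps))
    # Evenly spaced integer timesteps from num_steps - 1 down to 0.
    ts = [(num_steps - 1) * (sample_steps - 1 - i) // (sample_steps - 1) for i in range(sample_steps)]
    queries = []
    for _ in range(num_queries):
        x = [rng.gauss(0.0, 1.0) for _ in frame_feat]  # start from pure Gaussian noise
        for k, t in enumerate(ts):
            x0_pred = toy_denoiser(x, t, frame_feat)
            if k == len(ts) - 1:
                x = x0_pred  # final step: keep the clean prediction
                break
            a_t, a_prev = abar[t], abar[ts[k + 1]]
            # Recover the noise implied by x_t and x0_pred, then step to the earlier timestep.
            eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t) for xi, x0i in zip(x, x0_pred)]
            x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * e for x0i, e in zip(x0_pred, eps)]
        queries.append(x)
    return queries


frame_descriptor = [0.5, -1.0, 2.0]  # hypothetical pooled feature of one video frame
print(init_queries(frame_descriptor, num_queries=2))
# -> [[0.5, -1.0, 2.0], [0.5, -1.0, 2.0]]
```

Because the oracle denoiser always returns the frame descriptor, every query collapses onto it. This only traces the data flow. With a learned denoiser, the prediction would also depend on the noisy query and the timestep, so different noise samples would yield distinct queries that still adapt to each frame's content.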
