AerialFusion: Co-Motion-Driven Unified Registration and Fusion on Multi-modal Data Streams from Aerial View

Authors

  • Junhui Qiu, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
  • Xiang Xiang, School of Computer Science and Technology, Huazhong University of Science and Technology; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
  • Hongyun Wang, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
  • Jiaqi Gui, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i10.37810

Abstract

Registration and fusion of aerial multi-modal visual streams can produce more comprehensive scene representations for UAV cross-modal perception. However, two challenges remain: jointly learning spatiotemporal representations from dynamic backgrounds and moving targets is inherently difficult, and large-scale, well-annotated multi-modal visual stream benchmarks for UAV platforms are critically scarce. In this paper, we propose AerialFusion, a co-motion-driven unified framework for registering and fusing UAV visual streams that mines modality-invariant common features through motion awareness, enabling spatiotemporally coherent registration and fusion. Specifically, AerialFusion comprises 1) a Skewed Motion Distribution Field Co-Motion-Driven Image Registration, 2) a Co-Motion Generative Fusion, and 3) a Streams-based Unified Learning. Furthermore, we introduce EUM3D, a registration and fusion benchmark for UAV cross-modal perception. The benchmark contains 60 synchronized visible-infrared visual streams, totaling 122k spatially and temporally aligned pairs, most of which were captured in low-light scenes. EUM3D also provides pixel-level alignment guarantees via perspective-transform ground truth. Extensive experiments show that AerialFusion surpasses existing fusion methods, which focus on single images and static backgrounds, in aerial sequence scenarios, addressing spatiotemporal mismatches while suppressing cross-modal interference.
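
As a rough illustration of what perspective-transform ground truth makes possible, the sketch below maps pixel coordinates through a 3x3 perspective transform (homography) and scores an estimated transform against a reference one by mean corner displacement. This is not the authors' implementation or the EUM3D evaluation protocol, neither of which the abstract describes; the function names `apply_homography` and `mean_corner_error`, the frame size, and the example matrices are all hypothetical.

```python
import numpy as np


def apply_homography(H, points):
    """Map (N, 2) pixel coordinates through a 3x3 perspective transform H."""
    points = np.asarray(points, dtype=float)
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.hstack([points, ones])   # lift to homogeneous coords, shape (N, 3)
    mapped = homogeneous @ np.asarray(H, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]     # divide by the projective scale w


def mean_corner_error(H_est, H_gt, width, height):
    """Mean Euclidean distance (in pixels) between the four image corners
    warped by an estimated transform and by a reference transform.
    Hypothetical metric for illustration only."""
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=float,
    )
    diff = apply_homography(H_est, corners) - apply_homography(H_gt, corners)
    return float(np.linalg.norm(diff, axis=1).mean())


if __name__ == "__main__":
    # Assumed example: the reference is the identity and the estimate is off
    # by a pure translation of (3, 4) pixels.
    H_gt = np.eye(3)
    H_est = np.array([[1.0, 0.0, 3.0],
                      [0.0, 1.0, 4.0],
                      [0.0, 0.0, 1.0]])
    print(mean_corner_error(H_est, H_gt, width=640, height=512))
```

With a ground-truth transform available per frame pair, a score of this kind can be computed directly for any registration method, without hand-labelled correspondences.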

Published

2026-03-14

How to Cite

Qiu, J., Xiang, X., Wang, H., & Gui, J. (2026). AerialFusion: Co-Motion-Driven Unified Registration and Fusion on Multi-modal Data Streams from Aerial View. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8583–8591. https://doi.org/10.1609/aaai.v40i10.37810

Section

AAAI Technical Track on Computer Vision VII