EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision

Authors

  • Yifei Cao, Dalian University of Technology
  • Yu Liu, Dalian University of Technology
  • Guolong Wang, University of International Business and Economics
  • Zhu Liu, Dalian University of Technology
  • Kai Wang, Harbin Institute of Technology
  • Xianjie Zhang, Academy of Satellite Application Innovation, CASC
  • Jizhe Yu, Dalian University of Technology
  • Xun Tu, Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i4.37251

Abstract

Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM) with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
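
The abstract does not spell out how a 2D localization is back-projected into 3D space. As an illustration only, the following minimal sketch assumes a standard pinhole camera model, with per-pixel depth, intrinsics, and a camera-to-world pose available (for example, as predicted by a geometry model). The function names and parameters are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: pinhole back-projection of a 2D box center.
# All names (box_center, back_project, camera_to_world) are hypothetical.

def box_center(box):
    """Return the center (u, v) of a box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply a camera-to-world pose: X_world = R * X_cam + t (R is 3x3, t is length 3)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


if __name__ == "__main__":
    # Assumed values: a 2D box, its depth in meters, and camera intrinsics.
    u, v = box_center((400.0, 220.0, 440.0, 260.0))
    p_cam = back_project(u, v, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(camera_to_world(p_cam, identity, (0.0, 0.0, 1.0)))
```

In this sketch the box center stands in for the target location; a mask-based variant could instead back-project every pixel inside the segmented contour and aggregate the resulting points.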

Published

2026-03-14

How to Cite

Cao, Y., Liu, Y., Wang, G., Liu, Z., Wang, K., Zhang, X., Yu, J., & Tu, X. (2026). EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2634-2642. https://doi.org/10.1609/aaai.v40i4.37251

Issue

Section

AAAI Technical Track on Computer Vision I