Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

Authors

  • Haoyu Zhang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Qiaohui Chu, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Meng Liu, Shandong Jianzhu University; Zhongguancun Academy
  • Haoxiang Shi, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Yaowei Wang, Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory
  • Liqiang Nie, Harbin Institute of Technology (Shenzhen)

DOI:

https://doi.org/10.1609/aaai.v40i15.38244

Abstract

AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. However, current Multimodal Large Language Models (MLLMs) focus primarily on third-person (exocentric) vision, overlooking the unique challenges of first-person (egocentric) videos. Moreover, the high cost of acquiring egocentric data limits dataset size, impairing MLLM performance on such videos. To address these challenges, we propose learning the mapping between the exocentric and egocentric domains, leveraging the extensive exocentric knowledge in existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset of 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D, together with EgoIT, an instruction-tuning dataset collected from multiple sources to strengthen the model's instruction-following capabilities. Building on these datasets, we propose a migration strategy and design a progressive mapping-learning pipeline with three stages: Demonstrator Self-Preparation, Demonstrator-Learner Guidance, and Learner Self-Practice. Extensive experiments across diverse egocentric tasks show that existing MLLMs perform inadequately on egocentric video understanding, while our model significantly outperforms these leading models.

Published

2026-03-14

How to Cite

Zhang, H., Chu, Q., Liu, M., Shi, H., Wang, Y., & Nie, L. (2026). Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12502–12510. https://doi.org/10.1609/aaai.v40i15.38244

Section

AAAI Technical Track on Computer Vision XII