FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Authors

  • Yuchen Wu, School of Computer Science and Engineering, State Key Laboratory of Complex Critical Software Environment, Jiangxi Research Institute, Beihang University
  • Jiahe Li, School of Computer Science and Engineering, State Key Laboratory of Complex Critical Software Environment, Jiangxi Research Institute, Beihang University
  • Fabio Tosi, University of Bologna
  • Matteo Poggi, University of Bologna
  • Jin Zheng, School of Computer Science and Engineering, State Key Laboratory of Complex Critical Software Environment, Jiangxi Research Institute, Beihang University; State Key Laboratory of Virtual Reality Technology and Systems, Beijing
  • Xiao Bai, School of Computer Science and Engineering, State Key Laboratory of Complex Critical Software Environment, Jiangxi Research Institute, Beihang University

DOI:

https://doi.org/10.1609/aaai.v40i13.38061

Abstract

We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the lack of geometric consistency in previous flow-based approaches to achieve accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging guidance from depth foundation models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe poses and depths under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments show that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets while running in real time at 18 FPS, demonstrating strong generalization across diverse scenarios and practical applicability.

Published

2026-03-14

How to Cite

Wu, Y., Li, J., Tosi, F., Poggi, M., Zheng, J., & Bai, X. (2026). FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10853-10861. https://doi.org/10.1609/aaai.v40i13.38061

Section

AAAI Technical Track on Computer Vision X