Monocular Camera-Based Point-Goal Navigation by Learning Depth Channel and Cross-Modality Pyramid Fusion

Authors

  • Tianqi Tang, University of Technology Sydney
  • Heming Du, Australian National University
  • Xin Yu, University of Technology Sydney
  • Yi Yang, University of Technology Sydney

DOI:

https://doi.org/10.1609/aaai.v36i5.20480

Keywords:

Intelligent Robotics (ROB)

Abstract

For a monocular camera-based navigation system, effectively exploiting scene geometric cues from RGB images can significantly improve navigation efficiency. Motivated by this, we propose a highly efficient point-goal navigation framework, dubbed Geo-Nav. In a nutshell, Geo-Nav consists of two parts: a visual perception part and a navigation part. In the visual perception part, we first propose a Self-supervised Depth Estimation network (SDE) tailored to a monocular camera-based navigation agent. Our SDE learns a mapping from an RGB input image to its corresponding depth image by exploiting scene geometric constraints in a self-consistent manner. Then, to obtain a representative visual representation from the RGB inputs and learned depth images, we propose a Cross-modality Pyramid Fusion module (CPF). Concretely, our CPF computes a patch-wise cross-modality correlation between the features of different modalities and exploits this correlation to fuse and enhance features at each scale. Thanks to the patch-wise nature of CPF, we can fuse feature maps at high resolution, allowing our visual network to perceive finer image details. In the navigation part, the extracted visual representations are fed to a navigation policy network that learns to map them to agent actions effectively. Extensive experiments on the widely used multi-room environment Gibson demonstrate that Geo-Nav outperforms the state of the art in terms of both efficiency and effectiveness.
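To give a concrete feel for the patch-wise cross-modality fusion the abstract describes, here is a minimal NumPy sketch. It is an illustrative assumption, not the authors' actual CPF implementation: the patch size, cosine-correlation measure, and linear gating are all hypothetical choices standing in for the learned fusion in the paper.

```python
import numpy as np

def patchwise_fusion(f_rgb, f_depth, patch=4):
    """Hypothetical sketch of patch-wise cross-modality fusion.

    f_rgb, f_depth: (C, H, W) feature maps at the same pyramid scale.
    For each non-overlapping patch x patch window, compute the cosine
    correlation between the two modalities' flattened patch features,
    then use it to gate how much depth evidence is mixed into the RGB
    features (high agreement -> lean more on depth).
    """
    C, H, W = f_rgb.shape
    fused = f_rgb.copy()
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            a = f_rgb[:, y:y + patch, x:x + patch].ravel()
            b = f_depth[:, y:y + patch, x:x + patch].ravel()
            denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
            corr = float(a @ b) / denom        # cosine similarity in [-1, 1]
            w = 0.5 * (corr + 1.0)             # map to a fusion weight in [0, 1]
            fused[:, y:y + patch, x:x + patch] = (
                (1.0 - w) * f_rgb[:, y:y + patch, x:x + patch]
                + w * f_depth[:, y:y + patch, x:x + patch]
            )
    return fused
```

Because the correlation is computed per patch rather than globally, this kind of fusion can be applied at every scale of a feature pyramid, including high-resolution maps, which is the property the abstract highlights.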

Published

2022-06-28

How to Cite

Tang, T., Du, H., Yu, X., & Yang, Y. (2022). Monocular Camera-Based Point-Goal Navigation by Learning Depth Channel and Cross-Modality Pyramid Fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 36(5), 5422-5430. https://doi.org/10.1609/aaai.v36i5.20480

Section

AAAI Technical Track on Intelligent Robotics