CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Authors

  • Xiulong Liu University of Washington, Seattle, WA
  • Sudipta Paul Samsung Research America, Mountain View, CA
  • Moitreya Chatterjee Mitsubishi Electric Research Labs, Cambridge, MA
  • Anoop Cherian Mitsubishi Electric Research Labs, Cambridge, MA

DOI:

https://doi.org/10.1609/aaai.v38i4.28167

Keywords:

CV: Multi-modal Vision, CV: Language and Vision, ML: Multimodal Learning, ROB: Human-Robot Interaction

Abstract

Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset, AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only unidirectional interaction.

Published

2024-03-24

How to Cite

Liu, X., Paul, S., Chatterjee, M., & Cherian, A. (2024). CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3765-3773. https://doi.org/10.1609/aaai.v38i4.28167

Section

AAAI Technical Track on Computer Vision III