Fan, J., Chen, P., Li, C., Du, Q., Chen, J., & Tan, M. (2026). NaVLA$^2$: A Vision-Language-Audio-Action Model for Multimodal Instruction Navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18234–18242. https://doi.org/10.1609/aaai.v40i22.38886