Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

Authors

  • Mehran Tamjidi, University of Technology Sydney
  • Hamidreza Dastmalchi, York University
  • Mohammadreza Alimoradijazi, University of New South Wales
  • Ali Cheraghian, Macquarie University
  • Aijun An, York University
  • Morteza Saberi, University of Technology Sydney

DOI:

https://doi.org/10.1609/aaai.v40i11.37888

Abstract

3D Vision-Language Foundation Models (VLFMs) have demonstrated strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, their performance often degrades in practical scenarios where data are noisy, incomplete, or drawn from distributions that differ from the training data. To address this challenge, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. Uni-Adapter maintains a 3D cache that stores class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability under heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation through similarity scoring. In parallel, a graph-based label smoothing module models inter-prototype similarities to enforce label consistency among related prototypes. Finally, predictions from the original 3D VLFM and the refined 3D cache are unified through entropy-weighted aggregation to ensure reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts and achieves state-of-the-art performance across diverse 3D benchmarks and multiple 3D VLFMs, improving performance on ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
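The final fusion step described in the abstract (combining the 3D VLFM's zero-shot prediction with the prototype-cache prediction via entropy weighting) can be illustrated with a minimal sketch. All names, shapes, and the single-prototype-per-class simplification are hypothetical; this is not the authors' implementation, which maintains multiple dynamically updated cluster centers per class and a graph-based smoothing module.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy of a probability vector
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def fuse_predictions(vlfm_logits, feature, prototypes):
    """Entropy-weighted fusion of zero-shot and cache-based predictions.

    vlfm_logits : (C,) zero-shot logits from the 3D VLFM
    feature     : (D,) L2-normalized test point-cloud embedding
    prototypes  : (C, D) one L2-normalized prototype per class
                  (a real cache would hold several cluster centers per class)
    """
    cache_logits = prototypes @ feature          # cosine similarity per class
    p_vlfm, p_cache = softmax(vlfm_logits), softmax(cache_logits)
    # lower entropy => more confident branch => larger weight
    w_vlfm, w_cache = np.exp(-entropy(p_vlfm)), np.exp(-entropy(p_cache))
    return (w_vlfm * p_vlfm + w_cache * p_cache) / (w_vlfm + w_cache)

# toy example: 3 classes, 4-dim features
rng = np.random.default_rng(0)
protos = rng.normal(size=(3, 4))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
feat = protos[1] + 0.1 * rng.normal(size=4)      # sample near class-1 prototype
feat /= np.linalg.norm(feat)
fused = fuse_predictions(np.array([0.2, 1.5, 0.1]), feat, protos)
print(int(np.argmax(fused)))
```

The fused output is a valid probability distribution, and whichever branch is more confident (lower entropy) dominates the combination, which is the intuition behind using entropy weighting for reliable online adaptation.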

Published

2026-03-14

How to Cite

Tamjidi, M., Dastmalchi, H., Alimoradijazi, M., Cheraghian, A., An, A., & Saberi, M. (2026). Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9296–9304. https://doi.org/10.1609/aaai.v40i11.37888

Section

AAAI Technical Track on Computer Vision VIII