STOLA: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios
DOI:
https://doi.org/10.1609/aaai.v40i22.38882
Abstract
This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing touch-language models often treat touch as a mere sub-modality of language without further addressing the semantic differences, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show SToLa exhibits competitive performance compared to existing models on the PHYSICLEAR benchmark and self-constructed datasets, proving the effectiveness of the Mixture of Experts architecture in multimodal management and the performance advantages for open-scenario tactile commonsense reasoning tasks.
Published
2026-03-14
How to Cite
Cheng, N., Xu, J., Chen, J., Fang, B., & Han, W. (2026). STOLA: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18198–18206. https://doi.org/10.1609/aaai.v40i22.38882
Issue
Section
AAAI Technical Track on Intelligent Robotics