SToLa: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios

Authors

  • Ning Cheng, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
  • Jinan Xu, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
  • Jialing Chen, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
  • Bin Fang, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Wenjuan Han, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China

DOI:

https://doi.org/10.1609/aaai.v40i22.38882

Abstract

This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing touch-language models often treat touch as a mere sub-modality of language without addressing the semantic differences between the two, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa uses a Mixture of Experts (MoE) to dynamically process, unify, and manage the tactile and language modalities while capturing their unique characteristics. We also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show that SToLa performs competitively with existing models on the PHYSICLEAR benchmark and on our self-constructed datasets, supporting the effectiveness of the Mixture of Experts architecture for multimodal management and its performance advantages in open-scenario tactile commonsense reasoning tasks.
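
The abstract does not describe the routing mechanism in detail, so the following is only a generic illustration of how a top-k Mixture of Experts layer can send tokens from different modalities to different experts. It is a minimal sketch, not the SToLa architecture: the class name ToyMoELayer, the dimensions, the number of experts, the top-k value, the random weights, and the example token vectors are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix: out_dim rows, each with in_dim weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyMoELayer:
    """Hypothetical top-k MoE layer (illustrative, not the paper's model)."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # The router scores every expert for a given token.
        self.router = make_linear(dim, num_experts, rng)
        # Each expert is a single linear map here, purely for brevity.
        self.experts = [make_linear(dim, dim, rng)
                        for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, gate_weight) pairs for the top-k experts."""
        probs = softmax(apply_linear(self.router, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i],
                        reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        # Renormalise so the selected gate weights sum to 1.
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Combine the selected experts' outputs, weighted by their gates."""
        out = [0.0] * len(token)
        for idx, gate in self.route(token):
            expert_out = apply_linear(self.experts[idx], token)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out


layer = ToyMoELayer(dim=4)
# Made-up embeddings standing in for one tactile token and one text token.
touch_token = [0.9, -0.2, 0.4, 0.1]
text_token = [-0.3, 0.8, 0.0, 0.5]
for name, token in (("touch", touch_token), ("text", text_token)):
    print(name, layer.route(token))
```

In a trained model the router weights are learned, which is what would let tokens from the tactile and language modalities settle on different experts; with the random weights used here the printed assignments carry no meaning beyond showing the mechanics.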

Published

2026-03-14

How to Cite

Cheng, N., Xu, J., Chen, J., Fang, B., & Han, W. (2026). SToLa: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18198–18206. https://doi.org/10.1609/aaai.v40i22.38882

Section

AAAI Technical Track on Intelligent Robotics