SToLa: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios

Ning Cheng; Jinan Xu; Jialing Chen; Bin Fang; Wenjuan Han

doi:10.1609/aaai.v40i22.38882

Authors

Ning Cheng, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
Jinan Xu, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
Jialing Chen, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
Bin Fang, Beijing University of Posts and Telecommunications, Beijing 100876, China
Wenjuan Han, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China

Abstract

This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing touch-language models often treat touch as a mere sub-modality of language without further addressing the semantic differences, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show SToLa exhibits competitive performance compared to existing models on the PHYSICLEAR benchmark and self-constructed datasets, proving the effectiveness of the Mixture of Experts architecture in multimodal management and the performance advantages for open-scenario tactile commonsense reasoning tasks.
