SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li; Renshan Zhang; Rui Shao; Zhijian Fang; Kaiwen Zhou; Zhuotao Tian; Liqiang Nie

doi:10.1609/aaai.v40i22.38904

Authors

Wei Li, Harbin Institute of Technology (Shenzhen)
Renshan Zhang, Harbin Institute of Technology (Shenzhen)
Rui Shao, Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute
Zhijian Fang, Harbin Institute of Technology (Shenzhen)
Kaiwen Zhou, Huawei Noah's Ark Lab
Zhuotao Tian, Harbin Institute of Technology (Shenzhen)
Liqiang Nie, Harbin Institute of Technology (Shenzhen)

DOI:

https://doi.org/10.1609/aaai.v40i22.38904

Abstract

Vision-Language-Action (VLA) models have advanced robotic manipulation, yet practical deployment remains hindered by two key limitations: **1) perceptual redundancy**, where irrelevant visual inputs are processed inefficiently, and **2) superficial instruction-vision alignment**, which hampers semantic grounding of actions. In this paper, we propose **SemanticVLA**, a novel VLA framework that performs semantic-aligned sparsification and enhancement for efficient robotic manipulation. Specifically: **1)** To sparsify redundant perception while preserving semantic alignment, the **Semantic-guided Dual Visual Pruner (SD-Pruner)** combines two components: the Instruction-driven Pruner (ID-Pruner), which extracts global action cues and local semantic anchors in SigLIP, and the Spatial-aggregation Pruner (SA-Pruner), which compacts geometry-rich features into task-adaptive tokens in DINOv2. **2)** To exploit sparsified features and integrate semantics with spatial geometry, the **Semantic-complementary Hierarchical Fuser (SH-Fuser)** fuses dense patches and sparse tokens across SigLIP and DINOv2 into a coherent representation. **3)** To enhance the transformation from perception to action, the **Semantic-conditioned Action Coupler (SA-Coupler)** replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by **21.1%** in success rate, while reducing training cost and inference latency by **3.0×** and **2.7×**, respectively.
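
To make the idea of instruction-guided visual sparsification concrete, the following is a minimal sketch of generic top-k token pruning by cosine similarity to an instruction embedding. It is not the SD-Pruner, ID-Pruner, or SA-Pruner described above, whose details the abstract does not give. The function names `cosine` and `prune_tokens`, the `keep_ratio` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_tokens(visual_tokens, instruction_embedding, keep_ratio=0.25):
    """Keep the visual tokens most similar to the instruction embedding.

    Hypothetical illustration only: a generic top-k selection, not the
    pruning rule used by SemanticVLA.
    """
    # Score every visual token against the instruction embedding.
    scores = [cosine(token, instruction_embedding) for token in visual_tokens]
    # Keep at least one token, otherwise a fixed fraction of the input.
    num_keep = max(1, round(keep_ratio * len(visual_tokens)))
    # Rank token indices by descending score and take the top num_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so spatial ordering of kept tokens is preserved.
    keep_idx = sorted(ranked[:num_keep])
    return [visual_tokens[i] for i in keep_idx], keep_idx


# Toy example: four one-hot "visual tokens" and a made-up instruction embedding.
tokens = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
instruction = [0.1, 0.9, 0.0, 0.5]
kept, kept_idx = prune_tokens(tokens, instruction, keep_ratio=0.5)
print(kept_idx)
```

In this toy case the second and fourth tokens score highest against the instruction vector, so half of the tokens are dropped before any downstream computation. A real system would operate on encoder features (for example from SigLIP or DINOv2) rather than hand-written vectors.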
