SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition
DOI: https://doi.org/10.1609/aaai.v40i3.37194

Abstract
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.

Published
2026-03-14
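As general background (not the paper's implementation), the binary spike-based representations the abstract refers to are typically produced by leaky integrate-and-fire (LIF) neurons, which emit a 0/1 spike whenever their membrane potential crosses a threshold. A minimal sketch, with hypothetical parameter values:

```python
import numpy as np

def lif_forward(x, tau=2.0, v_th=1.0):
    """Simulate a layer of leaky integrate-and-fire (LIF) neurons over time.

    x: input currents, shape (T, N) for T time steps and N neurons.
    Returns a binary spike train of the same shape.
    tau (leak constant) and v_th (firing threshold) are illustrative
    defaults, not values taken from the paper.
    """
    T, N = x.shape
    v = np.zeros(N)              # membrane potential per neuron
    spikes = np.zeros((T, N))
    for t in range(T):
        v = v + (x[t] - v) / tau           # leaky integration of input
        s = (v >= v_th).astype(float)      # binary spike where threshold crossed
        spikes[t] = s
        v = v * (1.0 - s)                  # hard reset of neurons that fired
    return spikes
```

Because every activation is a binary spike, downstream attention and MLP layers can replace dense multiply-accumulate operations with sparse, event-driven additions, which is the source of the energy efficiency the abstract highlights.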
How to Cite
Wang, J., Yu, L., Shen, X., Guo, S., Zhou, C., Zhao, L., Zhong, Y., Zhang, Z., & Ma, Z. (2026). SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2119-2127. https://doi.org/10.1609/aaai.v40i3.37194
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems