SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition

Authors

  • Jiaqi Wang (Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory)
  • Liutao Yu (Pengcheng Laboratory)
  • Xiongri Shen (Harbin Institute of Technology, Shenzhen)
  • Sihang Guo (Harbin Institute of Technology, Shenzhen)
  • Chenlin Zhou (Peking University; Pengcheng Laboratory)
  • Leilei Zhao (Harbin Institute of Technology, Shenzhen)
  • Yi Zhong (Harbin Institute of Technology, Shenzhen; Great Bay University)
  • Zhiguo Zhang (Harbin Institute of Technology, Shenzhen)
  • Zhengyu Ma (Pengcheng Laboratory)

DOI:

https://doi.org/10.1609/aaai.v40i3.37194

Abstract

Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
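The abstract's core idea, binary spike activations feeding a transformer-style attention so that activation-activation products reduce to accumulations, can be illustrated with a minimal numpy sketch. This is not the paper's MSTASA module; `lif_spikes` and `spike_attention` are hypothetical names for a generic leaky integrate-and-fire binarization and a softmax-free spike attention, written only to convey the event-driven computation pattern the abstract describes.

```python
import numpy as np

def lif_spikes(x, threshold=1.0, decay=0.5):
    """Leaky integrate-and-fire over the time axis: the membrane potential
    accumulates input with decay, emits a binary spike on crossing the
    threshold, then hard-resets. x has shape (T, D): T time steps, D features."""
    v = np.zeros(x.shape[1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        v = decay * v + x[t]
        spike = (v >= threshold).astype(x.dtype)
        out[t] = spike
        v = v * (1.0 - spike)  # reset fired neurons to zero
    return out

def spike_attention(x, Wq, Wk, Wv):
    """Toy spike-driven attention: Q, K, V are binarized by LIF neurons,
    so q @ k.T counts spike coincidences (additions only between
    activations). Softmax is dropped and replaced by a 1/D scale, as in
    many spike-driven transformer designs."""
    T, D = x.shape
    q = lif_spikes(x @ Wq)
    k = lif_spikes(x @ Wk)
    v = lif_spikes(x @ Wv)
    attn = (q @ k.T) / D  # integer-valued spike correlations, scaled
    return attn @ v

rng = np.random.default_rng(0)
T, D = 8, 16
x = rng.normal(size=(T, D))
Wq, Wk, Wv = (0.5 * rng.normal(size=(D, D)) for _ in range(3))
y = spike_attention(x, Wq, Wk, Wv)
print(y.shape)  # (8, 16)
```

Because every intermediate activation is a 0/1 spike, the `q @ k.T` product is a coincidence count; this is the mechanism behind the energy-efficiency claim, though the actual MSTASA module adds temporal-aware attention and multi-view learning on top of it.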

Published

2026-03-14

How to Cite

Wang, J., Yu, L., Shen, X., Guo, S., Zhou, C., Zhao, L., Zhong, Y., Zhang, Z., & Ma, Z. (2026). SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2119-2127. https://doi.org/10.1609/aaai.v40i3.37194

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems