DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification


  • Tony Alex University of Surrey
  • Sara Ahmed University of Surrey
  • Armin Mustafa University of Surrey
  • Muhammad Awais University of Surrey
  • Philip JB Jackson University of Surrey




NLP: Speech, ML: Deep Neural Architectures and Foundation Models


Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks, following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored to the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer), which facilitates interactions across the time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available at https://github.com/ta012/DTFAT.git
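To make the idea of decoupled time-frequency interactions concrete, the sketch below shows axial self-attention applied separately along the time axis and then the frequency axis of a spectrogram feature map. This is an illustrative simplification, not the paper's actual MaxViT-based blocks: it uses a single head, no learned query/key/value projections, and NumPy rather than a deep learning framework, and the function name `axial_attention` is our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis):
    # x: (freq_bins, time_frames, channels) spectrogram feature map.
    # Attend along one axis only (0 = frequency, 1 = time), treating
    # the other axis as a batch dimension -- the "decoupled" idea.
    x = np.moveaxis(x, axis, 1)               # (batch, length, channels)
    scores = x @ x.swapaxes(-1, -2)           # (batch, length, length)
    weights = softmax(scores / np.sqrt(x.shape[-1]), axis=-1)
    out = weights @ x                         # weighted sum along the axis
    return np.moveaxis(out, 1, axis)          # restore original layout

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 4))       # 8 freq bins, 16 frames, 4 channels
out = axial_attention(axial_attention(feats, axis=1), axis=0)
print(out.shape)                              # (8, 16, 4)
```

Because each pass attends along a single axis, the attention cost is linear in the other axis's length rather than quadratic in the full number of time-frequency positions, which is what makes such decoupled designs attractive for long spectrograms.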



How to Cite

Alex, T., Ahmed, S., Mustafa, A., Awais, M., & Jackson, P. J. (2024). DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17647-17655. https://doi.org/10.1609/aaai.v38i16.29716



AAAI Technical Track on Natural Language Processing I