DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification
DOI:
https://doi.org/10.1609/aaai.v38i16.29716
Keywords:
NLP: Speech, ML: Deep Neural Architectures and Foundation Models
Abstract
Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks, following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored to the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer), which facilitates interactions across the time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available at https://github.com/ta012/DTFAT.git
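To illustrate the decoupled time-frequency interaction mentioned in the abstract, the sketch below shows one way attention can be applied separately along the time axis and the frequency axis of a spectrogram feature map. This is not the authors' implementation (DTF-AT builds on MaxViT and also mixes spatial and channel dimensions); it is a minimal PyTorch sketch under assumed conventions, and all module and parameter names (AxialAttention, DecoupledTFBlock) are hypothetical.

import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """Self-attention applied independently along one axis (time or
    frequency) of a spectrogram feature map shaped (batch, channels, freq, time)."""

    def __init__(self, dim: int, num_heads: int = 4, axis: str = "time"):
        super().__init__()
        assert axis in ("time", "frequency")
        self.axis = axis
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        if self.axis == "time":
            # each frequency bin becomes an independent sequence over time frames
            seq = x.permute(0, 2, 3, 1).reshape(b * f, t, c)
        else:
            # each time frame becomes an independent sequence over frequency bins
            seq = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h)
        seq = seq + attn_out  # residual connection
        if self.axis == "time":
            return seq.reshape(b, f, t, c).permute(0, 3, 1, 2)
        return seq.reshape(b, t, f, c).permute(0, 3, 2, 1)


class DecoupledTFBlock(nn.Module):
    """Applies time-axis attention followed by frequency-axis attention,
    so the two axes of the spectrogram are modelled by separate attention steps."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.time_attn = AxialAttention(dim, num_heads, axis="time")
        self.freq_attn = AxialAttention(dim, num_heads, axis="frequency")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.freq_attn(self.time_attn(x))


if __name__ == "__main__":
    # toy feature map: batch=2, 64 channels, 16 mel bins, 50 time frames
    feats = torch.randn(2, 64, 16, 50)
    block = DecoupledTFBlock(dim=64, num_heads=4)
    print(block(feats).shape)  # torch.Size([2, 64, 16, 50])

Because each axis is attended to separately, the sequence lengths stay short (number of frames or number of bins) rather than growing with their product, which is the usual motivation for decomposed attention of this kind.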
Published
2024-03-24
How to Cite
Alex, T., Ahmed, S., Mustafa, A., Awais, M., & Jackson, P. J. (2024). DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17647-17655. https://doi.org/10.1609/aaai.v38i16.29716
Issue
Vol. 38 No. 16 (2024)
Section
AAAI Technical Track on Natural Language Processing I