DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification
DOI:
https://doi.org/10.1609/aaai.v38i16.29716
Keywords:
NLP: Speech, ML: Deep Neural Architectures and Foundation Models
Abstract
Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks, following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored to the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer), which facilitates interactions across the time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available at https://github.com/ta012/DTFAT.git
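To illustrate the decoupled time-frequency interaction mentioned in the abstract, the sketch below shows one way attention can be applied separately along the time axis and the frequency axis of a spectrogram feature map. This is not the authors' implementation (DTF-AT builds on MaxViT and also mixes spatial and channel dimensions); it is a minimal PyTorch sketch under assumed conventions, and all module and parameter names (AxialAttention, DecoupledTFBlock) are hypothetical.

import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """Self-attention applied independently along one axis (time or
    frequency) of a spectrogram feature map shaped (batch, channels, freq, time)."""

    def __init__(self, dim: int, num_heads: int = 4, axis: str = "time"):
        super().__init__()
        assert axis in ("time", "frequency")
        self.axis = axis
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        if self.axis == "time":
            # each frequency bin becomes an independent sequence over time frames
            seq = x.permute(0, 2, 3, 1).reshape(b * f, t, c)
        else:
            # each time frame becomes an independent sequence over frequency bins
            seq = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h)
        seq = seq + attn_out  # residual connection
        if self.axis == "time":
            return seq.reshape(b, f, t, c).permute(0, 3, 1, 2)
        return seq.reshape(b, t, f, c).permute(0, 3, 2, 1)


class DecoupledTFBlock(nn.Module):
    """Applies time-axis attention followed by frequency-axis attention,
    so the two axes of the spectrogram are modelled by separate attention steps."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.time_attn = AxialAttention(dim, num_heads, axis="time")
        self.freq_attn = AxialAttention(dim, num_heads, axis="frequency")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.freq_attn(self.time_attn(x))


if __name__ == "__main__":
    # toy feature map: batch=2, 64 channels, 16 mel bins, 50 time frames
    feats = torch.randn(2, 64, 16, 50)
    block = DecoupledTFBlock(dim=64, num_heads=4)
    print(block(feats).shape)  # torch.Size([2, 64, 16, 50])

Because each axis is attended to separately, the sequence lengths stay short (number of frames or number of bins) rather than growing with their product, which is the usual motivation for decomposed attention of this kind.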
Published
2024-03-24
How to Cite
Alex, T., Ahmed, S., Mustafa, A., Awais, M., & Jackson, P. J. (2024). DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17647-17655. https://doi.org/10.1609/aaai.v38i16.29716
Issue
Vol. 38 No. 16 (2024)
Section
AAAI Technical Track on Natural Language Processing I