No Head Left Behind – Multi-Head Alignment Distillation for Transformers

Authors

  • Tianyang Zhao (AWS AI Labs; University of California, Los Angeles)
  • Kunwar Yashraj Singh (AWS AI Labs)
  • Srikar Appalaraju (AWS AI Labs)
  • Peng Tang (AWS AI Labs)
  • Vijay Mahadevan (AWS AI Labs)
  • R. Manmatha (AWS AI Labs)
  • Ying Nian Wu (AWS AI Labs; University of California, Los Angeles)

DOI:

https://doi.org/10.1609/aaai.v38i7.28583

Keywords:

CV: Language and Vision, ML: Deep Learning Algorithms, ML: Representation Learning, NLP: Language Grounding & Multi-modal NLP

Abstract

Knowledge distillation aims to reduce model size without substantially compromising performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that the attention maps in their multi-head attention modules contain extensive intra-modal and cross-modal co-reference relations worth distilling. The standard approach is a one-to-one attention map distillation loss, i.e., the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the Teacher and Student have the same number of attention heads. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align the heads in the Teacher and Student attention maps using cosine similarity weighting. A Teacher head contributes more to the Student heads with which it has a higher similarity weight, and every Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions of the soft-aligned heads. No head is left behind. This distillation operates like cross-attention. We experiment with distilling VL-T5 and BLIP, applying the AMAD loss to their T5, BERT, and ViT sub-modules. In the vision-language setting, AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned with ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training in BLIP models.
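
The soft-alignment idea described above can be sketched in code. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes the alignment weights are a softmax over cosine similarities between flattened Teacher and Student attention maps, that the per-pair divergence is a KL term between attention distributions, and that Teacher and Student layers and sequence lengths have already been matched; the function name `amad_loss`, the temperature `tau`, and these normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F


def amad_loss(teacher_attn, student_attn, tau=1.0, eps=1e-8):
    """Hypothetical sketch of an attention-map alignment distillation loss.

    teacher_attn: [B, H_t, N, N] softmax attention maps from the Teacher.
    student_attn: [B, H_s, N, N] softmax attention maps from the Student.
    H_t and H_s may differ.
    """
    B, Ht, N, _ = teacher_attn.shape
    _, Hs, _, _ = student_attn.shape

    # Flatten each head's attention map into a vector for similarity scoring.
    t_flat = F.normalize(teacher_attn.reshape(B, Ht, -1), dim=-1)  # [B, H_t, N*N]
    s_flat = F.normalize(student_attn.reshape(B, Hs, -1), dim=-1)  # [B, H_s, N*N]

    # Cosine similarity between every Teacher head and every Student head.
    sim = torch.einsum('bif,bjf->bij', t_flat, s_flat)             # [B, H_t, H_s]

    # Soft alignment: each Teacher head distributes its weight over all
    # Student heads (cross-attention-like weighting); no head is left behind.
    align = F.softmax(sim / tau, dim=-1)                           # [B, H_t, H_s]

    # KL divergence between the attention distributions of every
    # Teacher/Student head pair, averaged over query positions.
    kl = (teacher_attn.unsqueeze(2) *
          (torch.log(teacher_attn.unsqueeze(2) + eps) -
           torch.log(student_attn.unsqueeze(1) + eps))).sum(-1)    # [B, H_t, H_s, N]
    kl = kl.mean(-1)                                               # [B, H_t, H_s]

    # Weight each pairwise divergence by its soft-alignment score.
    return (align * kl).sum(-1).mean()
```

In the paper, a loss of this kind is applied to the T5, BERT, and ViT sub-modules of VL-T5 and BLIP alongside the usual task objectives.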

Published

2024-03-24

How to Cite

Zhao, T., Singh, K. Y., Appalaraju, S., Tang, P., Mahadevan, V., Manmatha, R., & Wu, Y. N. (2024). No Head Left Behind – Multi-Head Alignment Distillation for Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7514-7524. https://doi.org/10.1609/aaai.v38i7.28583

Section

AAAI Technical Track on Computer Vision VI