No Head Left Behind – Multi-Head Alignment Distillation for Transformers

Authors

  • Tianyang Zhao (AWS AI Labs; University of California, Los Angeles)
  • Kunwar Yashraj Singh (AWS AI Labs)
  • Srikar Appalaraju (AWS AI Labs)
  • Peng Tang (AWS AI Labs)
  • Vijay Mahadevan (AWS AI Labs)
  • R. Manmatha (AWS AI Labs)
  • Ying Nian Wu (AWS AI Labs; University of California, Los Angeles)

DOI:

https://doi.org/10.1609/aaai.v38i7.28583

Keywords:

CV: Language and Vision, ML: Deep Learning Algorithms, ML: Representation Learning, NLP: Language Grounding & Multi-modal NLP

Abstract

Knowledge distillation aims to reduce model size without substantially compromising performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that the attention maps in their multi-head attention modules contain extensive intra-modal and cross-modal co-reference relations worth distilling. The standard approach is a one-to-one attention map distillation loss, i.e., the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the Teacher and Student have the same number of attention heads. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align the heads in the Teacher and Student attention maps using cosine similarity weighting. A Teacher head contributes more to the Student heads with which it has a higher similarity weight, and every Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions of the soft-aligned heads. No head is left behind. This distillation operates like cross-attention. We experiment with distilling VL-T5 and BLIP, applying the AMAD loss to their T5, BERT, and ViT sub-modules. In the vision-language setting, AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned with ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training in BLIP models.
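
The soft-alignment idea described above can be sketched in code. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes the alignment weights are a softmax over cosine similarities between flattened Teacher and Student attention maps, that the per-pair divergence is a KL term between attention distributions, and that Teacher and Student layers and sequence lengths have already been matched; the function name `amad_loss`, the temperature `tau`, and these normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F


def amad_loss(teacher_attn, student_attn, tau=1.0, eps=1e-8):
    """Hypothetical sketch of an attention-map alignment distillation loss.

    teacher_attn: [B, H_t, N, N] softmax attention maps from the Teacher.
    student_attn: [B, H_s, N, N] softmax attention maps from the Student.
    H_t and H_s may differ.
    """
    B, Ht, N, _ = teacher_attn.shape
    _, Hs, _, _ = student_attn.shape

    # Flatten each head's attention map into a vector for similarity scoring.
    t_flat = F.normalize(teacher_attn.reshape(B, Ht, -1), dim=-1)  # [B, H_t, N*N]
    s_flat = F.normalize(student_attn.reshape(B, Hs, -1), dim=-1)  # [B, H_s, N*N]

    # Cosine similarity between every Teacher head and every Student head.
    sim = torch.einsum('bif,bjf->bij', t_flat, s_flat)             # [B, H_t, H_s]

    # Soft alignment: each Teacher head distributes its weight over all
    # Student heads (cross-attention-like weighting); no head is left behind.
    align = F.softmax(sim / tau, dim=-1)                           # [B, H_t, H_s]

    # KL divergence between the attention distributions of every
    # Teacher/Student head pair, averaged over query positions.
    kl = (teacher_attn.unsqueeze(2) *
          (torch.log(teacher_attn.unsqueeze(2) + eps) -
           torch.log(student_attn.unsqueeze(1) + eps))).sum(-1)    # [B, H_t, H_s, N]
    kl = kl.mean(-1)                                               # [B, H_t, H_s]

    # Weight each pairwise divergence by its soft-alignment score.
    return (align * kl).sum(-1).mean()
```

In the paper, a loss of this kind is applied to the T5, BERT, and ViT sub-modules of VL-T5 and BLIP alongside the usual task objectives.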

Published

2024-03-24

How to Cite

Zhao, T., Singh, K. Y., Appalaraju, S., Tang, P., Mahadevan, V., Manmatha, R., & Wu, Y. N. (2024). No Head Left Behind – Multi-Head Alignment Distillation for Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7514-7524. https://doi.org/10.1609/aaai.v38i7.28583

Section

AAAI Technical Track on Computer Vision VI