(1) Pande, M.; Budhraja, A.; Nema, P.; Kumar, P.; Khapra, M. M. The Heads Hypothesis: A Unifying Statistical Approach Towards Understanding Multi-Headed Attention in BERT. AAAI 2021, 35, 13613-13621.