Pande, M., Budhraja, A., Nema, P., Kumar, P., & Khapra, M. M. (2021). The Heads Hypothesis: A Unifying Statistical Approach Towards Understanding Multi-Headed Attention in BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13613-13621. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17605