Pande, M., A. Budhraja, P. Nema, P. Kumar, and M. M. Khapra. “The Heads Hypothesis: A Unifying Statistical Approach Towards Understanding Multi-Headed Attention in BERT.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, May 2021, pp. 13613-21, doi:10.1609/aaai.v35i15.17605.