Pande, Madhura, et al. “The Heads Hypothesis: A Unifying Statistical Approach Towards Understanding Multi-Headed Attention in BERT.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 15, May 2021, pp. 13613–21, doi:10.1609/aaai.v35i15.17605.