On the Expressive Flexibility of Self-Attention Matrices


  • Valerii Likhosherstov University of Cambridge
  • Krzysztof Choromanski Google Brain
  • Adrian Weller University of Cambridge & The Alan Turing Institute




ML: Deep Learning Theory, ML: Kernel Methods


Transformer networks are able to capture patterns in data from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. We ask the following questions: Can the self-attention matrix approximate arbitrary patterns? How small is the query dimension d required for such approximation? Our first result shows that the task of deciding whether approximation of a given pattern is possible or not is NP-hard for a fixed d greater than one. In practice, the self-attention matrix typically exhibits two properties: it is sparse, and it changes dynamically depending on the input to the module. Motivated by this observation, we show that the self-attention matrix can provably approximate sparse matrices. While the parameters of self-attention are fixed, various sparse matrices can be approximated by modifying only the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d need only grow logarithmically with the sequence length n.
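To make the object of study concrete, the following is a minimal numpy sketch of the standard scaled dot-product self-attention matrix A = softmax(QKᵀ/√d), the n × n row-stochastic matrix the abstract refers to. All names (n, d_model, d, W_Q, W_K) are illustrative assumptions, not notation from the paper itself:

```python
import numpy as np

def self_attention_matrix(X, W_Q, W_K):
    """Return the n x n self-attention matrix A = softmax(Q K^T / sqrt(d)),
    where Q = X W_Q and K = X W_K are the queries/keys of dimension d.
    Each row of A is a probability distribution over the n sequence elements."""
    Q, K = X @ W_Q, X @ W_K
    d = Q.shape[1]  # the query dimension analyzed in the paper
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax; subtract the row max for numerical stability.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d = 8, 16, 4  # sequence length, embedding dim, query dimension
X = rng.standard_normal((n, d_model))          # input sequence
W_Q = rng.standard_normal((d_model, d))        # fixed query projection
W_K = rng.standard_normal((d_model, d))        # fixed key projection
A = self_attention_matrix(X, W_Q, W_K)
assert A.shape == (n, n) and np.allclose(A.sum(axis=1), 1.0)
```

Note that W_Q and W_K stay fixed while X varies; this is exactly the setting of the paper's main result, where different sparse patterns are approximated by changing only the inputs.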




How to Cite

Likhosherstov, V., Choromanski, K., & Weller, A. (2023). On the Expressive Flexibility of Self-Attention Matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7), 8773-8781. https://doi.org/10.1609/aaai.v37i7.26055



AAAI Technical Track on Machine Learning II