On the Expressive Flexibility of Self-Attention Matrices

Authors

  • Valerii Likhosherstov, University of Cambridge
  • Krzysztof Choromanski, Google Brain
  • Adrian Weller, University of Cambridge; The Alan Turing Institute

DOI:

https://doi.org/10.1609/aaai.v37i7.26055

Keywords:

ML: Deep Learning Theory, ML: Kernel Methods

Abstract

Transformer networks can capture patterns in data from many domains (text, images, videos, proteins, etc.) with little or no change to their architectural components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e., the self-attention matrix. We ask the following questions: can the self-attention matrix approximate arbitrary patterns, and how small a query dimension d is required for such approximation? Our first result shows that deciding whether a given pattern can be approximated is NP-hard for any fixed d greater than one. In practice, the self-attention matrix typically exhibits two properties: it is sparse, and it changes dynamically depending on the input to the module. Motivated by this observation, we show that the self-attention matrix can provably approximate sparse matrices. Moreover, while the parameters of self-attention are held fixed, a wide range of sparse matrices can be approximated by modifying only the inputs. Our proof relies on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. In particular, we show that, in order to approximate any sparse matrix up to a given precision, defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length n.
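
For intuition, here is a minimal NumPy sketch of the two objects the abstract refers to: the row-stochastic self-attention matrix softmax(QK^T / sqrt(d)), and a Johnson-Lindenstrauss-style Gaussian random projection whose target dimension grows only logarithmically with n. All sizes and the weight matrices W_Q, W_K below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def attention_matrix(X, W_Q, W_K):
    """Row-stochastic matrix softmax(Q K^T / sqrt(d)), with Q = X W_Q, K = X W_K."""
    Q, K = X @ W_Q, X @ W_K                      # queries and keys, shape (n, d)
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d)                # (n, n) pre-softmax attention pattern
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax numerically
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d = 64, 256, 8                       # assumed toy sizes
X = rng.normal(size=(n, d_model))                # one input sequence
A = attention_matrix(X,
                     rng.normal(size=(d_model, d)),
                     rng.normal(size=(d_model, d)))
assert np.allclose(A.sum(axis=1), 1.0)           # each row is a distribution over positions

# Johnson-Lindenstrauss flavour of the proof technique: a random Gaussian map
# into d_jl = O(log n / eps^2) dimensions roughly preserves the geometry of n points.
eps = 0.5
d_jl = int(np.ceil(8 * np.log(n) / eps**2))      # a typical JL target dimension
R = rng.normal(size=(d_model, d_jl)) / np.sqrt(d_jl)
ratios = np.linalg.norm(X @ R, axis=1) / np.linalg.norm(X, axis=1)
print(ratios.min(), ratios.max())                # concentrated near 1 for all n points
```

The sketch only illustrates why logarithmic growth of the projection dimension is plausible; the paper's actual result concerns approximating sparse attention patterns up to element-ratio precision by choosing inputs, with the self-attention parameters held fixed.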

Published

2023-06-26

How to Cite

Likhosherstov, V., Choromanski, K., & Weller, A. (2023). On the Expressive Flexibility of Self-Attention Matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7), 8773-8781. https://doi.org/10.1609/aaai.v37i7.26055

Section

AAAI Technical Track on Machine Learning II