On the Expressive Flexibility of Self-Attention Matrices

Valerii Likhosherstov; Krzysztof Choromanski; Adrian Weller

doi:10.1609/aaai.v37i7.26055

Authors

Valerii Likhosherstov University of Cambridge
Krzysztof Choromanski Google Brain
Adrian Weller University of Cambridge The Alan Turing Institute

DOI:

https://doi.org/10.1609/aaai.v37i7.26055

Keywords:

ML: Deep Learning Theory, ML: Kernel Methods

Abstract

Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. We ask the following question: Can self-attention matrix approximate arbitrary patterns? How small is the query dimension d required for such approximation? Our first result shows that the task of deciding whether approximation of a given pattern is possible or not is NP-hard for a fixed d greater than one. In practice, self-attention matrix typically exhibits two properties: it is sparse, and it changes dynamically depending on the input to the module. Motivated by this observation, we show that the self-attention matrix can provably approximate sparse matrices. While the parameters of self-attention are fixed, various sparse matrices can be approximated by only modifying the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length n.

On the Expressive Flexibility of Self-Attention Matrices

Authors

DOI:

Keywords:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information

Subscription