TY - JOUR
AU - Likhosherstov, Valerii
AU - Choromanski, Krzysztof
AU - Weller, Adrian
PY - 2023/06/26
Y2 - 2024/04/21
TI - On the Expressive Flexibility of Self-Attention Matrices
JF - Proceedings of the AAAI Conference on Artificial Intelligence
JA - AAAI
VL - 37
IS - 7
SE - AAAI Technical Track on Machine Learning II
DO - 10.1609/aaai.v37i7.26055
UR - https://ojs.aaai.org/index.php/AAAI/article/view/26055
SP - 8773-8781
AB - Transformer networks are able to capture patterns in data coming from many domains (text, images, videos, proteins, etc.) with little or no change to architecture components. We perform a theoretical analysis of the core component responsible for signal propagation between elements, i.e. the self-attention matrix. We ask the following question: Can self-attention matrix approximate arbitrary patterns? How small is the query dimension d required for such approximation? Our first result shows that the task of deciding whether approximation of a given pattern is possible or not is NP-hard for a fixed d greater than one. In practice, self-attention matrix typically exhibits two properties: it is sparse, and it changes dynamically depending on the input to the module. Motivated by this observation, we show that the self-attention matrix can provably approximate sparse matrices. While the parameters of self-attention are fixed, various sparse matrices can be approximated by only modifying the inputs. Our proof is based on the random projection technique and uses the seminal Johnson-Lindenstrauss lemma. In particular, we show that, in order to approximate any sparse matrix up to a given precision defined in terms of preserving matrix element ratios, d grows only logarithmically with the sequence length n.
ER -