Cached Transformers: Improving Transformers with Differentiable Memory Cache
DOI:
https://doi.org/10.1609/aaai.v38i15.29636
Keywords:
ML: Transfer, Domain Adaptation, Multi-Task Learning, ML: Deep Learning Algorithms, ML: Transparent, Interpretable, Explainable ML, ML: Unsupervised & Self-Supervised Learning
Abstract
This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing the model to explore long-range dependencies. By using a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks: language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and can be applied to a broader range of settings.
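The abstract describes two ingredients: a cache that is updated by a recurrent gate, and attention over both cached (past) and current tokens. The NumPy sketch below illustrates that idea only; it is not the authors' implementation. The function names, the average-pooling summary of the current tokens, the gate parameterization `w_gate`, and the scalar `mix_logit` that blends the two attention outputs are all assumptions made for illustration, and learned query/key/value projections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention(q, k, v):
    # Plain scaled dot-product attention (no learned projections in this sketch).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_cache_update(cache, tokens, w_gate):
    """Blend the old cache with a summary of the current tokens.

    cache:  (m, d) cached token embeddings from earlier steps
    tokens: (n, d) current token embeddings, assuming n >= m
    w_gate: (2d, d) hypothetical gate weights
    """
    m = cache.shape[0]
    # Assumption: compress n tokens to m slots by averaging consecutive chunks.
    chunks = np.array_split(tokens, m, axis=0)
    summary = np.stack([c.mean(axis=0) for c in chunks])
    # Per-slot, per-feature gate in (0, 1), computed from old cache and summary.
    gate = sigmoid(np.concatenate([cache, summary], axis=-1) @ w_gate)
    return (1.0 - gate) * cache + gate * summary


def cached_attention(tokens, cache, mix_logit):
    """Attend to current tokens and to the cache, then mix the two outputs."""
    o_self = attention(tokens, tokens, tokens)  # current tokens
    o_mem = attention(tokens, cache, cache)     # past tokens held in the cache
    lam = sigmoid(mix_logit)                    # hypothetical scalar mixing ratio
    return lam * o_mem + (1.0 - lam) * o_self


rng = np.random.default_rng(0)
d, n, m = 16, 8, 4
cache = np.zeros((m, d))
w_gate = rng.normal(scale=0.1, size=(2 * d, d))
for step in range(3):
    tokens = rng.normal(size=(n, d))
    out = cached_attention(tokens, cache, mix_logit=0.0)
    cache = gated_cache_update(cache, tokens, w_gate)
```

Every operation here is differentiable, so in a trained model the gate weights and mixing ratio could be learned by backpropagation; the sketch uses fixed random values only to show the data flow.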
Published
2024-03-24
How to Cite
Zhang, Z., Shao, W., Ge, Y., Wang, X., Gu, J., & Luo, P. (2024). Cached Transformers: Improving Transformers with Differentiable Memory Cache. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15), 16935–16943. https://doi.org/10.1609/aaai.v38i15.29636
Section
AAAI Technical Track on Machine Learning VI