Liu, Chenghao, et al. “TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models”. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 22, Mar. 2026, pp. 18452-9, doi:10.1609/aaai.v40i22.38910.