Liu, C., Zhang, J., Li, C., Zhou, Z., Wu, S., Huang, S., & Duan, H. (2026). TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18452–18459. https://doi.org/10.1609/aaai.v40i22.38910