[1] J. Zhuang, “ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming”, AAAI, vol. 39, no. 10, pp. 11049–11057, Apr. 2025.