D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
DOI:
https://doi.org/10.1609/aaai.v40i24.39080
Abstract
Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D³ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D³ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D³ToM employs a merge ratio that varies dynamically with each denoising step, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D³ToM accelerates inference while preserving competitive performance.
Published
2026-03-14
How to Cite
Chang, S., Zhang, X., Liu, Q., & Niu, L. (2026). D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 19961–19969. https://doi.org/10.1609/aaai.v40i24.39080
Issue
Section
AAAI Technical Track on Machine Learning I
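Illustrative sketch
The merging step summarized in the abstract (score visual tokens with the decider tokens, retain the most salient fraction, merge the rest by similarity) might look roughly like the NumPy sketch below. This is not the authors' implementation. The function name `merge_visual_tokens`, the `keep_ratio` parameter, the use of averaged softmax attention as the importance map, cosine similarity as the similarity measure, and plain averaging as the aggregation rule are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def merge_visual_tokens(visual, decider, keep_ratio):
    """Hypothetical sketch of decider-guided token merging.

    visual:  (N, d) visual token features
    decider: (M, d) features of tokens generated in the previous denoising step
    Returns the merged tokens (K, d) and the indices of the retained tokens.
    """
    n, d = visual.shape
    k = max(1, int(np.ceil(keep_ratio * n)))

    # Importance map (assumed form): softmax attention from each decider
    # token over the visual tokens, averaged across decider tokens.
    logits = decider @ visual.T / np.sqrt(d)            # (M, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    importance = attn.mean(axis=0)                      # (N,)

    # Retain the k most salient tokens, preserving their original order.
    keep_idx = np.sort(np.argsort(-importance)[:k])
    drop_mask = np.ones(n, dtype=bool)
    drop_mask[keep_idx] = False
    drop_idx = np.flatnonzero(drop_mask)

    kept = visual[keep_idx].copy()
    if drop_idx.size == 0:
        return kept, keep_idx

    # Assign each remaining token to its most cosine-similar retained token.
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = unit(visual[drop_idx]) @ unit(kept).T         # (N - K, K)
    target = sim.argmax(axis=1)

    # Aggregate (assumed rule): average each retained token with the
    # tokens merged into it.
    sums = kept.copy()
    counts = np.ones(k)
    np.add.at(sums, target, visual[drop_idx])
    np.add.at(counts, target, 1)
    return sums / counts[:, None], keep_idx


rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))
decider = rng.normal(size=(4, 8))
merged, kept_idx = merge_visual_tokens(visual, decider, keep_ratio=0.25)
print(merged.shape)  # (4, 8)
```

In a step-dependent setting, `keep_ratio` would be supplied by a schedule over denoising steps. The abstract states only that the merge ratio varies with the step, so the form of that schedule is left open here.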