D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

Shuochen Chang; Xiaofeng Zhang; Qingyang Liu; Li Niu

doi:10.1609/aaai.v40i24.39080

Authors

Shuochen Chang MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Xiaofeng Zhang MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Qingyang Liu MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Li Niu MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i24.39080

Abstract

Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D³ToM, a Decider-guided dynamic token merging method that merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D³ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D³ToM employs a merge ratio that varies dynamically with each denoising step, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D³ToM accelerates inference while preserving competitive performance.
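The merging step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`merge_ratio`, `importance_map`, `merge_visual_tokens`), the use of mean decider-to-visual attention as the importance score, cosine similarity with averaging as the aggregation rule, and the linear schedule are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_ratio(step, num_steps, r_max=0.5):
    """Hypothetical step-dependent merge ratio.

    Assumption: a linear schedule from r_max at step 0 down to 0 at the
    last step. The abstract only says the ratio varies per step.
    """
    return r_max * (1 - step / max(1, num_steps - 1))


def importance_map(attn, decider_idx, visual_idx):
    """Score each visual token by the mean attention it receives from decider tokens.

    attn[q][k] is the attention weight from query token q to key token k.
    Assumption: mean attention is used as the importance measure.
    """
    return [
        sum(attn[d][v] for d in decider_idx) / len(decider_idx)
        for v in visual_idx
    ]


def merge_visual_tokens(visual, importance, ratio):
    """Keep the most important tokens and fold the rest into their nearest kept token.

    Returns (merged_tokens, kept_indices). Each merged token is the average
    of a kept token and every dropped token assigned to it.
    """
    n = len(visual)
    n_keep = max(1, n - int(ratio * n))
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    keep = sorted(order[:n_keep])  # preserve original token order
    groups = {k: [visual[k]] for k in keep}
    for j in order[n_keep:]:
        # Assign each dropped token to the most similar kept token.
        target = max(keep, key=lambda k: cosine(visual[j], visual[k]))
        groups[target].append(visual[j])
    merged = []
    for k in keep:
        members = groups[k]
        dim = len(members[0])
        merged.append(
            [sum(m[d] for m in members) / len(members) for d in range(dim)]
        )
    return merged, keep
```

In this sketch the returned sequence is shorter than the input, which mirrors the idea of shortening the visual token sequence for subsequent layers. A real model would operate on batched tensors inside a transformer layer rather than on Python lists.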
