Multimodal Promptable Token Merging for Diffusion Models

Authors

  • Cheng-Yao Hong, Academia Sinica
  • Tyng-Luh Liu, Academia Sinica

DOI:

https://doi.org/10.1609/aaai.v39i16.33894

Abstract

Token compression techniques, such as token merging and pruning, are essential for alleviating the substantial computational burden caused by the proliferation of tokens within attention mechanisms. However, current methods often rely on token-to-token distances or similarity metrics to evaluate token importance, which is inadequate in the context of modern promptable designs and frameworks that are gaining prominence. To address this limitation, we introduce a novel and effective merging strategy called “Multimodal Promptable Token Merging” (MPTM). The proposed method leverages a multimodal, prompt-centric methodology, assessing the proximity between tokens of each input modality and the multimodal prompt to efficiently eliminate redundant tokens while preserving those rich in information. Extensive experiments demonstrate that MPTM significantly reduces computational costs without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% in object detection tasks. Additionally, when applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility.
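
The prompt-centric idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prompt_guided_merge`, the use of cosine similarity as the proximity measure, the max-over-prompt scoring, and the averaging merge rule are all assumptions chosen for illustration. The sketch scores each token by its closest prompt token, keeps the highest-scoring tokens, and folds every remaining token into its most similar kept token.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prompt_guided_merge(tokens, prompt, keep):
    """Hypothetical prompt-guided token merging (illustrative only).

    tokens: (N, d) array of input tokens from one modality.
    prompt: (P, d) array of prompt tokens in the same embedding space.
    keep:   number of tokens to retain, 1 <= keep <= N.

    Returns (merged, kept_idx): a (keep, d) array of merged tokens and
    the original indices of the retained tokens, in ascending order.
    """
    n = tokens.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be in the range 1..N")

    t = l2_normalize(tokens)
    p = l2_normalize(prompt)

    # Assumed importance score: cosine similarity to the closest prompt token.
    scores = (t @ p.T).max(axis=1)

    # Retain the `keep` tokens closest to the prompt; the rest get merged.
    order = np.argsort(-scores, kind="stable")
    kept_idx = np.sort(order[:keep])
    drop_idx = np.sort(order[keep:])

    merged = tokens[kept_idx].astype(float).copy()
    counts = np.ones(keep)

    if drop_idx.size:
        # Assign each dropped token to its most similar retained token.
        assign = (t[drop_idx] @ t[kept_idx].T).argmax(axis=1)
        # np.add.at accumulates correctly when several tokens share a target.
        np.add.at(merged, assign, tokens[drop_idx])
        np.add.at(counts, assign, 1)

    # Average each retained token with the tokens merged into it.
    merged /= counts[:, None]
    return merged, kept_idx
```

Under these assumptions, calling `prompt_guided_merge(tokens, prompt, keep=N // 2)` before an attention layer would halve the number of tokens that layer attends over. In a multimodal setting, the same call could be made once per input modality against a shared prompt embedding.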

Published

2025-04-11

How to Cite

Hong, C.-Y., & Liu, T.-L. (2025). Multimodal Promptable Token Merging for Diffusion Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(16), 17231–17239. https://doi.org/10.1609/aaai.v39i16.33894

Section

AAAI Technical Track on Machine Learning II