Multimodal Promptable Token Merging for Diffusion Models

Cheng-Yao Hong; Tyng-Luh Liu

doi:10.1609/aaai.v39i16.33894

Authors

Cheng-Yao Hong, Academia Sinica
Tyng-Luh Liu, Academia Sinica

DOI: https://doi.org/10.1609/aaai.v39i16.33894

Abstract

Token compression techniques, such as token merging and pruning, are essential for alleviating the substantial computational burden caused by the proliferation of tokens within attention mechanisms. However, current methods often rely on token-to-token distances or similarity metrics to evaluate token importance, which is inadequate in the context of modern promptable designs and frameworks that are gaining prominence. To address this limitation, we introduce a novel and effective merging strategy called “Multimodal Promptable Token Merging” (MPTM). The proposed method leverages a multimodal, prompt-centric methodology, assessing the proximity between tokens of each input modality and the multimodal prompt to efficiently eliminate redundant tokens while preserving those rich in information. Extensive experiments demonstrate that MPTM significantly reduces computational costs without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% in object detection tasks. Additionally, when applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility.
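The prompt-centric idea in the abstract can be illustrated with a small sketch. The abstract does not specify the actual MPTM algorithm, so everything here is an assumption made for illustration: the function names `cosine` and `prompt_guided_merge`, the use of cosine similarity as the proximity measure, a single prompt vector standing in for the multimodal prompt, and the rule of averaging each discarded token into its most similar kept token.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def prompt_guided_merge(tokens, prompt, keep):
    """Hypothetical sketch, not the paper's method: keep the `keep` tokens
    closest to the prompt, then average every other token into its most
    similar kept token."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    # Score each token by its proximity to the prompt embedding.
    scores = [cosine(t, prompt) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])  # preserve the original token order
    dropped = order[keep:]
    # Running sums and counts used to average merged tokens.
    sums = {i: list(tokens[i]) for i in kept}
    counts = {i: 1 for i in kept}
    for j in dropped:
        # Assign the dropped token to the most similar kept token.
        target = max(kept, key=lambda i: cosine(tokens[j], tokens[i]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[j])]
        counts[target] += 1
    return [[s / counts[i] for s in sums[i]] for i in kept]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    prompt = [1.0, 0.0]
    print(prompt_guided_merge(tokens, prompt, keep=2))
```

The contrast with distance-only merging is in the scoring step: tokens are ranked against the prompt rather than against each other, so the retained set depends on what the prompt asks for. The sequence shrinks from `len(tokens)` to `keep` entries before any attention computation would run.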
