ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion

Libo Zhao; Xiaoli Zhang; Zeyu Wang

doi:10.1609/aaai.v40i16.38321

Authors

Libo Zhao, Jilin University
Xiaoli Zhang, Jilin University
Zeyu Wang, Dalian Minzu University

DOI:

https://doi.org/10.1609/aaai.v40i16.38321

Abstract

Infrared and Visible Image Fusion (IVIF) produces enhanced images by fusing complementary visual information. However, most existing methods generate fixed outputs and cannot adapt to user-specific requirements. Recent text-guided approaches offer partial control but are limited to the global or semantic level and lack instance-level control. This limitation arises from two challenges: first, the lack of datasets that directly link textual instructions with corresponding spatial annotations, and second, the use of coarse cross-modal alignment methods that struggle to match textual instructions precisely with visual features. To overcome these challenges, we propose ControlFuse, a controllable IVIF framework that enables multi-granularity fusion across global, semantic, and instance levels, guided by user instructions. First, we construct an automated multi-granularity dataset that provides explicit text-mask correspondences at these three levels. Second, inspired by manifold geometry, we design a Multimodal Feature Interaction Module (MFIM) comprising a Feature Manifold Converter (FMC) and Curvature-Guided Interaction (CGI). FMC projects textual and visual features into a unified manifold space, while CGI uses manifold curvature as a geometric cue to refine cross-modal alignment. Extensive experiments show that ControlFuse outperforms state-of-the-art methods in robustness and flexibility.
