ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion
DOI:
https://doi.org/10.1609/aaai.v40i16.38321
Abstract
Infrared and Visible Image Fusion (IVIF) produces enhanced images by fusing complementary visual information. However, most existing methods generate fixed outputs and cannot flexibly adapt to user-specific requirements. Recent text-guided approaches offer partial control but are limited to global or semantic levels, lacking instance-level control. This limitation arises from two challenges: first, the lack of datasets that directly link textual instructions with corresponding spatial annotations, and second, the use of coarse cross-modal alignment methods that struggle to precisely match textual instructions with visual features. To overcome these challenges, we propose ControlFuse, a controllable IVIF framework enabling multi-granularity fusion across global, semantic, and instance levels, guided by user instructions. First, we construct an automated multi-granularity dataset that provides explicit textual-mask correspondences at these three levels. Second, inspired by manifold geometry, we design a Multimodal Feature Interaction Module (MFIM) comprising a Feature Manifold Converter (FMC) and Curvature-Guided Interaction (CGI). FMC projects textual and visual features into a unified manifold space, while CGI leverages manifold curvature as a geometric cue to refine cross-modal alignment. Extensive experiments show that ControlFuse outperforms state-of-the-art methods in robustness and flexibility.
Published
2026-03-14
How to Cite
Zhao, L., Zhang, X., & Wang, Z. (2026). ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13199–13207. https://doi.org/10.1609/aaai.v40i16.38321
Section
AAAI Technical Track on Computer Vision XIII