UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Authors

  • Wei Zhang Nanjing University of Science and Technology
  • Yeying Jin Tencent
  • Xin Li University of Science and Technology of China
  • Yan Zhang ByteDance Inc.
  • Xiaofeng Cong Southeast University
  • Cong Wang University of California, San Francisco
  • Fengcai Qiao National University of Defense Technology
  • Zhichao Lian Nanjing University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i15.38279

Abstract

Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) a semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent, explicit semantic guidance for the generative process, thereby narrowing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.
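To make the MGSA idea concrete, the sketch below shows one plausible reading of the abstract: a set of learnable queries cross-attends to multimodal token features from an MLLM to produce guidance embeddings, which are aligned to a reference embedding via an alignment loss. All names, dimensions, the number of queries, and the cosine-similarity loss form are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGSASketch(nn.Module):
    """Hypothetical sketch of an MLLM-Guided Semantic Alignment Module.

    Learnable queries cross-attend to MLLM token features and emit a
    fixed-size set of semantic guidance vectors for the generator.
    Dimensions and module choices here are assumptions for illustration.
    """

    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable query embeddings, one row per guidance token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, mllm_tokens: torch.Tensor) -> torch.Tensor:
        # mllm_tokens: (B, L, dim) multimodal features from the MLLM.
        q = self.queries.unsqueeze(0).expand(mllm_tokens.size(0), -1, -1)
        guidance, _ = self.cross_attn(q, mllm_tokens, mllm_tokens)
        return guidance  # (B, num_queries, dim) semantic guidance tokens

def semantic_alignment_loss(guidance: torch.Tensor,
                            target_emb: torch.Tensor) -> torch.Tensor:
    """Assumed cosine-similarity alignment between pooled guidance and a
    target reference-image embedding (a simplified stand-in for the
    paper's semantic alignment loss)."""
    g = F.normalize(guidance.mean(dim=1), dim=-1)  # pool queries -> (B, dim)
    t = F.normalize(target_emb, dim=-1)
    return (1.0 - (g * t).sum(dim=-1)).mean()
```

In this reading, the guidance tokens would condition the try-on generator (e.g. via cross-attention), while the alignment loss pulls them toward the reference image's semantics during training.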

Published

2026-03-14

How to Cite

Zhang, W., Jin, Y., Li, X., Zhang, Y., Cong, X., Wang, C., … Lian, Z. (2026). UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12816–12824. https://doi.org/10.1609/aaai.v40i15.38279

Section

AAAI Technical Track on Computer Vision XII