Visual Bridge: Universal Visual Perception Representations Generating

Authors

  • Yilin Gao Shanghai University
  • Shuguang Dou Huawei Technologies Co., Ltd.
  • Junzhou Li University of Science and Technology of China
  • Zhiheng Yu Huawei Technologies Co., Ltd.
  • Yin Li Huawei Technologies Co., Ltd.
  • Dongsheng Jiang Huawei Technologies Co., Ltd.
  • Shugong Xu Xi'an Jiaotong-Liverpool University

DOI:

https://doi.org/10.1609/aaai.v40i25.39268

Abstract

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
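The flow-matching formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common straight-line (linear interpolation) path between a source and a target, which the abstract does not confirm as the authors' choice. The array shapes, the `task_id` argument, and the `oracle` velocity function are hypothetical placeholders; the paper's multi-scale circular task embedding and its learned network are not reproduced here.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Constant velocity of the straight path; the regression target."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_integrate(velocity_fn, x0, task_id, num_steps=10):
    """Integrate dx/dt = v(x, t, task_id) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, task_id)
    return x

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(4, 8))  # hypothetical: 4 patch tokens, dimension 8
task_repr = rng.normal(size=(4, 8))     # hypothetical task-specific representation

# Oracle field for illustration only: the true constant velocity of the path.
# A trained model would replace this with a network conditioned on the task.
def oracle(x, t, task_id):
    return task_repr - patch_tokens

result = euler_integrate(oracle, patch_tokens, task_id=0)
```

With the oracle field, Euler integration carries the patch tokens onto the target representation, which is the behavior a learned velocity field is trained to approximate by minimizing `flow_matching_loss` at sampled times `t`.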

Published

2026-03-14

How to Cite

Gao, Y., Dou, S., Li, J., Yu, Z., Li, Y., Jiang, D., & Xu, S. (2026). Visual Bridge: Universal Visual Perception Representations Generating. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 21234–21242. https://doi.org/10.1609/aaai.v40i25.39268

Section

AAAI Technical Track on Machine Learning II