Stop Mixing Things Up! BISCUIT Teaches Vision-Language Models to Learn New Concepts from Images on the Spot

Authors

  • Jiahua Bao — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Siyao Cheng — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Jiaxing Du — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Yuhang Jia — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Boyang Niu — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Zeming Lang — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Changjiang He — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Hao Zhang — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Jie Liu — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China

DOI:

https://doi.org/10.1609/aaai.v40i4.37226

Abstract

Vision-Language Models (VLMs) have achieved impressive performance across various tasks, but they often struggle to apply newly introduced visual concepts during inference. A common failure pattern is what we call Mixing Things Up: VLMs frequently confuse concept names, resulting in vague descriptions and a failure to ground the concept correctly. Existing approaches mainly address person-related concepts through text prompts or tokenizer modifications. However, VLMs still miss or misinterpret untrained visual concepts, underscoring the need to learn new concepts directly from visual input without relying on prior textual injection. To overcome these limitations, we propose BISCUIT (Basis-aligned Inference through Structured Concept Unification and Identification-aware Tuning), a two-step training method. Step I introduces a dual-stream structure-aware vision encoder that fuses RGB and edge-based embeddings within a shared basis space to enhance concept recognition. Step II improves generation quality through identification-aware tuning, which encourages alignment between the generated text and the newly introduced visual concepts. Because existing methods mainly focus on person concepts and lack comprehensive evaluation across diverse visual categories, we further propose a benchmark, BiscuitVQA, to evaluate VLMs' ability to recognize and apply novel image-introduced concepts across diverse concept types and task types, including real people, cartoons, animals, and symbolic content. We apply BISCUIT to LLaVA-1.5 and Qwen2.5-VL, achieving competitive results among open-source models and narrowing the gap to Gemini-2.5 and GPT-4o. Notably, BISCUIT maintains strong generalization, showing minimal degradation on other downstream tasks.

Published

2026-03-14

How to Cite

Bao, J., Cheng, S., Du, J., Jia, Y., Niu, B., Lang, Z., He, C., Zhang, H., & Liu, J. (2026). Stop Mixing Things Up! BISCUIT Teaches Vision-Language Models to Learn New Concepts from Images on the Spot. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2408-2416. https://doi.org/10.1609/aaai.v40i4.37226

Section

AAAI Technical Track on Computer Vision I