Stop Mixing Things Up! BISCUIT Teaches Vision-Language Models to Learn New Concepts from Images on the Spot

Authors

  • Jiahua Bao — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Siyao Cheng — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Jiaxing Du — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Yuhang Jia — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Boyang Niu — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Zeming Lang — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Changjiang He — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China
  • Hao Zhang — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China
  • Jie Liu — Research Center of Ubiquitous Computing and Intelligent Systems, Harbin Institute of Technology, China; National Key Laboratory of Smart Farming Technology and Systems, China; China Mobile 5G Institute, China

DOI:

https://doi.org/10.1609/aaai.v40i4.37226

Abstract

Vision-Language Models (VLMs) have achieved impressive performance across various tasks, but they often struggle to apply newly introduced visual concepts during inference. A common failure pattern is what we call Mixing Things Up: VLMs frequently confuse concept names, resulting in vague descriptions and a failure to ground the concept correctly. Existing approaches mainly address person-related concepts through text prompts or tokenizer modifications. However, VLMs still miss or misinterpret untrained visual concepts, underscoring the need to learn new concepts directly from visual input without relying on prior textual injection. To overcome these limitations, we propose BISCUIT (Basis-aligned Inference through Structured Concept Unification and Identification-aware Tuning), a two-step training method. Step I introduces a dual-stream structure-aware vision encoder that fuses RGB and edge-based embeddings within a shared basis space to enhance concept recognition. Step II improves generation quality through identification-aware tuning, which encourages alignment between the generated text and the newly introduced visual concepts. Because existing methods mainly focus on person concepts and lack comprehensive evaluation across diverse visual categories, we further propose a benchmark, BiscuitVQA, to evaluate VLMs' ability to recognize and apply novel image-introduced concepts across diverse concept types and task types, including real people, cartoons, animals, and symbolic content. We apply BISCUIT to LLaVA-1.5 and Qwen2.5-VL, achieving competitive results among open-source models and narrowing the gap to Gemini-2.5 and GPT-4o. Notably, BISCUIT maintains strong generalization, showing minimal degradation on other downstream tasks.

Published

2026-03-14

How to Cite

Bao, J., Cheng, S., Du, J., Jia, Y., Niu, B., Lang, Z., He, C., Zhang, H., & Liu, J. (2026). Stop Mixing Things Up! BISCUIT Teaches Vision-Language Models to Learn New Concepts from Images on the Spot. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2408-2416. https://doi.org/10.1609/aaai.v40i4.37226

Section

AAAI Technical Track on Computer Vision I