DigitalLLaVA: Incorporating Digital Cognition Capability for Physical World Comprehension in Multimodal LLMs

Authors

  • Shiyu Li School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China
  • Pengxu Wei Pengcheng Laboratory, Shenzhen, China Sun Yat-Sen University, Guangzhou, China
  • Pengchong Qiao School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Chang Liu Department of Automation and BNRist, Tsinghua University, Beijing, China
  • Jie Chen School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China

DOI:

https://doi.org/10.1609/aaai.v39i5.32522

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable cognitive capabilities in various cross-modal tasks. However, existing MLLMs struggle with tasks that require physical digital cognition, such as accurately reading an electric meter or pressure gauge. This limitation significantly reduces their effectiveness in practical applications like industrial monitoring and home energy management, where installing digital sensors is not feasible. For humans, physical digits are artificially defined quantities presented on specific carriers, which require training to recognize. As existing MLLMs are pre-trained only in the manner of object recognition, they fail to comprehend the relationship between digital carriers and their readings. To this end, referring to human behavior, we propose a novel DigitalLLaVA method to explicitly inject digital cognitive abilities into MLLMs in a two-step manner. In the first step, to improve the MLLM's understanding of physical digit carriers, we propose a digit carrier mapping method. This step utilizes object-level text-image pairs to enhance the model's comprehension of objects containing physical digits. For the second step, unlike previous methods that rely on sequential digit prediction or digit regression, we propose a 32-bit floating-point simulation approach that treats digit prediction as a whole. Using digit-level text-image pairs, we train three float heads to predict 32-bit floating-point numbers via 0/1 binary classification. This step significantly reduces the search space, making the prediction process more robust and straightforward. Simple yet effective, our method can identify very precise metrics (i.e., accurate to ±0.001) and provide floating-point results, showing its applicability in digital carrier domains.
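The bit-level prediction idea from the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes three heads emitting per-bit probabilities for the IEEE 754 sign (1 bit), exponent (8 bits), and mantissa (23 bits) fields, which are thresholded to 0/1 decisions and reassembled into a single-precision float. The function names (`bits_to_float32`, `predict_float`) and the `head_probs` layout are hypothetical.

```python
import struct

def bits_to_float32(sign_bit, exponent_bits, mantissa_bits):
    """Pack 1 sign bit, 8 exponent bits, and 23 mantissa bits
    (most-significant bit first) into an IEEE 754 float32 value."""
    assert len(exponent_bits) == 8 and len(mantissa_bits) == 23
    bits = [sign_bit] + list(exponent_bits) + list(mantissa_bits)
    word = 0
    for b in bits:  # accumulate MSB-first into a 32-bit word
        word = (word << 1) | (b & 1)
    # Reinterpret the 32-bit pattern as a single-precision float
    return struct.unpack(">f", struct.pack(">I", word))[0]

def predict_float(head_probs, threshold=0.5):
    """head_probs: dict of per-bit probabilities from three hypothetical
    heads ('sign', 'exponent', 'mantissa'); thresholding each probability
    yields the 0/1 binary-classification decisions described in the paper."""
    to_bits = lambda ps: [1 if p > threshold else 0 for p in ps]
    return bits_to_float32(
        to_bits(head_probs["sign"])[0],
        to_bits(head_probs["exponent"]),
        to_bits(head_probs["mantissa"]),
    )
```

Because the 32 bits jointly determine one number, a single prediction covers sign, magnitude, and fractional precision at once, rather than emitting digits sequentially.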

Published

2025-04-11

How to Cite

Li, S., Wei, P., Qiao, P., Liu, C., & Chen, J. (2025). DigitalLLaVA: Incorporating Digital Cognition Capability for Physical World Comprehension in Multimodal LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4932–4940. https://doi.org/10.1609/aaai.v39i5.32522

Section

AAAI Technical Track on Computer Vision IV