DigitalLLaVA: Incorporating Digital Cognition Capability for Physical World Comprehension in Multimodal LLMs

Authors

  • Shiyu Li School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China
  • Pengxu Wei Pengcheng Laboratory, Shenzhen, China Sun Yat-Sen University, Guangzhou, China
  • Pengchong Qiao School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China
  • Chang Liu Department of Automation and BNRist, Tsinghua University, Beijing, China
  • Jie Chen School of Electronic and Computer Engineering, Peking University, Shenzhen, China Pengcheng Laboratory, Shenzhen, China AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China

DOI:

https://doi.org/10.1609/aaai.v39i5.32522

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable cognitive capabilities in various cross-modal tasks. However, existing MLLMs struggle with tasks that require physical digital cognition, such as accurately reading an electric meter or pressure gauge. This limitation significantly reduces their effectiveness in practical applications like industrial monitoring and home energy management, where installing digital sensors is not feasible. For humans, physical digits are artificially defined quantities presented on specific carriers, which require training to recognize. As existing MLLMs are pre-trained only in the manner of object recognition, they fail to comprehend the relationship between digital carriers and their readings. To this end, referring to human behavior, we propose a novel DigitalLLaVA method to explicitly inject digital cognitive abilities into MLLMs in a two-step manner. In the first step, to improve the MLLM's understanding of physical digit carriers, we propose a digit carrier mapping method. This step utilizes object-level text-image pairs to enhance the model's comprehension of objects containing physical digits. For the second step, unlike previous methods that rely on sequential digit prediction or digit regression, we propose a 32-bit floating-point simulation approach that treats digit prediction as a whole. Using digit-level text-image pairs, we train three float heads to predict 32-bit floating-point numbers via 0/1 binary classification. This step significantly reduces the search space, making the prediction process more robust and straightforward. Simple yet effective, our method can identify very precise metrics (i.e., accurate to ±0.001) and provide floating-point results, showing its applicability in digital carrier domains.
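The bit-level prediction idea from the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes three heads emitting per-bit probabilities for the IEEE 754 sign (1 bit), exponent (8 bits), and mantissa (23 bits) fields, which are thresholded to 0/1 decisions and reassembled into a single-precision float. The function names (`bits_to_float32`, `predict_float`) and the `head_probs` layout are hypothetical.

```python
import struct

def bits_to_float32(sign_bit, exponent_bits, mantissa_bits):
    """Pack 1 sign bit, 8 exponent bits, and 23 mantissa bits
    (most-significant bit first) into an IEEE 754 float32 value."""
    assert len(exponent_bits) == 8 and len(mantissa_bits) == 23
    bits = [sign_bit] + list(exponent_bits) + list(mantissa_bits)
    word = 0
    for b in bits:  # accumulate MSB-first into a 32-bit word
        word = (word << 1) | (b & 1)
    # Reinterpret the 32-bit pattern as a single-precision float
    return struct.unpack(">f", struct.pack(">I", word))[0]

def predict_float(head_probs, threshold=0.5):
    """head_probs: dict of per-bit probabilities from three hypothetical
    heads ('sign', 'exponent', 'mantissa'); thresholding each probability
    yields the 0/1 binary-classification decisions described in the paper."""
    to_bits = lambda ps: [1 if p > threshold else 0 for p in ps]
    return bits_to_float32(
        to_bits(head_probs["sign"])[0],
        to_bits(head_probs["exponent"]),
        to_bits(head_probs["mantissa"]),
    )
```

Because the 32 bits jointly determine one number, a single prediction covers sign, magnitude, and fractional precision at once, rather than emitting digits sequentially.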

Published

2025-04-11

How to Cite

Li, S., Wei, P., Qiao, P., Liu, C., & Chen, J. (2025). DigitalLLaVA: Incorporating Digital Cognition Capability for Physical World Comprehension in Multimodal LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4932–4940. https://doi.org/10.1609/aaai.v39i5.32522

Section

AAAI Technical Track on Computer Vision IV