Union Is Strength! Unite the Power of LLMs and MLLMs for Chart Question Answering
DOI:
https://doi.org/10.1609/aaai.v39i5.32584
Abstract
Chart Question Answering (CQA) requires models to perform both chart perception and reasoning. Recent studies driven by Large Language Models (LLMs) have dominated CQA: some employ more cognitively capable LLMs to reason indirectly over transformed charts, i.e., tables, while others perceive charts directly using Multimodal Large Language Models (MLLMs) with a wider perceptual range. Yet these approaches often hit bottlenecks due to the limited receptive field of LLMs and the fragile complex reasoning of some MLLMs. To unite the strengths of LLMs and MLLMs so that each complements the other's limitations, we propose Synergy, a framework that unites the power of both models for CQA. Synergy first unites the chart with a table as an augmented perceptual signal. Next, it unites LLMs and MLLMs, scheduling the former to decompose a question into subquestions and the latter to answer them by perceiving the chart. Lastly, it directs the LLM to summarize the subquestion-answer pairs and refine the final answer. Extensive experimental results on the popular ChartQA and PlotQA benchmarks reveal that, with the power of union, Synergy outperforms strong competitors and achieves superior boosts over naive MLLMs by uniting them with a smaller LLM.
Published
2025-04-11
How to Cite
Liu, J., Li, L., Rao, S., Gao, X., Guan, W., Li, B., & Ma, C. (2025). Union Is Strength! Unite the Power of LLMs and MLLMs for Chart Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5487-5495. https://doi.org/10.1609/aaai.v39i5.32584
Issue
Section
AAAI Technical Track on Computer Vision IV