Union Is Strength! Unite the Power of LLMs and MLLMs for Chart Question Answering
DOI:
https://doi.org/10.1609/aaai.v39i5.32584
Abstract
Chart Question Answering (CQA) requires models to perform both chart perception and reasoning. Recent studies driven by Large Language Models (LLMs) have dominated CQA: some employ more cognitively capable LLMs to reason indirectly over transformed charts, i.e., tables, while others perceive charts directly using Multimodal Large Language Models (MLLMs) with a wider perceptual range. Yet these approaches often hit bottlenecks due to the limited receptive field of LLMs and the fragile complex reasoning of some MLLMs. To unite the strengths of LLMs and MLLMs so that each complements the other's limitations, we propose Synergy, a framework that unites the power of both models for CQA. Synergy first unites the chart with a table as an augmented perceptual signal. Next, it unites LLMs and MLLMs, scheduling the former to decompose a question into subquestions and the latter to answer them by perceiving the chart. Lastly, it directs the LLM to summarize the subquestion-answer pairs and refine the final answer. Extensive experimental results on the popular ChartQA and PlotQA benchmarks reveal that, with the power of union, Synergy outperforms strong competitors and achieves superior boosts over naive MLLMs by uniting them with a smaller LLM.
Published
2025-04-11
How to Cite
Liu, J., Li, L., Rao, S., Gao, X., Guan, W., Li, B., & Ma, C. (2025). Union Is Strength! Unite the Power of LLMs and MLLMs for Chart Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5487-5495. https://doi.org/10.1609/aaai.v39i5.32584
Issue
Section
AAAI Technical Track on Computer Vision IV