Easy for Children, Hard for AI: The Limits of Multimodal LLMs in Early Childhood Learning

Authors

  • Jingping Liu, Sun Yat-sen University
  • Xueyan Wu, East China University of Science and Technology
  • Hanxuan Chen, Hunan University
  • Ziyan Liu, East China University of Science and Technology
  • Zhangquan Chen, Tsinghua University
  • Ronghao Chen, Peking University
  • Huacan Wang, University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i38.40479

Abstract

Early childhood is a critical stage for cognitive development, involving core skills such as visual perception and reasoning. While multimodal large language models (MLLMs) have made rapid progress on general-purpose tasks, their ability to support early education remains underexplored. Existing research on child-related AI centers largely on modeling language, emotion, or behavior, with limited attention to evaluating cognitive tasks relevant to early learning. To address this gap, we propose ChildBench, a multimodal benchmark designed to assess models on tasks inspired by early childhood cognitive development. Its ten tasks span five key domains: spatial reasoning, visual reasoning, visual discrimination, counting skills, and visual tracking. The benchmark includes 4,890 carefully constructed images and 5,346 manually annotated samples, ensuring both diversity and age-appropriate content. We evaluate a range of state-of-the-art (SoTA) open-source and closed-source MLLMs, including GPT-4o, Gemini, and Qwen2.5-VL, on ChildBench. Despite strong performance on other benchmarks, the best 7B-parameter model with LoRA tuning achieves only 52.01% accuracy, far below the 96% achieved by 5-year-old children. These results reveal critical limitations in fine-grained perception and reasoning. We further analyze failure cases and discuss directions for future model development.

Published

2026-03-14

How to Cite

Liu, J., Wu, X., Chen, H., Liu, Z., Chen, Z., Chen, R., & Wang, H. (2026). Easy for Children, Hard for AI: The Limits of Multimodal LLMs in Early Childhood Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32078–32086. https://doi.org/10.1609/aaai.v40i38.40479

Section

AAAI Technical Track on Natural Language Processing III