MChIRC: A Multimodal Benchmark for Chinese Idiom Reading Comprehension

Authors

  • Tongguan Wang: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Mingmin Wu: College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Guixin Su: College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Dongyu Su: College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Yuxue Hu: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Zhongqiang Huang: College of Informatics, Huazhong Agricultural University, Wuhan, China
  • Ying Sha: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China

DOI:

https://doi.org/10.1609/aaai.v39i24.34728

Abstract

Performance on many natural language processing tasks has improved greatly with the emergence of large language models. However, there is still much room for improvement in understanding specific linguistic phenomena such as Chinese idioms, which usually consist of four characters. Chinese idioms are difficult to understand because of the semantic gap between their literal and actual meanings. Researchers have proposed the Chinese idiom reading comprehension task to examine the ability of large language models to represent and understand Chinese idioms. The task requires choosing the correct Chinese idiom from a list of candidates to complete a sentence. Current research focuses mainly on text-based idiom comprehension. Nevertheless, many idiom application scenarios combine images and text, and we believe that the corresponding images help the model understand the idioms. To address this gap, we first construct a large-scale Multimodal Chinese Idiom Reading Comprehension dataset (MChIRC), which contains 44,433 image-text pairs covering 2,926 idioms. Then, we propose a Dual-Contrastive Idiom Graph Network (DCIGN), which employs a dual-contrastive learning module to align the text and image features corresponding to the same Chinese idiom at both coarse and fine levels, while using a graph structure to capture the semantic relationships between idiom candidates. Finally, we use a cross-attention module to fuse the multimodal features with the graph features of the candidate idioms to predict the correct answer. A variety of experiments demonstrate the authoritativeness of MChIRC and the effectiveness of DCIGN, providing a new benchmark for the multimodal Chinese idiom reading comprehension task.
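
For readers unfamiliar with contrastive alignment, the following is a minimal sketch of one common way to align paired text and image features: a symmetric InfoNCE-style loss over a batch. It is an illustration only and not the DCIGN implementation. The abstract does not specify the loss, so the function names, the temperature value, and the use of a single pooled vector per idiom are all assumptions, and fine-level alignment is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical helper, not from the paper).

    Row i of text_feats and row i of image_feats are assumed to describe the
    same idiom; every other row in the batch is treated as a negative.
    """
    if not text_feats or len(text_feats) != len(image_feats):
        raise ValueError("need equally sized, non-empty feature batches")
    text = [l2_normalize(v) for v in text_feats]
    image = [l2_normalize(v) for v in image_feats]
    n = len(text)
    # sim[i][j]: cosine similarity of text i and image j, divided by temperature
    sim = [
        [sum(a * b for a, b in zip(text[i], image[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]  # transpose
    # Text-to-image direction: each text should pick out its own image.
    text_to_image = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Image-to-text direction: each image should pick out its own text.
    image_to_text = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_image + image_to_text)


if __name__ == "__main__":
    # Toy features: matched pairs should score a lower loss than swapped pairs.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(texts, [[1.0, 0.0], [0.0, 1.0]]))
    print(symmetric_contrastive_loss(texts, [[0.0, 1.0], [1.0, 0.0]]))
```

Minimizing a loss of this kind pulls the text and image features of the same idiom together and pushes those of different idioms apart, which is the general idea behind aligning modalities before fusing them.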

Published

2025-04-11

How to Cite

Wang, T., Wu, M., Su, G., Su, D., Hu, Y., Huang, Z., & Sha, Y. (2025). MChIRC: A Multimodal Benchmark for Chinese Idiom Reading Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25398–25406. https://doi.org/10.1609/aaai.v39i24.34728

Section

AAAI Technical Track on Natural Language Processing III