MChIRC: A Multimodal Benchmark for Chinese Idiom Reading Comprehension

Tongguan Wang; Mingmin Wu; Guixin Su; Dongyu Su; Yuxue Hu; Zhongqiang Huang; Ying Sha

doi:10.1609/aaai.v39i24.34728

Authors

Tongguan Wang: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China
Mingmin Wu: College of Informatics, Huazhong Agricultural University, Wuhan, China
Guixin Su: College of Informatics, Huazhong Agricultural University, Wuhan, China
Dongyu Su: College of Informatics, Huazhong Agricultural University, Wuhan, China
Yuxue Hu: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China
Zhongqiang Huang: College of Informatics, Huazhong Agricultural University, Wuhan, China
Ying Sha: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; College of Informatics, Huazhong Agricultural University, Wuhan, China

Abstract

Performance on many natural language processing tasks has improved greatly with the emergence of large language models. However, there is still much room for improvement in understanding certain linguistic phenomena, such as Chinese idioms, which are usually composed of four characters. Chinese idioms are difficult to understand because of the semantic gap between their literal and actual meanings. Researchers have proposed the Chinese idiom reading comprehension task to examine the ability of large language models to represent and understand Chinese idioms. The task requires choosing the correct Chinese idiom from a list of candidates to complete a sentence. Current research focuses mainly on text-based idiom comprehension. Nevertheless, many idiom application scenarios combine images and text, and we believe that the corresponding images help the model understand the idioms. To address these problems, we first construct a large-scale Multimodal Chinese Idiom Reading Comprehension dataset (MChIRC), which contains a total of 44,433 image-text pairs covering 2,926 idioms. Then, we propose a Dual-Contrastive Idiom Graph Network (DCIGN), which employs a dual-contrastive learning module to align the text and image features corresponding to the same Chinese idiom at both coarse and fine levels, while using a graph structure to capture the semantic relationships between idiom candidates. Finally, we use a cross-attention module to fuse the multimodal features with the graph features of the candidate idioms to predict the correct answer. A variety of experiments demonstrate the authoritativeness of MChIRC and the effectiveness of DCIGN, providing a new benchmark for the multimodal Chinese idiom reading comprehension task.
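The task format described in the abstract (a sentence with a missing idiom, an accompanying image, and a list of candidate idioms) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the MChIRC release: the `[BLANK]` placeholder, the field names, the image paths, the two example sentences, and the trivial baseline. It shows only how one item might be represented and how accuracy over candidate choices could be computed, not how DCIGN makes its predictions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical placeholder marking where the idiom belongs in the sentence.
BLANK = "[BLANK]"


@dataclass(frozen=True)
class IdiomClozeItem:
    """One illustrative multimodal cloze item (field names are assumptions)."""
    sentence: str               # text with exactly one BLANK placeholder
    image_path: str             # path to the paired image (not loaded here)
    candidates: Sequence[str]   # candidate idioms to choose from
    answer_index: int           # index of the correct candidate

    def __post_init__(self) -> None:
        if self.sentence.count(BLANK) != 1:
            raise ValueError("sentence must contain exactly one placeholder")
        if not 0 <= self.answer_index < len(self.candidates):
            raise ValueError("answer_index is out of range")

    def fill(self, choice: int) -> str:
        """Return the sentence with the chosen candidate inserted."""
        return self.sentence.replace(BLANK, self.candidates[choice], 1)


# A predictor maps an item to the index of the candidate it selects.
Predictor = Callable[[IdiomClozeItem], int]


def accuracy(items: Sequence[IdiomClozeItem], predict: Predictor) -> float:
    """Fraction of items for which the predictor picks the correct idiom."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


def first_candidate_baseline(item: IdiomClozeItem) -> int:
    """Trivial stand-in for a model: always choose the first candidate."""
    return 0


# Made-up examples for illustration only; they are not from the dataset.
items = [
    IdiomClozeItem(
        sentence="这份报告已经很完整了，再加一段总结反而是[BLANK]。",
        image_path="images/example_0.jpg",
        candidates=("画蛇添足", "守株待兔", "亡羊补牢", "掩耳盗铃"),
        answer_index=0,
    ),
    IdiomClozeItem(
        sentence="出了事故之后才检查设备，虽然晚了一点，但[BLANK]，总比不改好。",
        image_path="images/example_1.jpg",
        candidates=("守株待兔", "掩耳盗铃", "亡羊补牢", "画蛇添足"),
        answer_index=2,
    ),
]

print(items[0].fill(items[0].answer_index))
print(accuracy(items, first_candidate_baseline))
```

A real model would replace `first_candidate_baseline` with a function that scores each candidate using both the sentence and the image; the evaluation loop itself would stay the same.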
