MemeMatch: A Large-Scale Dual-Context Multimodal Dataset and Retrieval System for Internet Memes

Authors

  • Do Tri An Le, Wabash College
  • Donát Ákos Köller, Budapest University of Technology and Economics
  • Qixin Deng, Wabash College
  • Roland Molontay, Budapest University of Technology and Economics

DOI:

https://doi.org/10.1609/icwsm.v20i1.42785

Abstract

We introduce MemeMatch, a large-scale multimodal meme dataset and retrieval system that bridges meme collection, annotation, and analysis in a unified pipeline. The dataset contains nearly one million image-with-text memes from Reddit’s r/Memes (2018–2023) and ImgFlip, with rich metadata. Each meme is decomposed into two semantic contexts: local context, capturing the editable text payload (overlay text and title), and global context, capturing the underlying visual substrate or template semantics. Both are enriched with transformer-based annotations, including 14-dimensional sentiment and emotion vectors, BERTopic-derived topics, and zero-shot usage-intent labels. This structured representation supports exploratory analysis and context-aware retrieval by natural language or image query.
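
As an illustration of the dual-context representation described above, the following is a minimal sketch of how one annotated meme record and a context-weighted retrieval step might look. Every name here (ContextAnnotation, MemeRecord, retrieve, the embedding field, the local_weight parameter) is a hypothetical assumption made for illustration; the abstract does not specify the dataset schema or the retrieval scoring, and cosine similarity over dense vectors is swapped in as a common stand-in.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class ContextAnnotation:
    """Annotations for one semantic context (hypothetical field names)."""
    description: str        # text payload (local) or template semantics (global)
    affect: List[float]     # 14-dimensional sentiment and emotion vector
    topic: str              # topic label, e.g. derived with BERTopic
    intent: str             # zero-shot usage-intent label
    embedding: List[float]  # assumed dense vector used for retrieval

    def __post_init__(self) -> None:
        # The abstract states that the affect vectors are 14-dimensional.
        if len(self.affect) != 14:
            raise ValueError("affect vector must have 14 dimensions")


@dataclass
class MemeRecord:
    """One meme split into local and global contexts (hypothetical schema)."""
    meme_id: str
    source: str                # e.g. "reddit" or "imgflip"
    local: ContextAnnotation   # overlay text and title
    global_: ContextAnnotation # underlying template or visual substrate


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: List[float], records: List[MemeRecord],
             local_weight: float = 0.5, top_k: int = 3) -> List[MemeRecord]:
    """Rank records by a weighted mix of local and global similarity.

    This scoring rule is an assumption, not the system's documented method.
    """
    def score(record: MemeRecord) -> float:
        local_sim = cosine(query_embedding, record.local.embedding)
        global_sim = cosine(query_embedding, record.global_.embedding)
        return local_weight * local_sim + (1.0 - local_weight) * global_sim

    return sorted(records, key=score, reverse=True)[:top_k]
```

Under these assumptions, setting local_weight close to 1.0 would favor memes whose overlay text and title match the query, while a value close to 0.0 would favor memes built on a matching template.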

Published

2026-05-25

How to Cite

Le, D. T. A., Köller, D. Á., Deng, Q., & Molontay, R. (2026). MemeMatch: A Large-Scale Dual-Context Multimodal Dataset and Retrieval System for Internet Memes. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 2828–2838. https://doi.org/10.1609/icwsm.v20i1.42785