Large-Scale Multimodal Content Analysis and Annotation with Vision-Language Models

Authors

  • Harsha Nemani, International Institute of Information Technology, Hyderabad
  • Kiran Garimella, Rutgers University

DOI:

https://doi.org/10.1609/icwsm.v20i1.42718

Abstract

Most social media research still treats content as text, even though images and videos carry a large share of what people actually see and share. This gap is especially serious for WhatsApp in India, where political communication often travels as posters, memes, screenshots, short clips, and forwarded videos in Hindi and other low-resource languages. Text-only analysis can therefore miss a large fraction of political content and distort conclusions about which topics spread and which become viral. We present a large-scale multimodal analysis of WhatsApp data collected via data donation across roughly 100 locations in India during the 2024 Indian General Elections. Using recent vision-language models (VLMs) and large language models (LLMs), we build a multimodal processing toolkit that represents text, images, and videos in a shared framework. We use this toolkit to produce three results. First, we map topic prevalence and diffusion across modalities and show that topic distributions differ sharply by modality: key domains such as politics are disproportionately non-textual, so text-only pipelines systematically undercount them. Second, we evaluate whether LLM/VLM-based classifiers can replace human annotation for socially sensitive labels. Models perform well for broad categories such as political and news content, but they are substantially less reliable for misinformation, hate speech, and caste-coded hostility, with failures driven by class imbalance, implicit meaning, and cultural context. Third, we connect WhatsApp content to mainstream narratives by building a reference set from YouTube uploads of prime-time television news over the same period and measuring multimodal narrative overlap under format transformation (e.g., headlines as screenshots and segments as clips). The narratives that align with mainstream coverage are carried largely by non-text WhatsApp messages and are far more likely to be viral than baseline content. 
Together, these findings show that multimodal methods are necessary for valid measurement of political communication on encrypted platforms, and they provide an auditable toolkit for studying diffusion, harms, and cross-channel narrative dynamics in low-resource settings.

Published

2026-05-25

How to Cite

Nemani, H., & Garimella, K. (2026). Large-Scale Multimodal Content Analysis and Annotation with Vision-Language Models. Proceedings of the International AAAI Conference on Web and Social Media, 20(1), 1676–1699. https://doi.org/10.1609/icwsm.v20i1.42718