Perception of Visual Content: Differences Between Humans and Foundation Models

Authors

  • Nardiena A. Pratama, The University of Queensland, Brisbane, Australia
  • Shaoyang Fan, The University of Queensland, Brisbane, Australia
  • Gianluca Demartini, The University of Queensland, Brisbane, Australia

DOI:

https://doi.org/10.1609/icwsm.v19i1.35891

Abstract

Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML captions and human labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structures, but all annotations are consistent in how they perceive images across regions. ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings highlight that both human and machine annotations are important and that human-generated annotations are not yet replaceable.

Published

2025-06-07

How to Cite

Pratama, N. A., Fan, S., & Demartini, G. (2025). Perception of Visual Content: Differences Between Humans and Foundation Models. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 1616–1629. https://doi.org/10.1609/icwsm.v19i1.35891