Perception of Visual Content: Differences Between Humans and Foundation Models

Nardiena A. Pratama; Shaoyang Fan; Gianluca Demartini

doi:10.1609/icwsm.v19i1.35891

Authors

Nardiena A. Pratama, The University of Queensland, Brisbane, Australia
Shaoyang Fan, The University of Queensland, Brisbane, Australia
Gianluca Demartini, The University of Queensland, Brisbane, Australia

DOI:

https://doi.org/10.1609/icwsm.v19i1.35891

Abstract

Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML captions and human labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structures, but all annotations are consistent in how they perceive images across regions. ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings highlight that both human and machine annotations are important, and that human-generated annotations are not yet replaceable.
