Proceedings of the AAAI Conference on Human Computation and Crowdsourcing

Frontmatter

Gianluca Demartini — 2024-10-16

The Twelfth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2024) was held between October 16th and 19th in Pittsburgh, Pennsylvania and focused on the theme of Responsible Crowd Work for Better AI.

The frontmatter includes:

A Preface by the organizers
A list of the Best Papers
The Conference Committee
The list of Sponsors

An Exploratory Study of the Impact of Task Selection Strategies on Worker Performance in Crowdsourcing Microtasks

Huda Banuqitah — 2024-10-14

In microtask crowdsourcing systems like Amazon Mechanical Turk (AMT) and Appen Figure-Eight, workers often employ task selection strategies, completing sequences of tasks to maximize earnings. While previous literature has explored the effects of sequential tasks with varying complexities of the same type, there is a lack of knowledge on the consequences when multiple types of tasks with similar levels of difficulty are performed. This study examines the impact of sequences of three frequently employed task types, namely image classification, text classification, and surveys, on workers' engagement, accuracy, and perceived workloads. In addition, we analyze the influence of workers' personality traits on their strategies for selecting tasks. Our study, which involved 558 participants using AMT, found that engaging in sequences of distinct task types had a detrimental effect on classification task engagement and accuracy. It also increases the perceived task load and the worker's frustration. Nevertheless, the precise order of tasks does not significantly impact these results. Moreover, we showed a slight association between personality traits and the workers' selection strategy for the tasks. The results offered valuable knowledge for designing an efficient and inclusive crowdsourcing platform.

“Hi. I’m Molly, Your Virtual Interviewer!” Exploring the Impact of Race and Gender in AI-Powered Virtual Interview Experiences

Shreyan Biswas — 2024-10-14

The persistent issue of human bias in recruitment processes poses a formidable challenge to achieving equitable hiring practices, particularly when influenced by demographic characteristics such as gender and race of both interviewers and candidates. Asynchronous Video Interviews (AVIs), powered by Artificial Intelligence (AI), have emerged as innovative tools aimed at streamlining the application screening process while potentially mitigating the impact of such biases. These AI-driven platforms present an opportunity to customize the demographic features of virtual interviewers to align with diverse applicant preferences, promising a more objective and fair evaluation. Despite their growing adoption, the implications of virtual interviewer identities on candidate experiences within AVIs remain underexplored. We aim to address this research and empirical gap in this paper. To this end, we carried out a comprehensive between-subjects study involving 218 participants across six distinct experimental conditions, manipulating the gender and skin color of an AI virtual interviewer agent. Our empirical analysis revealed that while the demographic attributes of the agents did not significantly influence the overall experience of interviewees, variations in the interviewees' demographics, significantly altered their perception of the AVI process. Further, we uncovered that the mediating roles of Social Presence and Perception of the virtual interviewer critically affect interviewees' Perceptions of Fairness (+), Privacy (-), and Impression management (+).

Disclosures & Disclaimers: Investigating the Impact of Transparency Disclosures and Reliability Disclaimers on Learner-LLM Interactions

Jessica Y. Bo — 2024-10-14

Large Language Models (LLMs) are increasingly being used in educational settings to assist students with assignments and learning new concepts. For LLMs to be effective learning aids, students must develop an appropriate level of trust and reliance on these tools. Misaligned trust and reliance can lead to suboptimal learning outcomes and reduced LLM engagement. Despite their growing presence, there is a limited understanding of achieving optimal transparency and reliance calibration in the educational use of LLMs. In a 3x2 between-subjects experiment conducted in a university classroom setting, we tested the effect of two transparency disclosures (System Prompt and Goal Summary) and an in-conversation Reliability Disclaimer on a GPT-4-based chatbot tutor provided to students for an assignment. Our findings suggest that disclaimer messages included in the responses may effectively mitigate learners' overreliance on the LLM Tutor in the presence of incorrect advice. Disclosing System Prompt seemed to calibrate students’ confidence in their answers and reduce the occurrence of copy-pasting the exact assignment question to the LLM tutor. Student feedback indicated that they would like transparency framed in terms of performance-based metrics. Our work provides empirical insights on the design of transparency and reliability mechanisms for using LLMs in classrooms.

Atlas of AI Risks: Enhancing Public Understanding of AI Risks

Edyta Bogucka — 2024-10-14

The prevailing methodologies for visualizing AI risks have focused on technical issues such as data biases and model inaccuracies, often overlooking broader societal risks like job loss and surveillance. Moreover, these visualizations are typically designed for tech-savvy individuals, neglecting those with limited technical skills. To address these challenges, we propose the Atlas of AI Risks—a narrative-style tool designed to map the broad risks associated with various AI technologies in a way that is understandable to non-technical individuals as well. To both develop and evaluate this tool, we conducted two crowdsourcing studies. The first, involving 40 participants, identified the design requirements for visualizing AI risks for decision-making and guided the development of the Atlas. The second study, with 140 participants reflecting the US population in terms of age, sex, and ethnicity, assessed the usability and aesthetics of the Atlas to ensure it met those requirements. Using facial recognition technology as a case study, we found that the Atlas is more user-friendly than a baseline visualization, with a more classic and expressive aesthetic, and is more effective in presenting a balanced assessment of the risks and benefits of facial recognition. Finally, we discuss how our design choices make the Atlas adaptable for broader use, allowing it to generalize across the diverse range of technology applications represented in a database that reports various AI incidents.

Toward Context-Aware Privacy Enhancing Technologies for Online Self-Disclosure

Tingting Du — 2024-10-14

Voluntary sharing of personal information is at the heart of user engagement on social media and central to platforms' business models. From the users' perspective, so-called self-disclosure is closely connected with both privacy risks and social rewards. Prior work has studied contextual influences on self-disclosure, from platform affordances and interface design to user demographics and perceived social capital. Our work takes a mixed-methods approach to understand the contextual information which might be integrated in the development of privacy-enhancing technologies. Through observational study of several Reddit communities, we explore the ways in which topic of discussion, group norms, peer effects, and audience size are correlated with personal information sharing. We then build and test a prototype privacy-enhancing tool that exposes these contextual factors. Our work culminates in a browser extension that automatically detects instances of self-disclosure in Reddit posts at the time of posting and provides additional context to users before they post to support enhanced privacy decision-making. We share this prototype with social media users, solicit their feedback, and outline a path forward for privacy-enhancing technologies in this space.

Unveiling the Inter-Related Preferences of Crowdworkers: Implications for Personalized and Flexible Platform Design

Senjuti Dutta — 2024-10-14

Crowdsourcing platforms have traditionally been designed with a focus on workstation interfaces, restricting the flexibility that crowdworkers need. Recognizing this limitation and the need for more adaptable platforms, prior research has highlighted the diverse work processes of crowdworkers, influenced by factors such as device type and work stage. However, these variables have largely been studied in isolation. Our study is the first to explore the interconnected variabilities among these factors within the crowdwork community. Through a survey involving 150 Amazon Mechanical Turk crowdworkers, we uncovered three distinct groups characterized by their interrelated variabilities in key work aspects. The largest group exhibits a reliance on traditional devices, showing limited interest in integrating smartphones and tablets into their work routines. The second-largest group also primarily uses traditional devices but expresses a desire for supportive tools and scripts that enhance productivity across all devices, particularly smartphones and tablets. The smallest group actively uses and strongly prefers non-workstation devices, especially smartphones and tablets, for their crowdworking activities. We translate our findings into design insights for platform developers, discussing the implications for creating more personalized, flexible, and efficient crowdsourcing environments. Additionally, we highlight the unique work practices of these crowdworker clusters, offering a contrast to those of more traditional and established worker groups.

Estimating Contribution Quality in Online Deliberations Using a Large Language Model

Lodewijk Gelauff — 2024-10-14

Deliberation involves participants exchanging knowledge, arguments, and perspectives and has been shown to be effective at addressing polarization. The Stanford Online Deliberation Platform facilitates large-scale deliberations. It enables video-based online discussions on a structured agenda for small groups without requiring human moderators. This paper's data comes from various deliberation events, including one conducted in collaboration with Meta in 32 countries, and another with 38 post-secondary institutions in the US. Estimating the quality of contributions in a conversation is crucial for assessing feature and intervention impacts. Traditionally, this is done by human annotators, which is time-consuming and costly. We use a large language model (LLM) alongside eight human annotators to rate contributions based on justification, novelty, expansion of the conversation, and potential for further expansion, with scores ranging from 1 to 5. Annotators also provide brief justifications for their ratings. Using the average rating from other human annotators as the ground truth, we find the model outperforms individual human annotators. While pairs of human annotators outperform the model in rating justification and groups of three outperform it on all four metrics, the model remains competitive. We illustrate the usefulness of the automated quality rating by assessing the effect of nudges on the quality of deliberation. We first observe that individual nudges after prolonged inactivity are highly effective, increasing the likelihood of the individual requesting to speak in the next 30 seconds by 65%. Using our automated quality estimation, we show that the quality ratings for statements prompted by nudging are similar to those made without nudging, signifying that nudging leads to more ideas being generated in the conversation without losing overall quality.

Investigating What Factors Influence Users’ Rating of Harmful Algorithmic Bias and Discrimination

Sara Kingsley — 2024-10-14

There has been growing recognition of the crucial role users, especially those from marginalized groups, play in uncovering harmful algorithmic biases. However, it remains unclear how users’ identities and experiences might impact their rating of harmful biases. We present an online experiment (N=2,197) examining these factors: demographics, discrimination experiences, and social and technical knowledge. Participants were shown examples of image search results, including ones that previous literature has identified as biased against marginalized racial, gender, or sexual orientation groups. We found participants from marginalized gender or sexual orientation groups were more likely to rate the examples as more severely harmful. Belonging to marginalized races did not have a similar pattern. Additional factors affecting users’ ratings included discrimination experiences, and having friends or family belonging to marginalized demographics. A qualitative analysis offers insights into users' bias recognition, and why they see biases the way they do. We provide guidance for designing future methods to support effective user-driven auditing.

Combining Human and AI Strengths in Object Counting under Information Asymmetry

Songyu Liu — 2024-10-14

With the recent development of artificial intelligence (AI), hybrid human-AI teams have gained more attention and have been employed to solve all kinds of problems. However, existing research tends to focus on the setting where the same task is given to humans and AI. This work investigates a scenario where different agents have access to different types of information regarding the same underlying problem. We propose a probabilistic framework that combines the predictions of humans and AI based on the quality of the information given by both agents. We apply this framework to a regression task in which humans and AI are given different views of a jar and aim to estimate the number of objects in it. We demonstrate that our model can outperform methods that ignore information asymmetry. Furthermore, we show that complementarity can be achieved, i.e., combining human and AI predictions leads to better performance than relying on humans or AI alone. This framework can be adapted to solve other problems in which different sources of information from multiple agents are present.

Mix and Match: Characterizing Heterogeneous Human Behavior in AI-assisted Decision Making

Zhuoran Lu — 2024-10-14

AI-assisted decision-making systems hold immense potential to enhance human judgment, but their effectiveness is often hindered by a lack of understanding of the diverse ways in which humans take AI recommendations. Current research frequently relies on simplified, ``one-size-fits-all'' models to characterize an average human decision-maker, thus failing to capture the heterogeneity of people's decision-making behavior when incorporating AI assistance. To address this, we propose Mix and Match (M&M), a novel computational framework that explicitly models the diversity of human decision-makers and their unique patterns of relying on AI assistance. M&M represents the population of decision-makers as a mixture of distinct decision-making processes, with each process corresponding to a specific type of decision-maker. This approach enables us to infer latent behavioral patterns from limited data of human decisions under AI assistance, offering valuable insights into the cognitive processes underlying human-AI collaboration. Using real-world behavioral data, our empirical evaluation demonstrates that M&M consistently outperforms baseline methods in predicting human decision behavior. Furthermore, through a detailed analysis of the decision-maker types identified in our framework, we provide quantitative insights into nuanced patterns of how different individuals adopt AI recommendations. These findings offer implications for designing personalized and effective AI systems based on the diverse landscape of human behavior patterns in AI-assisted decision-making across various domains.

Utility-Oriented Knowledge Graph Accuracy Estimation with Limited Annotations: A Case Study on DBpedia

Stefano Marchesin — 2024-10-14

Knowledge Graphs (KGs) are essential for applications like search, recommendation, and virtual assistants, where their accuracy directly impacts effectiveness. However, due to their large-scale and ever-evolving nature, it is impractical to manually evaluate all KG contents. We propose a framework that employs sampling, estimation, and active learning to audit KG accuracy in a cost-effective manner. The framework prioritizes KG facts based on their utility to downstream tasks. We applied the framework to DBpedia and gathered annotations from both expert and layman annotators. We also explored the potential of Large Language Models (LLMs) as KG evaluators, showing that while they can perform comparably to low-quality human annotators, they tend to overestimate KG accuracy. As such, LLMs are currently insufficient to replace human crowdworkers in the evaluation process. The results also provide insights into the scalability of methods for auditing KGs.

Assessing Educational Quality: Comparative Analysis of Crowdsourced, Expert, and AI-Driven Rubric Applications

Steven Moore — 2024-10-14

Exposing students to low-quality assessments such as multiple-choice questions (MCQs) and short answer questions (SAQs) is detrimental to their learning, making it essential to accurately evaluate these assessments. Existing evaluation methods are often challenging to scale and fail to consider their pedagogical value within course materials. Online crowds offer a scalable and cost-effective source of intelligence, but often lack necessary domain expertise. Advancements in Large Language Models (LLMs) offer automation and scalability, but may also lack precise domain knowledge. To explore these trade-offs, we compare the effectiveness and reliability of crowdsourced and LLM-based methods for assessing the quality of 30 MCQs and SAQs across six educational domains using two standardized evaluation rubrics. We analyzed the performance of 84 crowdworkers from Amazon's Mechanical Turk and Prolific, comparing their quality evaluations to those made by the three LLMs: GPT-4, Gemini 1.5 Pro, and Claude 3 Opus. We found that crowdworkers on Prolific consistently delivered the highest-quality assessments, and GPT-4 emerged as the most effective LLM for this task. Our study reveals that while traditional crowdsourced methods often yield more accurate assessments, LLMs can match this accuracy in specific evaluative criteria. These results provide evidence for a hybrid approach to educational content evaluation, integrating the scalability of AI with the nuanced judgment of humans. We offer feasibility considerations in using AI to supplement human judgment in educational assessment.

Predicting and Understanding Human Action Decisions: Insights from Large Language Models and Cognitive Instance-Based Learning

Thuy Ngoc Nguyen — 2024-10-14

Large Language Models (LLMs) excel in tasks from translation to complex reasoning. For AI systems to help effectively, understanding and predicting human behavior and biases is essential. However, it remains an open question whether LLMs can achieve this goal. This paper addresses this gap by leveraging the reasoning and generative capabilities of LLMs to predict human behavior in two sequential decision-making tasks. These tasks involve balancing between exploratory and exploitative actions and handling delayed feedback, which is essential for simulating real-life decision processes. We compare the performance of LLMs with a cognitive instance-based learning (IBL) model, which imitates human experiential decision-making. Our findings indicate that LLMs excel at rapidly incorporating feedback to enhance prediction accuracy. In contrast, the IBL model better accounts for human exploratory behaviors and effectively captures loss aversion bias — the tendency to choose a sub-optimal goal with fewer step-cost penalties rather than exploring to find the optimal choice, even with limited experience. The results highlight the benefits of integrating LLMs with cognitive architectures, suggesting that this synergy could enhance the modeling and understanding of complex human decision-making patterns.

User Profiling in Human-AI Design: An Empirical Case Study of Anchoring Bias, Individual Differences, and AI Attitudes

Mahsan Nourani — 2024-10-14

People form perceptions and interpretations of AI through external sources prior to their interaction with new technology. For example, shared anecdotes and media stories influence prior beliefs that may or may not accurately represent the true nature of AI systems. We hypothesize people's prior perceptions and beliefs will affect human-AI interactions and usage behaviors when using new applications. This paper presents a user experiment to explore the interplay between user's pre-existing beliefs about AI technology, individual differences, and previously established sources of cognitive bias from first impressions with an interactive AI application. We employed questionnaire measures as features to categorize users into profiles based on their prior beliefs and attitudes about technology. In addition, participants were assigned to one of two controlled conditions designed to evoke either positive or negative first impressions during an AI-assisted judgment task using an interactive application. The experiment and results provide empirical evidence that profiling users by surveying them on their prior beliefs and differences can be a beneficial approach for bias (and/or unanticipated usage) mitigation instead of seeking one-size-fits-all solutions.

Responsible Crowdsourcing for Responsible Generative AI: Engaging Crowds in AI Auditing and Evaluation

Wesley Hanwen Deng — 2024-10-14

With the rise of generative AI (GenAI), there has been an increased need for participation by large and diverse user bases in AI evaluation and auditing. GenAI developers are increasingly adopting crowdsourcing approaches to test and audit their AI products and services. However, it remains an open question how to design and deploy responsible and effective crowdsourcing pipelines for AI auditing and evaluation. This workshop aims to take a step towards bridging this gap. Our interdisciplinary team of organizers will work with workshop participants to explore several key questions, such as how to improve the output quality and workers' productivity for GenAI evaluation crowdsourcing tasks compared to discriminative AI systems, how to guide crowds in auditing problematic AI-generated content while managing their psychological impact, ensuring marginalized voices are heard, and setting up responsible and effective crowdsourcing pipelines for real-world GenAI evaluation. We hope this workshop will produce a research agenda and best practices for designing responsible crowd-based approaches to AI auditing and evaluation.

PACE: Participatory AI for Community Engagement

Saad Hassan — 2024-10-14

Public sector leverages artificial intelligence (AI) to enhance the efficiency, transparency, and accountability of civic operations and public services. This includes initiatives such as predictive waste management, facial recognition for identification, and advanced tools in the criminal justice system. While public-sector AI can improve efficiency and accountability, it also has the potential to perpetuate biases, infringe on privacy, and marginalize vulnerable groups. Responsible AI (RAI) research aims to address these concerns by focusing on fairness and equity through participatory AI. We invite researchers, community members, and public sector workers to collaborate on designing, developing, and deploying RAI systems that enhance public sector accountability and transparency. Key topics include raising awareness of AI's impact on the public sector, improving access to AI auditing tools, building public engagement capacity, fostering early community involvement to align AI innovations with public needs, and promoting accessible and inclusive participation in AI development. The workshop will feature two keynotes, two short paper sessions, and three discussion-oriented activities. Our goal is to create a platform for exchanging ideas and developing strategies to design community-engaged RAI systems while mitigating the potential harms of AI and maximizing its benefits in the public sector.

Human Computation, Equitable, and Innovative Future of Work AI Tools

Kashif Imteyaz — 2024-10-14

As we enter an era where the synergy between AI technologies and human effort is paramount, the Future of Work is undergoing a radical transformation. Emerging AI tools will profoundly influence how we work, the tools we use, and the very nature of work itself. The ’Human Computation, Equitable, and Innovative Future of Work AI Tools’ workshop at HCOMP’24 aims to explore groundbreaking solutions for developing fair and inclusive AI tools that shape how we will work. This workshop will delve into the collaborative potential of human computation and artificial intelligence in crafting equitable Future of Work AI tools. Participants will critically examine the current challenges in designing fair and innovative AI systems for the evolving workplace, as well as strategies for effectively integrating human insights into these tools. The primary objective is to foster a rich discourse on scalable, sustainable solutions that promote equitable Future of Work tools for all, with a particular focus on empowering marginalized communities. By bringing together experts from diverse fields, we aim to catalyze ideas that bridge the gap between technological advancement and social equity.