Proceedings of the AAAI Conference on Human Computation and Crowdsourcing https://ojs.aaai.org/index.php/HCOMP The Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP) disseminates the latest research findings on human computation and crowdsourcing. While artificial intelligence (AI) and human-computer interaction (HCI) represent traditional mainstays, the papers in the HCOMP proceedings reflect the AAAI conference's broad, interdisciplinary research. The field is distinctive in the diversity of disciplines it draws upon and contributes to, ranging from human-centered qualitative studies and HCI design, to computer science and artificial intelligence, to economics and the social sciences, all the way to digital humanities, policy, and ethics. The papers in the proceedings represent the exchange of advances in human computation and crowdsourcing not only among researchers, but also among engineers and practitioners, thus encouraging dialogue across disciplines and communities of practice. publications@aaai.org (Managing Editor) publications@aaai.org (AAAI Publications) Fri, 03 Nov 2023 13:33:50 -0700
Selective Concept Models: Permitting Stakeholder Customisation at Test-Time https://ojs.aaai.org/index.php/HCOMP/article/view/27543 Concept-based models perform prediction using a set of concepts that are interpretable to stakeholders. However, such models often involve a fixed, large number of concepts, which may place a substantial cognitive load on stakeholders. We propose Selective COncept Models (SCOMs), which make predictions using only a subset of concepts and can be customised by stakeholders at test-time according to their preferences. We show that SCOMs only require a fraction of the total concepts to achieve optimal accuracy on multiple real-world datasets. Further, we collect and release a new dataset, CUB-Sel, consisting of human concept set selections for 900 bird images from the popular CUB dataset. Using CUB-Sel, we show that humans have unique individual preferences for the choice of concepts they prefer to reason about, and struggle to identify the most theoretically informative concepts. The customisation and concept selection provided by SCOMs improve the efficiency of interpretation and intervention for stakeholders. Matthew Barker, Katherine M. Collins, Krishnamurthy Dvijotham, Adrian Weller, Umang Bhatt Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27543 Fri, 03 Nov 2023 00:00:00 -0700
Informing Users about Data Imputation: Exploring the Design Space for Dealing With Non-Responses https://ojs.aaai.org/index.php/HCOMP/article/view/27544 Machine learning algorithms often require quantitative ratings from users to effectively predict helpful content. When these ratings are unavailable, systems make implicit assumptions or imputations to fill in the missing information; however, users are generally kept unaware of these processes. In our work, we explore ways of informing users about system imputations, and experiment with imputed ratings and various explanations required by users to correct imputations. We investigate these approaches through the deployment of a text messaging probe to 26 participants to help them manage psychological wellbeing. We provide quantitative results to report users' reactions to correct vs. incorrect imputations and potential risks of biasing their ratings.
Using semi-structured interviews with participants, we characterize the potential trade-offs regarding user autonomy, and draw insights about alternative ways of involving users in the imputation process. Our findings provide useful directions for future research on communicating system imputation and interpreting user non-responses. Ananya Bhattacharjee, Haochen Song, Xuening Wu, Justice Tomlinson, Mohi Reza, Akmar Ehsan Chowdhury, Nina Deliu, Thomas W. Price, Joseph Jay Williams Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27544 Fri, 03 Nov 2023 00:00:00 -0700
Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees https://ojs.aaai.org/index.php/HCOMP/article/view/27545 We consider the problem of clustering n items into K disjoint clusters using noisy answers from crowdsourced workers to pairwise queries of the type: “Are items i and j from the same cluster?” We propose a novel, practical, simple, and computationally efficient active querying algorithm for crowdsourced clustering. Furthermore, our algorithm does not require knowledge of unknown problem parameters. We show that our algorithm succeeds in recovering the clusters when the crowdworkers provide answers with an error probability less than 1/2, and we provide sample complexity bounds on the number of queries made by our algorithm to guarantee successful clustering. While the bounds depend on the error probabilities, the algorithm itself does not require this knowledge. In addition to the theoretical guarantee, we implement and deploy the proposed algorithm on a real crowdsourcing platform to characterize its performance in real-world settings. Based on both the theoretical and the empirical results, we observe that while the total number of queries made by the active clustering algorithm is order-wise better than random querying, the advantage is most conspicuous when the datasets have small clusters. For datasets with large enough clusters, passive querying can often be more efficient in practice. Our observations and practically implementable active clustering algorithm can inform and aid the design of real-world crowdsourced clustering systems. We make the dataset collected through this work, along with the code to run such experiments, publicly available. Yi Chen, Ramya Korlakai Vinayak, Babak Hassibi Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27545 Fri, 03 Nov 2023 00:00:00 -0700
How Crowd Worker Factors Influence Subjective Annotations: A Study of Tagging Misogynistic Hate Speech in Tweets https://ojs.aaai.org/index.php/HCOMP/article/view/27546 Crowdsourced annotation is vital both for collecting labelled data to train and test automated content moderation systems and for supporting human-in-the-loop review of system decisions. However, annotation tasks such as judging hate speech are subjective and thus highly sensitive to biases stemming from annotator beliefs, characteristics, and demographics. We conduct two crowdsourcing studies on Mechanical Turk to examine annotator bias in labelling sexist and misogynistic hate speech. Results from 109 annotators show that annotator political inclination, moral integrity, personality traits, and sexist attitudes significantly impact annotation accuracy and the tendency to tag content as hate speech.
In addition, semi-structured interviews with nine crowd workers provide further insights regarding the influence of subjectivity on annotations. In exploring how workers interpret a task (shaped by complex negotiations between platform structures, task instructions, subjective motivations, and external contextual factors), we see annotations not only impacted by worker factors but also simultaneously shaped by the structures under which they labour. Danula Hettiachchi, Indigo Holcombe-James, Stephanie Livingstone, Anjalee de Silva, Matthew Lease, Flora D. Salim, Mark Sanderson Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27546 Fri, 03 Nov 2023 00:00:00 -0700
Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection https://ojs.aaai.org/index.php/HCOMP/article/view/27547 The rapid entry of machine learning approaches into our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and oftentimes datasets collected for a certain purpose or application are reused for a different problem. Additionally, dataset annotations may not be representative over time, contain ambiguous or erroneous annotations, or be unable to generalize across issues or domains. Recent research has shown these practices might lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed in a responsible manner where the quality of the data is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics for an iterative in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform on the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach impacts the assessment of data robustness used for AI applied in the real world, where diversity of users and content is prominent. Furthermore, it deals with fairness and accountability aspects in data collection by providing systematic and transparent quality analysis for data collections. Oana Inel, Tim Draws, Lora Aroyo Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27547 Fri, 03 Nov 2023 00:00:00 -0700
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms https://ojs.aaai.org/index.php/HCOMP/article/view/27548 Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data.
Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about an object detection model. Our results provide positive evidence that these tools offer some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery. Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb, Ameet Talwalkar Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27548 Fri, 03 Nov 2023 00:00:00 -0700
A Task-Interdependency Model of Complex Collaboration Towards Human-Centered Crowd Work (Extended Abstract) https://ojs.aaai.org/index.php/HCOMP/article/view/27549 Mathematical models of crowdsourcing and human computation today largely assume small modular tasks, "computational primitives" such as labels, comparisons, or votes requiring little coordination. However, while these models have successfully shown how crowds can accomplish significant objectives, they can inadvertently advance a less-than-human view of crowd workers, where workers are treated as low skilled, replaceable, and untrustworthy, carrying out simple tasks in online labor markets for low pay under algorithmic management. They also fail to capture the unique human capacity for complex collaborative work, where the main concerns are how to effectively structure, delegate, and collaborate on work that may be large in scope, underdefined, and highly interdependent. We present a model centered on interdependencies, a phenomenon well understood to be at the core of collaboration, which allows one to formally reason about diverse challenges to complex collaboration. Our model represents tasks as an interdependent collection of subtasks, formalized as a task graph. Each node is a subtask with an arbitrary size parameter. Interdependencies, represented as node and edge weights, impose costs on workers who need to spend time absorbing the context of relevant work. Importantly, workers do not have to pay this context cost for work they did themselves. To illustrate its use, we apply the model to several diverse aspects of complex collaboration. We examine the limits of scaling complex crowd work, showing how high interdependencies and low task granularity bound work capacity to a constant factor of the contributions of top workers, which is in turn limited when workers are short-term novices. We examine recruitment and upskilling, showing the outsized role top workers play in determining work capacity, and surfacing insights on situated learning through a stylized model of legitimate peripheral participation (LPP). Finally, we turn to the economy as a setting where complex collaborative work already exists, using our model to explore the relationship between coordination intensity and occupational wages.
Using occupational data from O*NET and the Bureau of Labor Statistics, we introduce a new index of occupational coordination intensity and validate the predicted positive correlation. We find preliminary evidence that occupations with higher coordination intensity are more resistant to displacement by AI, based on historical growth in automation and OpenAI data on LLM exposure. Our hope is to spur further development of models that emphasize the collaborative capacities of human workers, bridge models of crowd work and traditional work, and promote AI in roles augmenting human collaboration. The full paper can be found at: https://doi.org/10.48550/arXiv.2309.00160. David T. Lee, Christos A. Makridis Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27549 Fri, 03 Nov 2023 00:00:00 -0700
Task as Context: A Sensemaking Perspective on Annotating Inter-Dependent Event Attributes with Non-Experts https://ojs.aaai.org/index.php/HCOMP/article/view/27550 This paper explores the application of sensemaking theory to support non-expert crowds in intricate data annotation tasks. We investigate the influence of procedural context and data context on the annotation quality of novice crowds, defining procedural context as completing multiple related annotation tasks on the same data point, and data context as annotating multiple data points with semantic relevance. We conducted a controlled experiment involving 140 non-expert crowd workers, who generated 1400 event annotations across various procedural and data context levels. Assessments of annotations demonstrate that high procedural context positively impacts annotation quality, although this effect diminishes with lower data context. Notably, assigning multiple related tasks to novice annotators yields quality comparable to expert annotations, without requiring additional time or effort. We discuss the trade-offs associated with procedural and data contexts and draw design implications for engaging non-experts in crowdsourcing complex annotation tasks. Tianyi Li, Ping Wang, Tian Shi, Yali Bian, Andy Esakia Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27550 Fri, 03 Nov 2023 00:00:00 -0700
BackTrace: A Human-AI Collaborative Approach to Discovering Studio Backdrops in Historical Photographs https://ojs.aaai.org/index.php/HCOMP/article/view/27551 In historical photo research, the presence of painted backdrops has the potential to help identify subjects, photographers, locations, and events surrounding certain photographs. However, there are few dedicated tools or resources available to aid researchers in this largely manual task. In this paper, we propose BackTrace, a human-AI collaboration system that employs a three-step workflow to retrieve and organize historical photos with similar backdrops. BackTrace is a content-based image retrieval (CBIR) system powered by deep learning that allows for the iterative refinement of search results via user feedback. We evaluated BackTrace with a mixed-methods evaluation and found that it successfully aided users in finding photos with similar backdrops and grouping them into collections. Finally, we discuss how our findings can be applied to other domains, as well as implications of deploying BackTrace as a crowdsourcing system.
Jude Lim, Vikram Mohanty, Terryl Dodson, Kurt Luther Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27551 Fri, 03 Nov 2023 00:00:00 -0700
Rethinking Quality Assurance for Crowdsourced Multi-ROI Image Segmentation https://ojs.aaai.org/index.php/HCOMP/article/view/27552 Collecting high-quality annotations to construct an evaluation dataset is essential for assessing the true performance of machine learning models. One popular way of performing data annotation is via crowdsourcing, where quality can be of concern. Despite much prior work addressing the annotation quality problem in crowdsourcing generally, little has been discussed in detail for image segmentation tasks. These tasks often require pixel-level annotation accuracy and are relatively complex when compared to image classification or object detection with bounding boxes. In this paper, we focus on image segmentation annotation via crowdsourcing, where images may not have been collected in a controlled way. In this setting, the task of annotating may be non-trivial, and annotators may experience difficulty in differentiating between regions-of-interest (ROIs) and background pixels. We implement an annotation process on a medical image annotation task and examine the effectiveness of several in-situ and manual quality assurance and quality control mechanisms. Our observations on this task are three-fold. Firstly, including an onboarding and a pilot phase improves quality assurance, as annotators can familiarize themselves with the task, especially when the definition of ROIs is ambiguous. Secondly, we observe high variability in annotation times, leading us to believe that annotation time cannot be relied upon as a source of information for quality control. When performing agreement analysis, we also show that global-level inter-rater agreement is insufficient to provide useful information, especially when annotator skill levels vary. Thirdly, we recognize that reviewing all annotations can be time-consuming and often infeasible, and there currently exist no mechanisms to reduce the workload for reviewers. Therefore, we propose a method to create a priority list of images for review based on inter-rater agreement. Our experiments suggest that this method can be used to improve reviewer efficiency when compared to a baseline approach, especially if a fixed work budget is required. Xiaolu Lu, David Ratcliffe, Tsu-Ting Kao, Aristarkh Tikhonov, Lester Litchfield, Craig Rodger, Kaier Wang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27552 Fri, 03 Nov 2023 00:00:00 -0700
Accounting for Transfer of Learning Using Human Behavior Models https://ojs.aaai.org/index.php/HCOMP/article/view/27553 An important characteristic of human learning and decision-making is the flexibility with which we rapidly adapt to novel tasks. To this day, models of human behavior have been unable to emulate the ease and success with which humans transfer knowledge from one context to another. Humans rely on a lifetime of experience and a variety of cognitive mechanisms that are difficult to represent computationally.
To address this problem, we propose a novel human behavior model that accounts for human transfer of learning using three mechanisms: compositional reasoning, causal inference, and optimal forgetting. To evaluate the proposed model, we introduce an experimental task designed to elicit human transfer of learning under different conditions. Our proposed model demonstrates a more human-like transfer of learning compared to models that optimize transfer or human behavior models that do not directly account for transfer of learning. Ablation testing of the proposed model and a systematic comparison to human data demonstrate the importance of each component of the cognitive model underlying the transfer of learning. Tyler Malloy, Yinuo Du, Fei Fang, Cleotilde Gonzalez Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27553 Fri, 03 Nov 2023 00:00:00 -0700
A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity https://ojs.aaai.org/index.php/HCOMP/article/view/27554 Hybrid human-ML systems increasingly make consequential decisions in a wide range of domains. These systems are often introduced with the expectation that the combined human-ML system will achieve complementary performance, that is, the combined decision-making system will be an improvement compared with either decision-making agent in isolation. However, empirical results have been mixed, and existing research rarely articulates the sources and mechanisms by which complementary performance is expected to arise. Our goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, we propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ. In doing so, we conceptually map potential mechanisms by which combining human and ML decision-making may yield complementary performance, developing a language for the research community to reason about the design of hybrid systems in any decision-making domain. To illustrate how our taxonomy can be used to investigate complementarity, we provide a mathematical aggregation framework to examine enabling conditions for complementarity. Through synthetic simulations, we demonstrate how this framework can be used to explore specific aspects of our taxonomy and shed light on the optimal mechanisms for combining human and ML judgments. Charvi Rastogi, Liu Leqi, Kenneth Holstein, Hoda Heidari Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27554 Fri, 03 Nov 2023 00:00:00 -0700
Characterizing Time Spent in Video Object Tracking Annotation Tasks: A Study of Task Complexity in Vehicle Tracking https://ojs.aaai.org/index.php/HCOMP/article/view/27555 Video object tracking annotation tasks are a form of complex data labeling that is inherently tedious and time-consuming. Prior studies of these tasks focus primarily on the quality of the provided data, leaving much to be learned about how the data was generated and the factors that influenced its generation. In this paper, we take steps toward this goal by examining how human annotators spend their time in the context of a video object tracking annotation task.
We situate our study in the context of a standard vehicle tracking task with bounding box annotation. Within this setting, we study the role of task complexity by controlling two dimensions of task design, label constraint and label granularity, in conjunction with worker experience. Using telemetry and survey data collected from 40 full-time data annotators at a large technology corporation, we find that each dimension of task complexity uniquely affects how annotators spend their time, not only during the task but also before it begins. Furthermore, we find significant misalignment between how time use was observed and how it was self-reported. We conclude by discussing the implications of our findings in the context of video object tracking and the need to better understand how productivity can be defined in data annotation. Amy Rechkemmer, Alex C. Williams, Matthew Lease, Li Erran Li Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27555 Fri, 03 Nov 2023 00:00:00 -0700
Humans Forgo Reward to Instill Fairness into AI https://ojs.aaai.org/index.php/HCOMP/article/view/27556 In recent years, artificial intelligence (AI) has become an integral part of our daily lives, assisting us with decision making. During such interactions, AI algorithms often use human behavior as training input. Therefore, it is important to understand whether people change their behavior when they train AI and whether they continue to do so when training does not benefit them. In this work, we conduct behavioral experiments in the context of the ultimatum game to answer these questions. In our version of this game, participants were asked to decide whether to accept or reject proposals of monetary splits made by either other human participants or AI. Some participants were informed that their choices would be used to train AI, while others did not receive this information. In the first experiment, we found that participants were willing to sacrifice personal earnings to train AI to be fair, becoming less inclined to accept unfair offers. The second experiment replicated and expanded upon this finding, revealing that participants were motivated to train AI even if they would never encounter it in the future. These findings demonstrate that humans are willing to incur costs to change AI algorithms. Moreover, they suggest that human behavior during AI training does not necessarily align with baseline preferences. This observation poses a challenge for AI development, revealing that it is important for AI algorithms to account for their influence on behavior when recommending choices. Lauren S. Treiman, Chien-Ju Ho, Wouter Kool Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27556 Fri, 03 Nov 2023 00:00:00 -0700
Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts? https://ojs.aaai.org/index.php/HCOMP/article/view/27557 Advances in Large Language Models (e.g., GPT-4, LLaMA) have improved the generation of coherent sentences resembling human writing on a large scale, resulting in the creation of so-called deepfake texts. However, this progress poses security and privacy concerns, necessitating effective solutions for distinguishing deepfake texts from human-written ones.
Although prior work has studied humans' ability to detect deepfake texts, none has examined whether “collaboration” among humans improves the detection of deepfake texts. In this study, to address this gap in understanding, we conducted experiments with two groups: (1) non-expert individuals from the AMT platform and (2) writing experts from the Upwork platform. The results demonstrate that collaboration among humans can potentially improve the detection of deepfake texts for both groups, increasing detection accuracies by 6.36% for non-experts and 12.76% for experts, respectively, compared to individuals' detection accuracies. We further analyze the explanations that humans used for detecting a piece of text as deepfake text, and find that the strongest indicator of deepfake texts is their lack of coherence and consistency. Our study provides useful insights for future tools and framework designs to facilitate the collaborative human detection of deepfake texts. The experiment datasets and AMT implementations are available at: https://github.com/huashen218/llm-deepfake-human-study.git Adaku Uchendu, Jooyoung Lee, Hua Shen, Thai Le, Ting-Hao 'Kenneth' Huang, Dongwon Lee Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27557 Fri, 03 Nov 2023 00:00:00 -0700
A Cluster-Aware Transfer Learning for Bayesian Optimization of Personalized Preference Models https://ojs.aaai.org/index.php/HCOMP/article/view/27558 Obtaining personalized models of the crowd is an important issue in various applications, such as preference acquisition and user interaction customization. However, the crowd setting, in which we assume we have little knowledge about the person, brings the cold start problem, which may cause avoidable unpreferable interactions with people. This paper proposes a cluster-aware transfer learning method for the Bayesian optimization of personalized models. The proposed method, called Cluster-aware Bayesian Optimization, is designed based on a known feature: user preferences are not completely independent but can be divided into clusters. It exploits the clustering information to efficiently find the preferences of the crowd while avoiding unpreferable interactions. The results of our extensive experiments with different data sets show that the method is efficient for finding the most preferable items and effective in reducing the number of unpreferable interactions. Haruto Yamasaki, Masaki Matsubara, Hiroyoshi Ito, Yuta Nambu, Masahiro Kohjima, Yuki Kurauchi, Ryuji Yamamoto, Atsuyuki Morishima Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27558 Fri, 03 Nov 2023 00:00:00 -0700
Confidence Contours: Uncertainty-Aware Annotation for Medical Semantic Segmentation https://ojs.aaai.org/index.php/HCOMP/article/view/27559 Medical image segmentation modeling is a high-stakes task where understanding of uncertainty is crucial for addressing visual ambiguity. Prior work has developed segmentation models utilizing probabilistic or generative mechanisms to infer uncertainty from labels where annotators draw a singular boundary. However, as these annotations cannot represent an individual annotator's uncertainty, models trained on them produce uncertainty maps that are difficult to interpret.
We propose a novel segmentation representation, Confidence Contours, which uses high- and low-confidence "contours" to capture uncertainty directly, and develop a novel annotation system for collecting contours. We conduct an evaluation on the Lung Image Dataset Consortium (LIDC) and a synthetic dataset. From an annotation study with 30 participants, results show that Confidence Contours provide high representative capacity without considerably higher annotator effort. We also find that general-purpose segmentation models can learn Confidence Contours at the same performance level as standard singular annotations. Finally, from interviews with five medical experts, we find that Confidence Contour maps are more interpretable than Bayesian maps due to their representation of structural uncertainty. Andre Ye, Quan Ze Chen, Amy Zhang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27559 Fri, 03 Nov 2023 00:00:00 -0700
A Crowd–AI Collaborative Approach to Address Demographic Bias for Student Performance Prediction in Online Education https://ojs.aaai.org/index.php/HCOMP/article/view/27560 Recent advances in artificial intelligence (AI) and crowdsourcing have shown success in enhancing learning experiences and outcomes in online education. This paper studies a student performance prediction problem where the objective is to predict students' outcomes in online courses based on their behavioral data. In particular, we focus on addressing the limitation of current student performance prediction solutions that often make inaccurate predictions for students from underrepresented demographic groups due to the lack of training data and differences in behavioral patterns across groups. We develop DebiasEdu, a crowd–AI collaborative debiasing framework that melds AI and crowd intelligence through 1) a novel gradient-based bias identification mechanism and 2) a bias-aware crowdsourcing interface and bias calibration design to achieve an accurate and fair student performance prediction. Evaluation results on two online courses demonstrate that DebiasEdu consistently outperforms state-of-the-art AI, fair AI, and crowd–AI baselines by achieving an optimized student performance prediction in terms of both accuracy and fairness. Ruohan Zong, Yang Zhang, Frank Stinar, Lanyu Shang, Huimin Zeng, Nigel Bosch, Dong Wang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/HCOMP/article/view/27560 Fri, 03 Nov 2023 00:00:00 -0700
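To make the kind of demographic accuracy gap targeted by the DebiasEdu entry above concrete, the following minimal Python sketch computes per-group accuracy and the worst-case gap for a hypothetical student-performance classifier. It is an illustration only, with invented group labels, predictions, and outcomes; it does not reproduce the paper's gradient-based bias identification mechanism or its crowdsourcing interface.

    # Illustrative sketch only: per-group accuracy audit for a hypothetical
    # student-performance classifier. Group labels, predictions, and outcomes
    # are invented; this is not the DebiasEdu method described above.
    from collections import defaultdict

    def per_group_accuracy(y_true, y_pred, groups):
        """Return accuracy per demographic group and the largest accuracy gap."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for truth, pred, group in zip(y_true, y_pred, groups):
            total[group] += 1
            correct[group] += int(truth == pred)
        acc = {g: correct[g] / total[g] for g in total}
        gap = max(acc.values()) - min(acc.values())
        return acc, gap

    if __name__ == "__main__":
        # Hypothetical pass/fail outcomes (1 = pass), model predictions, and groups.
        y_true = [1, 0, 1, 1, 0, 1, 0, 1]
        y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
        groups = ["A", "A", "A", "B", "B", "B", "B", "A"]
        acc, gap = per_group_accuracy(y_true, y_pred, groups)
        print(acc, gap)  # {'A': 1.0, 'B': 0.5} and a 0.5 accuracy gap

A fairness-aware pipeline of the kind the abstract describes would monitor such a gap alongside overall accuracy and then apply its own correction step to reduce it.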