Proceedings of the International AAAI Conference on Web and Social Media https://ojs.aaai.org/index.php/ICWSM <p>The proceedings of the International AAAI Conference on Web and Social Media (ICWSM) provide an archival record of the ICWSM conference — a forum where researchers from multiple disciplines come together to share knowledge, discuss ideas, exchange information, and learn about cutting-edge research in diverse fields with the common theme of online social media. This overall theme includes research on new perspectives in social theories, as well as computational algorithms for analyzing social media. ICWSM is a singularly fitting venue for research that blends social science and computational approaches to answer important and challenging questions about human social behavior through social media while advancing computational tools for vast and unstructured data.</p> en-US Mon, 05 Jun 2023 00:00:00 -0700 Erratum to: Rules and Rule-Making in the Five Largest Wikipedias https://ojs.aaai.org/index.php/ICWSM/article/view/27319 <div class="c-article-header"> <div class="u-mb-8 c-status-message c-status-message--boxed c-status-message--info"> <p class="u-mt-0"><em>The <a class="relation-link" href="https://doi.org/10.1609/icwsm.v16i1.19297" data-track="click" data-track-action="view linked article" data-track-label="link">Original Article</a> was published on 31 May 2023.</em></p> </div> </div> Sohyeon Hwang, Aaron Shaw Copyright (c) 2023 Proceedings of the International AAAI Conference on Web and Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/27319 Mon, 10 Jul 2023 00:00:00 -0700 How Do US Congress Members Advertise Climate Change: An Analysis of Ads Run on Meta’s Platforms https://ojs.aaai.org/index.php/ICWSM/article/view/22121 Ensuring transparency and integrity in political communication on climate change has arguably never been more important than today.
Yet we know little about how politicians focus on, talk about, and portray climate change on social media. Here we study this question from the perspective of political advertising. We use Meta’s Ad Library to collect 602,546 ads issued by US Congress members since mid-2018. Of those, only 19,176 (3.2%) are climate-related. Analyzing this data, we find that Democrats focus substantially more on climate change than Republicans, with 99.7% of all climate-related ads stemming from Democratic politicians. In particular, we find this is driven by a small core of Democratic politicians: 72% of all impressions can be attributed to just 10 politicians. Interestingly, we find a significant difference between the two parties in the average number of impressions generated per dollar spent. Republicans generate, on average, 188% more impressions with their climate ads than Democrats do for the same money spent. We build models to explain the differences and find that demographic factors only partially explain the variance. Our results demonstrate differences in the climate-related advertising of US Congress members and reveal differences in advertising characteristics between the two political parties. We anticipate our work to be a starting point for further studies of climate-related ads on Meta’s platforms. Laurenz Aisenpreis, Gustav Gyrst, Vedran Sekara Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22121 Fri, 02 Jun 2023 00:00:00 -0700 The Pursuit of Peer Support for Opioid Use Recovery on Reddit https://ojs.aaai.org/index.php/ICWSM/article/view/22122 Individuals suffering from Opioid Use Disorder and other socially stigmatized conditions often rely on peer support groups to find comfort and motivation while treating their condition. Many may face barriers to accessing peer support treatment, such as shame and social stigma, seclusion, or mobility restrictions.
In this study, we quantitatively characterize the potential of the Reddit community in offering these individuals an online alternative to receiving peer support. By analyzing the social interactions of thousands of users during the start of opioid use recovery, we uncover that a particular Reddit community exhibits many characteristics similar to in-person peer support groups, featuring the exchange of support, trust, status, and similar experiences. We find that the supportive behavior of this community nudges users to change their personal behavior and promotes abandoning opioid-related communities in favor of recovery-oriented relationships. Finally, we find that recognition, acknowledgment, and knowledge exchange are the most relevant factors in sustained engagement with the recovery community. Given this evidence, we suggest that this online community may constitute a complement or a surrogate to peer support groups when in-person meetings are not desirable or possible. Our work might inspire harm reduction policies and interventions to favor successful rehabilitation and is fundamental for future research about the use of digital media for recovery support. Duilio Balsamo, Paolo Bajardi, Gianmarco De Francisci Morales, Corrado Monti, Rossano Schifanella Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22122 Fri, 02 Jun 2023 00:00:00 -0700 Exposure to Marginally Abusive Content on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22123 Social media platforms can help people find connection and entertainment, but they can also show potentially abusive content such as insults and targeted cursing. While platforms do remove some abusive content for rule violations, some content is considered "margin content" that does not violate any rules and thus stays on the platform.
This paper presents a focused analysis of exposure to such content on Twitter, asking (RQ1) how exposure to marginally abusive content varies across Twitter users, and (RQ2) how algorithmically-ranked timelines impact exposure to marginally abusive content. Based on one month of impression data from November 2021, descriptive analyses (RQ1) show significant variation in exposure, with more active users experiencing higher rates and higher volumes of marginal impressions. Experimental analyses (RQ2) show that users with algorithmically-ranked timelines experience slightly lower rates of marginal impressions. However, they tend to register more total impression activity and thus experience a higher cumulative volume of marginal impressions. The paper concludes by discussing implications of the observed concentration, the multifaceted impact of algorithmically-ranked timelines, and potential directions for future work. Jack Bandy, Tomo Lazovich Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22123 Fri, 02 Jun 2023 00:00:00 -0700 Finding Qs: Profiling QAnon Supporters on Parler https://ojs.aaai.org/index.php/ICWSM/article/view/22124 The social media platform "Parler" has emerged as a prominent fringe community where a significant share of the user base consists of self-reported supporters of QAnon, a far-right conspiracy theory alleging that a cabal of elites controls global politics. QAnon is considered to have had an influential role in the public discourse during the 2020 U.S. presidential election. However, little is known about QAnon supporters on Parler and what sets them apart from other users. Building on social identity theory, we aim to profile the characteristics of QAnon supporters on Parler. We analyze a large-scale dataset with more than 600,000 profiles of English-speaking users on Parler.
Based on users' profiles, posts, and comments, we then extract a comprehensive set of user features, linguistic features, network features, and content features. This allows us to perform user profiling and understand to what extent these features discriminate between QAnon and non-QAnon supporters on Parler. Our analysis is three-fold: (1) We quantify the number of QAnon supporters on Parler, finding that 34,913 users (5.5% of all users) openly report supporting the conspiracy. (2) We examine differences between QAnon and non-QAnon supporters. We find that QAnon supporters differ statistically significantly from non-QAnon supporters across multiple dimensions. For example, they have, on average, a larger number of followers, followees, and posts, and thus have a large impact on the Parler network. (3) We use machine learning to identify which user characteristics discriminate QAnon from non-QAnon supporters. We find that user features, linguistic features, network features, and content features can, to a large extent, discriminate QAnon from non-QAnon supporters on Parler. In particular, we find that user features are highly discriminatory, followed by content features and linguistic features. Dominik Bär, Nicolas Pröllochs, Stefan Feuerriegel Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22124 Fri, 02 Jun 2023 00:00:00 -0700 Predicting Future Location Categories of Users in a Large Social Platform https://ojs.aaai.org/index.php/ICWSM/article/view/22125 Understanding users' patterns of visiting various location categories can help online platforms improve content personalization and user experiences. Current literature on predicting future location categories of a user typically employs features that can be traced back to the user, such as spatial geo-coordinates and demographic identities.
Moreover, existing approaches commonly suffer from cold-start and generalization problems, and often cannot specify when the user will visit the predicted location category. In a large social platform, it is desirable for prediction models to avoid using user-identifiable data, generalize to unseen and new users, and be able to make predictions for specific times in the future. In this work, we construct a neural model, LocHabits, using data from Snapchat. The model omits user-identifiable inputs, leverages temporal and sequential regularities in the location category histories of Snapchat users and their friends, and predicts the users' next-hour location categories. We evaluate our model on several real-life, large-scale datasets from Snapchat and FourSquare, and find that the model outperforms baselines by 14.94% in accuracy. We confirm that the model can (1) generalize to unseen users from different areas and times, and (2) fall back on collective trends in the cold-start scenario. We also study the relative contributions of various factors in making the predictions and find that the users' visitation preferences and most-recent visitation sequences play more important roles than time contexts, same-hour sequences, and social influence features. Raiyan Abdul Baten, Yozen Liu, Heinrich Peters, Francesco Barbieri, Neil Shah, Leonardo Neves, Maarten W. Bos Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22125 Fri, 02 Jun 2023 00:00:00 -0700 Followback Clusters, Satellite Audiences, and Bridge Nodes: Coengagement Networks for the 2020 US Election https://ojs.aaai.org/index.php/ICWSM/article/view/22126 The 2020 United States (US) presidential election was — and has continued to be — the focus of pervasive and persistent mis- and disinformation spreading through our media ecosystems, including social media.
This event has driven the collection and analysis of large, directed social network datasets, but such datasets can resist intuitive understanding. In such large datasets, the overwhelming number of nodes and edges present in typical representations creates visual artifacts, such as densely overlapping edges and tightly-packed formations of low-degree nodes, which obscure many features of more practical interest. We apply a method, coengagement transformations, to convert such networks of social data into tractable images. Intuitively, this approach allows for parameterized network visualizations that make shared audiences of engaged users salient to viewers. Using the interpretative capabilities of this method, we perform an extensive case study of the 2020 United States presidential election on Twitter, contributing an empirical analysis of coengagement. By creating and contrasting networks at different parameter settings, we define and characterize several structures in this discourse network, including bridging accounts, satellite audiences, and followback communities. We discuss the importance and implications of these empirical network features in this context. In addition, we release open-source code for creating coengagement networks from Twitter and other structured interaction data. Andrew Beers, Joseph S. Schafer, Ian Kennedy, Morgan Wack, Emma S. Spiro, Kate Starbird Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22126 Fri, 02 Jun 2023 00:00:00 -0700 Measuring the Ideology of Audiences for Web Links and Domains Using Differentially Private Engagement Data https://ojs.aaai.org/index.php/ICWSM/article/view/22127 This paper demonstrates the use of differentially private hyperlink-level engagement data for measuring ideologies of audiences for web domains, individual links, or aggregations thereof.
We examine a simple metric for measuring this ideological position and assess the conditions under which the metric is robust to injected, privacy-preserving noise. This assessment provides insights into and constraints on the level of activity one should observe when applying this metric to privacy-protected data. Grounding this work is a massive dataset of social media engagement activity where privacy-preserving noise has been injected into the activity data, provided by Facebook and the Social Science One (SS1) consortium. Using this dataset, we validate our ideology measures by comparing them to similar published work on sharing-based, homophily- and content-oriented measures, showing consistently high correlation (>0.87). We then apply this metric to individual links from several popular news domains and demonstrate how one can assess link-level distributions of ideological audiences. We further show this estimator is robust to the selection of engagement types besides sharing, where domain-level audience-ideology assessments based on views and likes show no significant difference compared to sharing-based estimates. Estimates of partisanship, however, suggest the viewing audience is more moderate than the audiences who share and like these domains. Beyond providing thresholds on sufficient activity for measuring audience ideology and comparing three types of engagement, this analysis provides a blueprint for ensuring the robustness of future work to differential privacy protections. Cody Buntain, Richard Bonneau, Jonathan Nagler, Joshua A. Tucker Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22127 Fri, 02 Jun 2023 00:00:00 -0700 RTANet: Recommendation Target-Aware Network Embedding https://ojs.aaai.org/index.php/ICWSM/article/view/22128 Network embedding is a process of encoding nodes into latent vectors by preserving network structure and content information.
It is used in various applications, especially in recommender systems. In a social network setting, when recommending new friends to a user, the similarity between the user's embedding and that of the target friend is examined. Traditional methods generate user node embeddings without considering the recommendation target: no matter which target is to be recommended, the same embedding vector is generated for that particular user. This approach has its limitations. For example, a user can be both a computer scientist and a musician. When recommending music friends with potentially the same taste to this user, we want a representation that is useful for recommending music friends rather than computer scientists. The corresponding embedding should reflect the user's musical features rather than those associated with computer science, with the awareness that the recommendation targets are music friends. To address this issue, we propose a new framework, which we name the Recommendation Target-Aware Network embedding method (RTANet). Herein, the embedding of each user is no longer fixed to a constant vector but can vary according to the specific recommendation target. Concretely, RTANet assigns a different attention weight to each neighbour node, allowing us to obtain the user's context information aggregated from its neighbours before transforming this context into its embedding. Different from other graph attention approaches, the attention weights in our work measure the similarity between each of the user's neighbour nodes and the target node, which in turn generates the target-aware embedding. To demonstrate the effectiveness of our method, we compare RTANet with several state-of-the-art network embedding methods on four real-world datasets and show that RTANet outperforms the comparison methods in recommendation tasks.
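The target-aware aggregation idea in the RTANet abstract above can be illustrated with a minimal Python sketch: attention weights come from each neighbour's similarity to the recommendation target, so the same user receives different embeddings for different targets. The two-dimensional toy features and plain dot-product softmax attention are illustrative assumptions, not the authors' implementation.

```python
import math

def target_aware_embedding(neighbour_feats, target_feat):
    """Toy sketch of target-aware aggregation: attention weights are
    similarities between each neighbour and the recommendation target,
    so the same user gets different embeddings for different targets."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Attention score per neighbour = similarity to the target node.
    scores = [dot(f, target_feat) for f in neighbour_feats]
    # Softmax-normalise the scores into attention weights.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Context = attention-weighted sum of neighbour features.
    dim = len(target_feat)
    return [sum(w * f[i] for w, f in zip(weights, neighbour_feats))
            for i in range(dim)]

# Hypothetical neighbour features for one user (rows) and two targets.
neighbours = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
emb_music = target_aware_embedding(neighbours, [1.0, 0.0])  # "music" target
emb_cs = target_aware_embedding(neighbours, [0.0, 1.0])     # "CS" target
print(emb_music != emb_cs)  # True: the embedding varies with the target
```

Unlike a fixed node embedding, the weighted sum above shifts toward the neighbours most similar to whichever target is being considered.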
Qimeng Cao, Qing Yin, Yunya Song, Zhihua Wang, Yujun Chen, Richard Yi Da Xu, Xian Yang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22128 Fri, 02 Jun 2023 00:00:00 -0700 Recipe Networks and the Principles of Healthy Food on the Web https://ojs.aaai.org/index.php/ICWSM/article/view/22129 People increasingly use the Internet to make food-related choices, prompting research on food recommendation systems. Recently, works that incorporate nutritional constraints into the recommendation process have been proposed to promote healthier recipes. Ingredient substitution is also used, particularly by people motivated to reduce the intake of a specific nutrient or to avoid a particular category of ingredients due, for instance, to allergies. This study takes a complementary approach towards empowering people to make healthier food choices by simplifying the process of identifying plausible recipe substitutions. To achieve this goal, this work constructs a large-scale network of similar recipes and analyzes this network to reveal interesting properties that have important implications for the development of food recommendation systems. Charalampos Chelmis, Bedirhan Gergin Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22129 Fri, 02 Jun 2023 00:00:00 -0700 Partisan US News Media Representations of Syrian Refugees https://ojs.aaai.org/index.php/ICWSM/article/view/22130 We investigate how representations of Syrian refugees (2011-2021) differ across US partisan news outlets. We analyze 47,388 articles from the online US media about Syrian refugees to detail differences in reporting between left- and right-leaning media. We use various NLP techniques to understand these differences.
Our polarization and question answering results indicated that left-leaning media tended to represent refugees as child victims who are welcome in the US, while right-leaning media cast refugees as Islamic terrorists. We noted similar results with our sentiment and offensive speech scores over time, which detail possibly unfavorable representations of refugees in right-leaning media. A strength of our work is how the different techniques we have applied validate each other. Based on our results, we provide several recommendations. Stakeholders may utilize our findings to intervene around refugee representations and design communications campaigns that improve the way society sees refugees and possibly aid refugee outcomes. Keyu Chen, Marzieh Babaeianjelodar, Yiwen Shi, Kamila Janmohamed, Rupak Sarkar, Ingmar Weber, Thomas Davidson, Munmun De Choudhury, Jonathan Huang, Shweta Yadav, Ashiqur KhudaBukhsh, Chris T Bauch, Preslav Nakov, Orestis Papakyriakopoulos, Koustuv Saha, Kaveh Khoshnood, Navin Kumar Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22130 Fri, 02 Jun 2023 00:00:00 -0700 DiPPS: Differentially Private Propensity Scores for Bias Correction https://ojs.aaai.org/index.php/ICWSM/article/view/22131 In surveys, it is typically up to the individuals to decide if they want to participate or not, which leads to participation bias: the individuals willing to share their data might not be representative of the entire population. Similarly, there are cases where one does not have direct access to any data of the target population and has to resort to publicly available proxy data sampled from a different distribution. In this paper, we present Differentially Private Propensity Scores for Bias Correction (DiPPS), a method for approximating the true data distribution of interest in both of the above settings.
We assume that the data analyst has access to a dataset D' that was sampled from the distribution of interest in a biased way. As individuals may be more willing to share their data when given a privacy guarantee, we further assume that the analyst is allowed locally differentially private access to a set of samples D from the true, unbiased distribution. Each data point from the private, unbiased dataset D is mapped to a probability distribution over clusters (learned from the biased dataset D'), from which a single cluster is sampled via the exponential mechanism and shared with the data analyst. This way, the analyst gathers a distribution over clusters, which they use to compute propensity scores for the points in the biased D', which are in turn used to reweight the points in D' to approximate the true data distribution. It is now possible to compute any function on the resulting reweighted dataset without further access to the private D. In experiments on datasets from various domains, we show that DiPPS successfully brings the distribution of the available dataset closer to the distribution of interest in terms of Wasserstein distance. We further show that this results in improved estimates for different statistics, in many cases even outperforming differential privacy mechanisms that are specifically designed for these statistics. Liangwei Chen, Valentin Hartmann, Robert West Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22131 Fri, 02 Jun 2023 00:00:00 -0700 Getting Back on Track: Understanding COVID-19 Impact on Urban Mobility and Segregation with Location Service Data https://ojs.aaai.org/index.php/ICWSM/article/view/22132 Understanding the impact of COVID-19 on urban life rhythms is crucial for accelerating the return-to-normal progress and envisioning more resilient and inclusive cities. 
While previous studies either depended on small-scale surveys or focused on the response to initial lockdowns, this paper uses large-scale location service data to systematically analyze the urban mobility behavior changes across three distinct phases of the pandemic, i.e., pre-pandemic, lockdown, and reopening. Our analyses reveal two typical patterns that govern the mobility behavior changes in most urban venues: daily life-centered urban venues go through smaller mobility drops during the lockdown and more rapid recovery after reopening, while work-centered urban venues suffer from more significant mobility drops that are likely to persist even after reopening. Such mobility behavior changes exert deeper impacts on the underlying social fabric, where the level of mobility reduction is positively correlated with the experienced segregation at that urban venue. Therefore, urban venues undergoing more mobility reduction are also increasingly filled with people from homogeneous socio-demographic backgrounds. Moreover, mobility behavior changes display significant heterogeneity across geographical regions, which can be largely explained by partisan inclination at the state level. Our study shows the vast potential of location service data in deriving a timely and comprehensive understanding of the social dynamics in urban space, which is valuable for informing the gradual transition back to the normal lifestyle in a “post-pandemic era”. Lin Chen, Fengli Xu, Qianyue Hao, Pan Hui, Yong Li Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22132 Fri, 02 Jun 2023 00:00:00 -0700 What Are You Anxious About? Examining Subjects of Anxiety during the COVID-19 Pandemic https://ojs.aaai.org/index.php/ICWSM/article/view/22133 COVID-19 poses disproportionate mental health consequences to the public during different phases of the pandemic.
We use a computational approach to capture the specific aspects that trigger the public's anxiety about the pandemic and investigate how these aspects change over time. First, we identified nine subjects of anxiety (SOAs) in a sample of Reddit posts (N=86) from r/COVID19_support using the thematic analysis approach. Then, we quantified Reddit users' anxiety by training algorithms on a manually annotated sample (N=793) to annotate the SOAs in a larger chronological sample (N=6,535). The nine SOAs align with items in various recently developed pandemic anxiety measurement scales. We observed that Reddit users' concerns about health risks remained high during the first eight months after the pandemic started. These concerns diminished dramatically despite later surges in cases. In general, users' language disclosing the SOAs became less intense as the pandemic progressed. However, worries about mental health and the future steadily increased throughout the period covered in this study. People also tended to use more intense language to describe mental health concerns than health risk or death concerns. Our results suggest that the public's mental health condition does not necessarily improve even as COVID-19 gradually weakens as a health threat due to appropriate countermeasures. Our system lays the groundwork for population health and epidemiology scholars to examine aspects that provoke pandemic anxiety in a timely fashion. Lucia L. Chen, Steven R. Wilson, Sophie Lohmann, Daniela V. Negraia Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22133 Fri, 02 Jun 2023 00:00:00 -0700 Analyzing the Engagement of Social Relationships during Life Event Shocks in Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/22134 Individuals experiencing unexpected distressing events, or shocks, often rely on their social network for support.
While prior work has shown how social networks respond to shocks, these studies usually treat all ties equally, despite differences in the support provided by different social relationships. Here, we conduct a computational analysis on Twitter that examines how responses to online shocks differ by the relationship type of a user dyad. We introduce a new dataset of over 13K instances of individuals self-reporting shock events on Twitter and construct networks of relationship-labeled dyadic interactions around these events. By examining behaviors across 110K replies to shocked users in a pseudo-causal analysis, we demonstrate relationship-specific patterns in response levels and topic shifts. We also show that while well-established social dimensions of closeness such as tie strength and structural embeddedness contribute to shock responsiveness, the degree of impact is highly dependent on relationship and shock types. Our findings indicate that social relationships contain highly distinctive characteristics in network interactions, and that relationship-specific behaviors in online shock responses are distinct from those of offline settings. Minje Choi, David Jurgens, Daniel M. Romero Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22134 Fri, 02 Jun 2023 00:00:00 -0700 Same Words, Different Meanings: Semantic Polarization in Broadcast Media Language Forecasts Polarity in Online Public Discourse https://ojs.aaai.org/index.php/ICWSM/article/view/22135 With the growth of online news over the past decade, empirical studies on political discourse and news consumption have focused on the phenomena of filter bubbles and echo chambers.
Yet recently, scholars have revealed limited evidence around the impact of such phenomena, leading some to argue that partisan segregation across news audiences cannot be fully explained by online news consumption alone and that the role of traditional legacy media may be as salient in polarizing public discourse around current events. In this work, we expand the scope of analysis to include both online and more traditional media by investigating the relationship between broadcast news media language and social media discourse. By analyzing a decade’s worth of closed captions (2.1 million speaker turns) from CNN and Fox News along with topically corresponding discourse from Twitter, we provide a novel framework for measuring semantic polarization between America’s two major broadcast networks to demonstrate how semantic polarization between these outlets has evolved (Study 1), peaked (Study 2), and influenced partisan discussions on Twitter (Study 3) across the last decade. Our results demonstrate a sharp increase in polarization in how topically important keywords are discussed between the two channels, especially after 2016, with the overall highest peaks occurring in 2020. The two stations discuss identical topics in drastically distinct contexts in 2020, to the extent that there is barely any linguistic overlap in how identical keywords are contextually discussed. Further, we demonstrate, at scale, how such partisan division in broadcast media language significantly shapes semantic polarity trends on Twitter (and vice versa), empirically linking, for the first time, how online discussions are influenced by televised media. We show how the language characterizing opposing media narratives about similar news events on TV can increase levels of partisan discourse online. To this end, our work has implications for how media polarization on TV plays a significant role in impeding rather than supporting online democratic discourse.
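As a crude illustration of what "barely any linguistic overlap" in keyword contexts can mean, the sketch below computes the Jaccard overlap of the words surrounding a shared keyword in two corpora. The toy transcripts and the window-based measure are hypothetical simplifications, not the semantic-polarization framework of the paper above.

```python
def context_overlap(corpus_a, corpus_b, keyword, window=2):
    """Jaccard overlap of the words surrounding `keyword` in two corpora.
    A value near 0 means the two sources discuss the keyword in almost
    entirely different contexts; near 1 means near-identical contexts."""
    def contexts(corpus):
        words = corpus.lower().split()
        ctx = set()
        for i, w in enumerate(words):
            if w == keyword:
                # Collect up to `window` words on each side of the keyword.
                ctx.update(words[max(0, i - window):i])
                ctx.update(words[i + 1:i + 1 + window])
        return ctx

    a, b = contexts(corpus_a), contexts(corpus_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical toy transcripts mentioning the same keyword.
cnn = "officials say the mask mandate protects public health"
fox = "critics say the mask mandate restricts personal freedom"
print(round(context_overlap(cnn, fox, "mask"), 2))  # → 0.6
```

In the toy example the two sources share three of five distinct context words around "mask"; with real transcripts and larger windows, scores approaching zero would correspond to the paper's finding of near-disjoint contexts in 2020.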
Xiaohan Ding, Michael Horning, Eugenia H. Rho Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22135 Fri, 02 Jun 2023 00:00:00 -0700 Catch Me If You Can: Deceiving Stance Detection and Geotagging Models to Protect Privacy of Individuals on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22136 The recent advances in natural language processing have yielded many exciting developments in text analysis and language understanding models; however, these models can also be used to track people, raising severe privacy concerns. In this work, we investigate what individuals can do to avoid being detected by those models while using social media platforms. We ground our investigation in two exposure-risky tasks, stance detection and geotagging. We explore a variety of simple techniques for modifying text, such as inserting typos in salient words, paraphrasing, and adding dummy social media posts. Our experiments show that the performance of BERT-based models fine-tuned for stance detection decreases significantly due to typos, but it is not affected by paraphrasing. Moreover, we find that typos have minimal impact on state-of-the-art geotagging models due to their increased reliance on social networks; however, we show that users can deceive those models by interacting with different users, reducing their performance by almost 50%. Dilara Dogan, Bahadir Altun, Muhammed Said Zengin, Mucahid Kutlu, Tamer Elsayed Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22136 Fri, 02 Jun 2023 00:00:00 -0700 We Are in This Together: Quantifying Community Subjective Wellbeing and Resilience https://ojs.aaai.org/index.php/ICWSM/article/view/22137 The COVID-19 pandemic disrupted everyone's life across the world.
In this work, we characterize the subjective wellbeing patterns of 112 cities across the United States during the pandemic prior to vaccine availability, as exhibited in subreddits corresponding to the cities. We quantify subjective wellbeing using positive and negative affect. We then measure the pandemic's impact by comparing a community's observed wellbeing with its expected wellbeing, as forecasted by time series models derived from data prior to the pandemic. We show that general community traits reflected in language can be predictive of community resilience. We predict how the pandemic would impact the wellbeing of each community based on linguistic and interaction features from normal times before the pandemic. We find that communities with interaction characteristics corresponding to more closely connected users and higher engagement were less likely to be significantly impacted. Notably, we find that communities that talked more about social ties normally experienced in-person, such as friends, family, and affiliations, were actually more likely to be impacted. Additionally, we also use the same features to predict how quickly each community would recover after the initial onset of the pandemic. We similarly find that communities that talked more about family, affiliations, and identifying as part of a group had a slower recovery. MeiXing Dong, Ruixuan Sun, Laura Biester, Rada Mihalcea Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22137 Fri, 02 Jun 2023 00:00:00 -0700 Non-polar Opposites: Analyzing the Relationship between Echo Chambers and Hostile Intergroup Interactions on Reddit https://ojs.aaai.org/index.php/ICWSM/article/view/22138 Previous research has documented the existence of both online echo chambers and hostile intergroup interactions. 
In this paper, we explore the relationship between these two phenomena by studying the activity of 5.97M Reddit users and 421M comments posted over 13 years. We examine whether users who are more engaged in echo chambers are more hostile when they comment on other communities. We then create a typology of relationships between political communities based on whether their users are toxic to each other, whether echo chamber-like engagement with these communities has a polarizing effect, and on the communities' political leanings. We observe both the echo chamber and hostile intergroup interaction phenomena, but neither holds universally across communities. Contrary to popular belief, we find that polarizing and toxic speech is more dominant between communities on the same, rather than opposing, sides of the political spectrum, especially on the left; however, this mostly points to the collective targeting of political outgroups. Alexandros Efstratiou, Jeremy Blackburn, Tristan Caulfield, Gianluca Stringhini, Savvas Zannettou, Emiliano De Cristofaro Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22138 Fri, 02 Jun 2023 00:00:00 -0700 Misleading Repurposing on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22139 We present the first in-depth and large-scale study of misleading repurposing, in which a malicious user changes the identity of their social media account via, among other things, changes to the profile attributes in order to use the account for a new purpose while retaining their followers. We propose a definition for the behavior and a methodology that uses supervised learning on data mined from the Internet Archive's Twitter Stream Grab to flag repurposed accounts. We found over 100,000 accounts that may have been repurposed. Of those, 28% were removed from the platform after 2 years, thereby confirming their inauthenticity. 
We also characterize repurposed accounts and find that they are more likely to be repurposed after a period of inactivity and deleting old tweets. We also provide evidence that adversaries target accounts with high follower counts to repurpose, and that some inflate accounts' follower counts by participating in follow-back schemes. The results we present have implications for the security and integrity of social media platforms, for data science studies in how historical data is considered, and for society at large in how users can be deceived about the popularity of an opinion. The data and code are available at https://github.com/tugrulz/MisleadingRepurposing. Tuğrulcan Elmas, Rebekah Overdorf, Karl Aberer Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22139 Fri, 02 Jun 2023 00:00:00 -0700 Scope of Pre-trained Language Models for Detecting Conflicting Health Information https://ojs.aaai.org/index.php/ICWSM/article/view/22140 An increasing number of people now rely on online platforms to meet their health information needs. Thus, identifying inconsistent or conflicting textual health information has become a safety-critical task. Health advice data poses a unique challenge where information that is accurate in the context of one diagnosis can be conflicting in the context of another. For example, people suffering from diabetes and hypertension often receive conflicting health advice on diet. This motivates the need for technologies which can provide contextualized, user-specific health advice. A crucial step towards contextualized advice is the ability to compare health advice statements and detect if and how they are conflicting. This is the task of health conflict detection (HCD). Given two pieces of health advice, the goal of HCD is to detect and categorize the type of conflict. 
It is a challenging task, as (i) automatically identifying and categorizing conflicts requires a deeper understanding of the semantics of the text, and (ii) the amount of available data is quite limited. In this study, we are the first to explore HCD in the context of pre-trained language models. We find that DeBERTa-v3 performs best with a mean F1 score of 0.68 across all experiments. We additionally investigate the challenges posed by different conflict types and how synthetic data improves a model's understanding of conflict-specific semantics. Finally, we highlight the difficulty in collecting real health conflicts and propose a human-in-the-loop synthetic data augmentation approach to expand existing HCD datasets. Our HCD training dataset is over twice the size of the existing HCD dataset and is made publicly available on GitHub. Joseph Gatto, Madhusudan Basak, Sarah Masud Preum Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22140 Fri, 02 Jun 2023 00:00:00 -0700 Author as Character and Narrator: Deconstructing Personal Narratives from the r/AmITheAsshole Reddit Community https://ojs.aaai.org/index.php/ICWSM/article/view/22141 In the r/AmITheAsshole subreddit, people anonymously share first person narratives that contain some moral dilemma or conflict and ask the community to judge who is at fault (i.e., who is "the asshole"). These first person narratives are, in general, a unique storytelling domain where the author is not only the narrator (the person telling the story) but is also a character (the person living the story) and, thus, the author has two distinct voices presented in the story. In this study, we identify linguistic and narrative features associated with the author as the character or as a narrator. We use these features to answer the following questions: (1) what makes an asshole character and (2) what makes an asshole narrator? 
We extract both Author-as-Character features (e.g., demographics, narrative event chain, and emotional arc) and Author-as-Narrator features (i.e., the style and emotion of the story as a whole) in order to identify which aspects of the narrative are correlated with the final moral judgment. Our work shows that "assholes" as Characters frame themselves as lacking agency with a more positive personal arc, while "assholes" as Narrators will tell emotional and opinionated stories. Salvatore Giorgi, Ke Zhao, Alexander H. Feng, Lara J. Martin Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22141 Fri, 02 Jun 2023 00:00:00 -0700 Google the Gatekeeper: How Search Components Affect Clicks and Attention https://ojs.aaai.org/index.php/ICWSM/article/view/22142 The contemporary Google Search Engine Results Page (SERP) supplements classic blue hyperlinks with complex components. These components produce tensions between searchers, 3rd-party websites, and Google itself over clicks and attention. In this study, we examine 12 SERP components from two categories: (1) extracted results (e.g., featured-snippets) and (2) Google Services (e.g., shopping-ads) to determine their effect on people’s behavior. We measure behavior with two variables: (1) click-through rate (CTR) to Google’s own domains versus 3rd-party domains and (2) time spent on the SERP. We apply causal inference methods to an ecologically valid trace dataset comprising 477,485 SERPs from 1,756 participants. We find that multiple components substantially increase CTR to Google domains, while others decrease CTR and increase time on the SERP. These findings may inform efforts to regulate the design of powerful intermediary platforms like Google. Jeffrey Gleason, Desheng Hu, Ronald E. 
Robertson, Christo Wilson Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22142 Fri, 02 Jun 2023 00:00:00 -0700 Understanding and Detecting Hateful Content Using Contrastive Learning https://ojs.aaai.org/index.php/ICWSM/article/view/22143 The spread of hate speech and hateful imagery on the Web is a significant problem that needs to be mitigated to improve our Web experience. This work contributes to research efforts to detect and understand hateful content on the Web by undertaking a multimodal analysis of Antisemitism and Islamophobia on 4chan’s /pol/ using OpenAI’s CLIP. This large pre-trained model uses the Contrastive Learning paradigm. We devise a methodology to identify a set of Antisemitic and Islamophobic hateful textual phrases using Google’s Perspective API and manual annotations. Then, we use OpenAI’s CLIP to identify images that are highly similar to our Antisemitic/Islamophobic textual phrases. By running our methodology on a dataset that includes 66M posts and 5.8M images shared on 4chan’s /pol/ for 18 months, we detect 173K posts containing 21K Antisemitic/Islamophobic images and 246K posts that include 420 hateful phrases. Among other things, we find that we can use OpenAI’s CLIP model to detect hateful content with an accuracy score of 0.81 (F1 score = 0.54). By comparing CLIP with two baselines proposed by the literature, we find that CLIP outperforms them, in terms of accuracy, precision, and F1 score, in detecting Antisemitic/Islamophobic images. Also, we find that Antisemitic/Islamophobic imagery is shared in a similar number of posts on 4chan’s /pol/ compared to Antisemitic/Islamophobic textual phrases, highlighting the need to design more tools for detecting hateful imagery. 
Finally, we make available (upon request) a dataset of 246K posts containing 420 Antisemitic/Islamophobic phrases and 21K likely Antisemitic/Islamophobic images (automatically detected by CLIP) that can assist researchers in further understanding Antisemitism and Islamophobia. Felipe González-Pizarro, Savvas Zannettou Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22143 Fri, 02 Jun 2023 00:00:00 -0700 SciLander: Mapping the Scientific News Landscape https://ojs.aaai.org/index.php/ICWSM/article/view/22144 The COVID-19 pandemic has fueled the spread of misinformation on social media and the Web as a whole. The phenomenon dubbed `infodemic' has taken the challenges of information veracity and trust to new heights by massively introducing seemingly scientific and technical elements into misleading content. Despite the existing body of work on modeling and predicting misinformation, the coverage of very complex scientific topics with inherent uncertainty and an evolving set of findings, such as COVID-19, provides many new challenges that are not easily solved by existing tools. To address these issues, we introduce SciLander, a method for learning representations of news sources reporting on science-based topics. We extract four heterogeneous indicators for the sources; two generic indicators that capture (1) the copying of news stories between sources, and (2) the use of the same terms to mean different things (semantic shift), and two scientific indicators that capture (1) the usage of jargon and (2) the stance towards specific citations. We use these indicators as signals of source agreement, sampling pairs of positive (similar) and negative (dissimilar) samples, and combine them in a unified framework to train unsupervised news source embeddings with a triplet margin loss objective. 
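The triplet margin loss objective mentioned in the SciLander abstract above can be sketched generically as follows; this is a minimal NumPy illustration of the standard definition, not the authors' implementation (names and toy embeddings are hypothetical):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the anchor embedding toward the positive
    (similar source) and pushes it away from the negative (dissimilar
    source) until they are separated by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])  # anchor source embedding
p = np.array([0.1, 0.0])  # a similar (agreeing) source
n = np.array([3.0, 0.0])  # a dissimilar source
print(triplet_margin_loss(a, p, n))  # 0.0: already separated by > margin
```

In training, such a loss would be minimized over many sampled (anchor, positive, negative) triples, with the source-agreement indicators deciding which pairs count as positive or negative.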
We evaluate our method on a novel COVID-19 dataset containing nearly 1M news articles from 500 sources spanning a period of 18 months since the beginning of the pandemic in 2020. Our results show that the features learned by our model outperform state-of-the-art baseline methods on the task of news veracity classification. Furthermore, a clustering analysis suggests that the learned representations encode information about the reliability, political leaning, and partisanship bias of these sources. Maurício Gruppi, Panayiotis Smeros, Sibel Adalı, Carlos Castillo, Karl Aberer Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22144 Fri, 02 Jun 2023 00:00:00 -0700 A Data Fusion Framework for Multi-Domain Morality Learning https://ojs.aaai.org/index.php/ICWSM/article/view/22145 Language models can be trained to recognize the moral sentiment of text, creating new opportunities to study the role of morality in human life. As interest in language and morality has grown, several ground truth datasets with moral annotations have been released. However, these datasets vary in the method of data collection, domain, topics, instructions for annotators, etc. Simply aggregating such heterogeneous datasets during training can yield models that fail to generalize well. We describe a data fusion framework for training on multiple heterogeneous datasets that improves performance and generalizability. The model uses domain adversarial training to align the datasets in feature space and a weighted loss function to deal with label shift. We show that the proposed framework achieves state-of-the-art performance on different datasets compared to prior work on morality inference. 
Siyi Guo, Negar Mokhberian, Kristina Lerman Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22145 Fri, 02 Jun 2023 00:00:00 -0700 Representing and Determining Argumentative Relevance in Online Discussions: A General Approach https://ojs.aaai.org/index.php/ICWSM/article/view/22146 Understanding an online argumentative discussion is essential for understanding users' opinions on a topic and their underlying reasoning. A key challenge in determining completeness and persuasiveness of argumentative discussions is to assess how arguments under a topic are connected in a logical and coherent manner. Online argumentative discussions, in contrast to essays or face-to-face communication, challenge techniques for judging argument relevance because online discussions involve multiple participants and often exhibit incoherence in reasoning and inconsistencies in writing style. We define relevance as the logical and topical connections between small texts representing argument fragments in online discussions. We provide a corpus comprising pairs of sentences, labeled with argumentative relevance between the sentences in each pair. We propose a computational approach relying on content reduction and a Siamese neural network architecture for modeling argumentative connections and determining argumentative relevance between texts. Experimental results indicate that our approach is effective in measuring relevance between arguments, and outperforms strong and well-adopted baselines. Further analysis demonstrates the benefit of using our argumentative relevance encoding on a downstream task, predicting how impactful an online comment is on a certain topic, compared to an encoding that does not consider logical connections. Zhen Guo, Munindar P. 
Singh Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22146 Fri, 02 Jun 2023 00:00:00 -0700 The Morbid Realities of Social Media: An Investigation into the Narratives Shared by the Deceased Victims of COVID-19 https://ojs.aaai.org/index.php/ICWSM/article/view/22147 Social media platforms have had considerable impact on the real world, especially during the Covid-19 pandemic. Problematic narratives related to Covid-19 might have caused significant impact on the population, particularly due to their association with dangerous beliefs such as anti-vaccination and Covid denial. In this work, we study a unique dataset of Facebook posts by users who shared and believed in such narratives before succumbing to Covid-19, often resulting in death. We aim to characterize the dominant themes and sources present in the victims' posts along with identifying the role of the platform in handling deadly narratives. Our analysis reveals the overwhelming politicization of Covid-19 through the prevalence of anti-government themes propagated by the right-wing political and media ecosystem. Furthermore, we highlight the efforts of Facebook's implementation of soft moderation actions intended to warn users of misinformation. Results from this study bring insights into the responsibility of political elites in shaping public discourse and the platform's role in dampening the reach of harmful narratives. Hussam Habib, Rishab Nithyanand Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22147 Fri, 02 Jun 2023 00:00:00 -0700 Motif-Based Exploratory Data Analysis for State-Backed Platform Manipulation on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22148 State-backed platform manipulation (SBPM) on Twitter has been a prominent public issue since the 2016 US election cycle. 
Identifying and characterizing users on Twitter as belonging to a state-backed campaign is an important part of mitigating their influence. In this paper, we propose a novel time series feature grounded in social science to characterize dynamic user networks on Twitter. We introduce a classification approach, motif functional data analysis (MFDA), that captures the evolution of motifs in temporal networks, which is a useful feature for analyzing malign influence. We evaluate MFDA on data from known SBPM campaigns on Twitter and representative authentic data and compare performance to other classification methods. To further leverage our dynamic feature, we use the changes in network structure captured by motifs to help uncover real-world events using anomaly detection. Khuzaima Hameed, Rob Johnston, Brent Younce, Minh Tang, Alyson Wilson Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22148 Fri, 02 Jun 2023 00:00:00 -0700 Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War on Reddit https://ojs.aaai.org/index.php/ICWSM/article/view/22149 In the buildup to and in the weeks following the Russian Federation’s invasion of Ukraine, Russian state media outlets output torrents of misleading and outright false information. In this work, we study this coordinated information campaign in order to understand the most prominent state media narratives touted by the Russian government to English-speaking audiences. To do this, we first perform sentence-level topic analysis using the large-language model MPNet on articles published by ten different pro-Russian propaganda websites including the new Russian “fact-checking” website waronfakes.com. Within this ecosystem, we show that smaller websites like katehon.com were highly effective at publishing topics that were later echoed by other Russian sites. 
After analyzing this set of Russian information narratives, we then analyze their correspondence with narratives and topics of discussion on r/Russia and 10 other political subreddits. Using MPNet and a semantic search algorithm, we map these subreddits’ comments to the set of topics extracted from our set of Russian websites, finding that 39.6% of r/Russia comments corresponded to narratives from pro-Russian propaganda websites compared to 8.86% on r/politics. Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22149 Fri, 02 Jun 2023 00:00:00 -0700 "A Special Operation": A Quantitative Approach to Dissecting and Comparing Different Media Ecosystems’ Coverage of the Russo-Ukrainian War https://ojs.aaai.org/index.php/ICWSM/article/view/22150 The coverage of the Russian invasion of Ukraine has varied widely between Western, Russian, and Chinese media ecosystems with propaganda, disinformation, and narrative spins present in all three. By utilizing the normalized pointwise mutual information metric, differential sentiment analysis, word2vec models, and partially labeled Dirichlet allocation, we present a quantitative analysis of the differences in coverage amongst these three news ecosystems. We find that while the Western press outlets have focused on the military and humanitarian aspects of the war, Russian media have focused on the purported justifications for the “special military operation” such as the presence in Ukraine of “bio-weapons” and “neo-nazis”, and Chinese news media have concentrated on the conflict’s diplomatic and economic consequences. Detecting the presence of several Russian disinformation narratives in the articles of several Chinese media outlets, we finally measure the degree to which Russian media has influenced Chinese coverage across Chinese outlets’ news articles, Weibo accounts, and Twitter accounts. 
Our analysis indicates that since the Russian invasion of Ukraine, Chinese state media outlets have increasingly cited Russian outlets as news sources and spread Russian disinformation narratives. Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22150 Fri, 02 Jun 2023 00:00:00 -0700 The Geography of Facebook Groups in the United States https://ojs.aaai.org/index.php/ICWSM/article/view/22151 We present a de-identified and aggregated dataset based on geographical patterns of Facebook Groups usage and demonstrate its association with measures of social capital. The dataset is aggregated at United States county level. Established spatial measures of social capital are known to vary across US counties. Their availability and recency depend on running costly surveys. We examine to what extent a dataset based on usage patterns of Facebook Groups, which can be generated at regular intervals, could be used as a partial proxy by capturing local online associations. We identify four main latent factors that distinguish Facebook group engagement by county, obtained by exploratory factor analysis. The first captures small and private groups, dense with friendship connections. The second captures very local and small groups. The third captures non-local, large, public groups, with more age mixing. The fourth captures partially local groups of medium to large size. Only two of these factors, the first and third, correlate with offline community level social capital measures, while the second and fourth do not. Together and individually, the factors are predictive of offline social capital measures, even controlling for various demographic attributes of the counties. To our knowledge, this is the first systematic test of the association between offline regional social capital and patterns of online community engagement in the same regions. 
By making the dataset available to the research community, we hope to contribute to ongoing studies of social capital. Amaç Herdağdelen, Lada Adamic, Bogdan State Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22151 Fri, 02 Jun 2023 00:00:00 -0700 Quotatives Indicate Decline in Objectivity in U.S. Political News https://ojs.aaai.org/index.php/ICWSM/article/view/22152 According to journalistic standards, direct quotes should be attributed to sources with objective quotatives such as "said" and "told," since nonobjective quotatives, e.g., "argued" and "insisted," would influence the readers' perception of the quote and the quoted person. In this paper, we analyze the adherence to this journalistic norm to study trends in objectivity in political news across U.S. outlets of different ideological leanings. We ask: 1) How has the usage of nonobjective quotatives evolved? 2) How do news outlets use nonobjective quotatives when covering politicians of different parties? To answer these questions, we developed a dependency-parsing-based method to extract quotatives and applied it to Quotebank, a web-scale corpus of attributed quotes, obtaining nearly 7 million quotes, each enriched with the quoted speaker's political party and the ideological leaning of the outlet that published the quote. We find that, while partisan outlets are the ones that most often use nonobjective quotatives, between 2013 and 2020, the outlets that increased their usage of nonobjective quotatives the most were "moderate" centrist news outlets (around 0.6 percentage points, or 20% in relative percentage over seven years). 
Further, we find that outlets use nonobjective quotatives more often when quoting politicians of the opposing ideology (e.g., left-leaning outlets quoting Republicans) and that this "quotative bias" is rising at a swift pace, increasing up to 0.5 percentage points, or 25% in relative percentage, per year. These findings suggest an overall decline in journalistic objectivity in U.S. political news. Tiancheng Hu, Manoel Horta Ribeiro, Robert West, Andreas Spitz Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22152 Fri, 02 Jun 2023 00:00:00 -0700 Information Retention in the Multi-Platform Sharing of Science https://ojs.aaai.org/index.php/ICWSM/article/view/22153 The public interest in accurate scientific communication, underscored by recent public health crises, highlights how content often loses critical pieces of information as it spreads online. However, multi-platform analyses of this phenomenon remain limited due to challenges in data collection. Collecting mentions of research tracked by Altmetric LLC, we examine information retention in the over 4 million online posts referencing 9,765 of the most-mentioned scientific articles across blog sites, Facebook, news sites, Twitter, and Wikipedia. To do so, we present a burst-based framework for examining online discussions about science over time and across different platforms. To measure information retention, we develop a keyword-based computational measure comparing an online post to the scientific article's abstract. We evaluate our measure using ground truth data labeled by experts within the field. We highlight three main findings: first, we find a strong tendency towards low levels of information retention, following a distinct trajectory of loss except when bursts of attention begin in social media. Second, platforms show significant differences in information retention. 
Third, sequences involving more platforms tend to be associated with higher information retention. These findings highlight a strong tendency towards information loss over time, posing a critical concern for researchers, policymakers, and citizens alike, but suggest that multi-platform discussions may improve information retention overall. Sohyeon Hwang, Emőke-Ágnes Horvát, Daniel M. Romero Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22153 Fri, 02 Jun 2023 00:00:00 -0700 Measuring Belief Dynamics on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22154 There is growing concern about misinformation and the role online media plays in social polarization. Analyzing belief dynamics is one way to enhance our understanding of these problems. Existing analytical tools, such as survey research or stance detection, lack the power to correlate contextual factors with population-level changes in belief dynamics. In this exploratory study, I present the Belief Landscape Framework, which uses data about people's professed beliefs in an online setting to measure belief dynamics with more temporal granularity than previous methods. I apply the approach to conversations about climate change on Twitter and provide initial validation by comparing the method's output to a set of hypotheses drawn from the literature on dynamic systems. My analysis indicates that the method is relatively robust to different parameter settings, and results suggest that 1) there are many stable configurations of belief on the polarizing issue of climate change and 2) that people move in predictable ways around these points. The method paves the way for more powerful tools that can be used to understand how the modern digital media ecosystem impacts collective belief dynamics and what role misinformation plays in that process. 
Joshua Introne Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22154 Fri, 02 Jun 2023 00:00:00 -0700 Lady and the Tramp Nextdoor: Online Manifestations of Real-World Inequalities in the Nextdoor Social Network https://ojs.aaai.org/index.php/ICWSM/article/view/22155 From health to education, income impacts a huge range of life choices. Earlier research has leveraged data from online social networks to study precisely this impact. In this paper, we ask the opposite question: do different levels of income result in different online behaviors? We demonstrate it does. We present the first large-scale study of Nextdoor, a popular location-based social network. We collect 2.6 Million posts from 64,283 neighborhoods in the United States and 3,325 neighborhoods in the United Kingdom, to examine whether online discourse reflects the income and income inequality of a neighborhood. We show that posts from neighborhoods with different incomes indeed differ, e.g. richer neighborhoods have a more positive sentiment and discuss crimes more, even though their actual crime rates are much lower. We then show that user-generated content can predict both income and inequality. We train multiple machine learning models and predict both income (R2=0.841) and inequality (R2=0.77). Waleed Iqbal, Vahid Ghafouri, Gareth Tyson, Guillermo Suarez-Tangil, Ignacio Castro Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22155 Fri, 02 Jun 2023 00:00:00 -0700 Weakly Supervised Learning for Analyzing Political Campaigns on Facebook https://ojs.aaai.org/index.php/ICWSM/article/view/22156 Social media platforms are currently the main channel for political messaging, allowing politicians to target specific demographics and adapt based on their reactions. 
However, making this communication transparent is challenging, as the messaging is tightly coupled with its intended audience and often echoed by multiple stakeholders interested in advancing specific policies. Our goal in this paper is to take a first step towards understanding these highly decentralized settings. We propose a weakly supervised approach to identify the stance and issue of political ads on Facebook and analyze how political campaigns use demographic targeting by location, gender, or age. Furthermore, we analyze the temporal dynamics of political ads in relation to election polls. Tunazzina Islam, Shamik Roy, Dan Goldwasser Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22156 Fri, 02 Jun 2023 00:00:00 -0700 Online Emotions during the Storming of the U.S. Capitol: Evidence from the Social Media Network Parler https://ojs.aaai.org/index.php/ICWSM/article/view/22157 The storming of the U.S. Capitol on January 6, 2021 led to the deaths of five people and is widely regarded as an attack on democracy. The storming was largely coordinated through social media networks such as Twitter and "Parler". Yet little is known regarding how users interacted on Parler during the storming of the Capitol. In this work, we examine the emotion dynamics on Parler during the storming with regard to heterogeneity across time and users. For this, we segment the user base into different groups (e.g., Trump supporters and QAnon supporters). We use affective computing to infer the emotions in content, thereby allowing us to provide a comprehensive assessment of online emotions. Our evaluation is based on a large-scale dataset from Parler, comprising 717,300 posts from 144,003 users. We find that the user base responded to the storming of the Capitol with an overall negative sentiment. Similarly, Trump supporters also expressed a negative sentiment and high levels of disbelief. 
In contrast, QAnon supporters did not express a more negative sentiment during the storming. We further provide a cross-platform analysis and compare the emotion dynamics on Parler and Twitter. Our findings point to a comparatively less negative response to the incidents on Parler than on Twitter, accompanied by higher levels of disapproval and outrage. Our contribution to research is three-fold: (1) We identify online emotions that were characteristic of the storming; (2) we assess emotion dynamics across different user groups on Parler; (3) we compare the emotion dynamics on Parler and Twitter. Thereby, our work offers important implications for actively managing online emotions to prevent similar incidents in the future. Johannes Jakubik, Michael Vössing, Nicolas Pröllochs, Dominik Bär, Stefan Feuerriegel Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22157 Fri, 02 Jun 2023 00:00:00 -0700 Effect of Feedback on Drug Consumption Disclosures on Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/22158 Deaths due to drug overdose in the US have doubled in the last decade. Drug-related content on social media has also exploded in the same time frame. The pseudo-anonymous nature of social media platforms enables users to discuss taboo and sometimes illegal topics like drug consumption. User-generated content (UGC) about drugs on social media can be used as an online proxy to detect offline drug consumption. UGC also gets exposed to the praise and criticism of the community. The law of effect proposes that positive reinforcement of an experience can incentivize users to engage in the experience repeatedly. Therefore, we hypothesize that positive community feedback on a user's online drug consumption disclosure will increase the probability of the user posting an online drug consumption disclosure again. 
To this end, we collect data from 10 drug-related subreddits. First, we build a deep learning model to classify UGC as indicative of drug consumption offline or not, and analyze the extent of such activities. Further, we use matching-based causal inference techniques to unravel community feedback's effect on users' future drug consumption behavior. We discover that 84% of posts and 55% of comments on drug-related subreddits indicate real-life drug consumption. Users who get positive feedback generate up to two times more drug consumption content in the future. Finally, we conduct an anonymous user study on drug-related subreddits to compare members' opinions with our experimental findings and show that users tend to underestimate the effect community peers can have on their decisions to interact with drugs. Hitkul Jangra, Rajiv Shah, Ponnurangam Kumaraguru Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22158 Fri, 02 Jun 2023 00:00:00 -0700 SexWEs: Domain-Aware Word Embeddings via Cross-Lingual Semantic Specialisation for Chinese Sexism Detection in Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/22159 The goal of sexism detection is to mitigate negative online content targeting certain gender groups of people. However, the limited availability of labeled sexism-related datasets makes it problematic to identify online sexism for low-resource languages. In this paper, we address the task of automatic sexism detection in social media for one low-resource language -- Chinese. Rather than collecting new sexism data or building cross-lingual transfer learning models, we develop a cross-lingual domain-aware semantic specialisation system in order to make the most of existing data. 
Semantic specialisation is a technique for retrofitting pre-trained distributional word vectors by integrating external linguistic knowledge (such as lexico-semantic relations) into the specialised feature space. To do this, we leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese) to inject domain knowledge. We demonstrate the benefit of our sexist word embeddings (SexWEs) specialised by our framework via intrinsic evaluation of word similarity and extrinsic evaluation of sexism detection. Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs show an average score improvement of 0.033 and 0.064 in the intrinsic and extrinsic evaluations, respectively. The ablation results and visualisation of SexWEs also prove the effectiveness of our framework on retrofitting word vectors in low-resource languages. Aiqi Jiang, Arkaitz Zubiaga Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22159 Fri, 02 Jun 2023 00:00:00 -0700 Retweet-BERT: Political Leaning Detection Using Language Features and Information Diffusion on Social Networks https://ojs.aaai.org/index.php/ICWSM/article/view/22160 Estimating the political leanings of social media users is a challenging and ever more pressing problem given the increase in social media consumption. We introduce Retweet-BERT, a simple and scalable model to estimate the political leanings of Twitter users. Retweet-BERT leverages the retweet network structure and the language used in users' profile descriptions. Our assumptions stem from patterns of network and linguistic homophily among people who share similar ideologies. 
Retweet-BERT demonstrates competitive performance against other state-of-the-art baselines, achieving 96%-97% macro-F1 on two recent Twitter datasets (a COVID-19 dataset and a 2020 United States presidential elections dataset). We also manually validate the performance of Retweet-BERT on users not in the training data. Finally, in a case study of COVID-19, we illustrate the presence of political echo chambers on Twitter and show that they exist primarily among right-leaning users. Our code is open-sourced and our data is publicly available. Julie Jiang, Xiang Ren, Emilio Ferrara Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22160 Fri, 02 Jun 2023 00:00:00 -0700 Images, Emotions, and Credibility: Effect of Emotional Facial Expressions on Perceptions of News Content Bias and Source Credibility in Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/22161 Images are an indispensable part of the news we consume. Highly emotional images from mainstream and misinformation sources can greatly influence our trust in the news. We present two studies on the effects of emotional facial images on users' perception of bias in news content and the credibility of sources. In study 1, we investigate the impact of repeated exposure to content with images containing positive or negative facial expressions on users’ judgements of source credibility and bias. In study 2, we focus on sources' systematic emotional portrayal of specific politicians. Our results show that the presence of negative (angry) facial emotions can lead to perceptions of higher bias in content. We also find that systematic negative portrayal of different politicians leads to lower perceptions of source credibility. These results highlight how implicit visual propositions manifested by emotions in facial expressions might have a substantial effect on our trust in news. 
Alireza Karduni, Ryan Wesslen, Douglas Markant, Wenwen Dou Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22161 Fri, 02 Jun 2023 00:00:00 -0700 InfluencerRank: Discovering Effective Influencers via Graph Convolutional Attentive Recurrent Neural Networks https://ojs.aaai.org/index.php/ICWSM/article/view/22162 As influencers play considerable roles in social media marketing, companies are increasing their budgets for influencer marketing. Hiring effective influencers is crucial in social influencer marketing, but it is challenging to find the right influencers among hundreds of millions of social media users. In this paper, we propose InfluencerRank, which ranks influencers by their effectiveness based on their posting behaviors and social relations over time. To represent the posting behaviors and social relations, graph convolutional neural networks are applied to model influencers with heterogeneous networks during different historical periods. By learning the network structure with the embedded node features, InfluencerRank can derive informative representations for influencers at each period. An attentive recurrent neural network finally distinguishes highly effective influencers from other influencers by capturing the knowledge of the dynamics of influencer representations over time. Extensive experiments have been conducted on an Instagram dataset that consists of 18,397 influencers with their 2,952,075 posts published within 12 months. The experimental results demonstrate that InfluencerRank outperforms existing baseline methods. An in-depth analysis further reveals that all of our proposed features and model components are beneficial for discovering effective influencers. 
Seungbae Kim, Jyun-Yu Jiang, Jinyoung Han, Wei Wang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22162 Fri, 02 Jun 2023 00:00:00 -0700 Popular Support for Balancing Equity and Efficiency in Resource Allocation: A Case Study in Online Advertising to Increase Welfare Program Awareness https://ojs.aaai.org/index.php/ICWSM/article/view/22163 Algorithmically optimizing the provision of limited resources is commonplace across domains from healthcare to lending. Optimization can lead to efficient resource allocation, but, if deployed without additional scrutiny, can also exacerbate inequality. Little is known about popular preferences regarding acceptable efficiency-equity trade-offs, making it difficult to design algorithms that are responsive to community needs and desires. Here we examine this trade-off and concomitant preferences in the context of GetCalFresh, an online service that streamlines the application process for California’s Supplemental Nutrition Assistance Program (SNAP, formerly known as food stamps). GetCalFresh runs online advertisements to raise awareness of their multilingual SNAP application service. We first demonstrate that when ads are optimized to garner the most enrollments per dollar, a disproportionately small number of Spanish speakers enroll due to relatively higher costs of non-English language advertising. Embedding these results in a survey (N = 1,532) of a diverse set of Americans, we find broad popular support for valuing equity in addition to efficiency: respondents generally preferred reducing total enrollments to facilitate increased enrollment of Spanish speakers. These results buttress recent calls to reevaluate the efficiency-centric paradigm popular in algorithmic resource allocation. 
Allison Koenecke, Eric Giannella, Robb Willer, Sharad Goel Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22163 Fri, 02 Jun 2023 00:00:00 -0700 Personal History Affects Reference Points: A Case Study of Codeforces https://ojs.aaai.org/index.php/ICWSM/article/view/22164 Humans make decisions based on their internal value function, and its shape is known to be distorted and biased around a point, which the behavioral economics research community refers to as the reference point. People intensify activities that come to lie within the reach of their reference point, and abstain from acts that would incur losses once they've crossed the point. However, the impact of past experiences on decision making around the reference point has not been well studied. By analyzing a long series of user-level decisions gathered from a competitive programming website, we find that history has a clear impact on users' decision making around the reference point. Past experiences can strengthen, and sometimes weaken, the decision bias around the reference point. Experiences of past difficulties can strengthen the tendency towards loss aversion after achieving the reference point. When a person crosses a reference point for the first time, the cognitive decision bias is significant. However, repeating this crossing gradually weakens the effect. We also show the value of our insights in the task of predicting user behavior. Prediction models incorporating our insights may be used for motivating people to remain more active. 
Takeshi Kurashima, Tomoharu Iwata, Tomu Tominaga, Shuhei Yamamoto, Hiroyuki Toda, Kazuhisa Takemura Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22164 Fri, 02 Jun 2023 00:00:00 -0700 Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario https://ojs.aaai.org/index.php/ICWSM/article/view/22165 Characterizing the demographics of social media users enables a diversity of applications, from better targeting of policy interventions to the derivation of representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging as labeled data is often scarce. Alternatively, rule-based matching strategies provide well-grounded information but only offer partial coverage over users. It is unclear, therefore, what features and models are best suited to maximize coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference by relying on minimal labeling efforts. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, namely Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches based purely on user content that lack graph information. Notably, we find that LP achieves comparable performance to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. 
Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population. Karim Lasri, Manuel Tonneau, Haaya Naushan, Niyati Malhotra, Ibrahim Farouq, Víctor Orozco-Olvera, Samuel Fraiberger Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22165 Fri, 02 Jun 2023 00:00:00 -0700 Associative Inference Can Increase People’s Susceptibility to Misinformation https://ojs.aaai.org/index.php/ICWSM/article/view/22166 Associative inference is an adaptive, constructive process of memory that allows people to link related information to make novel connections. We conducted three online human-subjects experiments investigating participants’ susceptibility to associatively inferred misinformation and its interaction with their cognitive ability and how news articles were presented. In each experiment, participants completed recognition and perceived accuracy rating tasks for the snippets of news articles in a tweet format across two phases. At Phase 1, participants viewed real news only. At Phase 2, participants viewed both real and fake news. Critically, we varied whether the fake news at Phase 2 was inferred from (i.e., associative inference), associated with (i.e., association only), or irrelevant to (i.e., control) the corresponding real news pairs at Phase 1. Both recognition and perceived accuracy results showed that participants in the associative inference condition were more susceptible to fake news than those in the other conditions. Furthermore, hashtags embedded within the tweets made the obtained effects evident only for the participants of higher cognitive ability. Our findings reveal that associative inference can be a basis for individuals’ susceptibility to misinformation, especially for those of higher cognitive ability. 
We conclude by discussing the implications of our results for understanding and mitigating misinformation on social media platforms. Sian Lee, Haeseung Seo, Dongwon Lee, Aiping Xiong Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22166 Fri, 02 Jun 2023 00:00:00 -0700 Beyond Discrete Genres: Mapping News Items onto a Multidimensional Framework of Genre Cues https://ojs.aaai.org/index.php/ICWSM/article/view/22167 In the contemporary media landscape, with the vast and diverse supply of news, it is increasingly challenging to study such an enormous number of items without a standardized framework. Although attempts have been made to organize and compare news items on the basis of news values, news genres receive little attention, especially the genres in a news consumer’s perception. Yet, perceived news genres serve as an essential component in exploring how news has developed, as well as a precondition for understanding media effects. We approach this concept by conceptualizing and operationalizing a non-discrete framework for mapping news items in terms of genre cues. As a starting point, we propose a preliminary set of dimensions consisting of “factuality” and “formality”. To automatically analyze a large number of news items, we deliver two computational models for predicting news sentences along these two dimensions. Such predictions could then be used for locating news items within our framework. This proposed approach, which positions news items upon a multidimensional grid, helps deepen our insight into the evolving nature of news genres. 
Zilin Lin, Kasper Welbers, Susan Vermeer, Damian Trilling Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22167 Fri, 02 Jun 2023 00:00:00 -0700 "Learn the Facts about COVID-19": Analyzing the Use of Warning Labels on TikTok Videos https://ojs.aaai.org/index.php/ICWSM/article/view/22168 During the COVID-19 pandemic, health-related misinformation and harmful content shared online had a significant adverse effect on society. In an attempt to mitigate this adverse effect, mainstream social media platforms like Facebook, Twitter, and TikTok employed soft moderation interventions (i.e., warning labels) on potentially harmful posts. Such interventions aim to inform users about the post's content without removing it, hence easing the public's concerns about censorship and freedom of speech. Despite the recent popularity of these moderation interventions, as a research community, we lack empirical analyses aiming to uncover how these warning labels are used in the wild, particularly during challenging times like the COVID-19 pandemic. In this work, we analyze the use of warning labels on TikTok, focusing on COVID-19 videos. First, we construct a set of 26 COVID-19 related hashtags, and then we collect 41K videos that include those hashtags in their description. Second, we perform a quantitative analysis on the entire dataset to understand the use of warning labels on TikTok. Then, we perform an in-depth qualitative study, using thematic analysis, on 222 COVID-19 related videos to assess the content and the connection between the content and the warning labels. Our analysis shows that TikTok applies warning labels to videos broadly, likely based on hashtags included in the description (e.g., 99% of the videos that contain #coronavirus have warning labels). 
More worrying is the addition of COVID-19 warning labels to videos whose actual content is not related to COVID-19 (23% of the cases in a sample of 143 English videos that are not related to COVID-19). Finally, our qualitative analysis on a sample of 222 videos shows that 7.7% of the videos share misinformation/harmful content and do not include warning labels, 37.3% share benign information and include warning labels, and 35% of the videos that share misinformation/harmful content (and need a warning label) are made for fun. Our study demonstrates the need to develop more accurate and precise soft moderation systems, especially on a platform like TikTok that is extremely popular among younger users. Chen Ling, Krishna P. Gummadi, Savvas Zannettou Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22168 Fri, 02 Jun 2023 00:00:00 -0700 Improving Mental Health Classifier Generalization with Pre-diagnosis Data https://ojs.aaai.org/index.php/ICWSM/article/view/22169 Recent work has shown that classifiers for depression detection often fail to generalize to new datasets. Most NLP models for this task are built on datasets that use textual reports of a depression diagnosis (e.g., statements on social media) to identify diagnosed users; this approach allows for collection of large-scale datasets, but leads to poor generalization to out-of-domain data. Notably, models tend to capture features that typify direct discussion of mental health rather than more subtle indications of depression symptoms. In this paper, we explore the hypothesis that building classifiers using exclusively social media posts from before a user's diagnosis will lead to less reliance on shortcuts and better generalization. 
We test our classifiers on a dataset that is based on an external survey rather than textual self-reports, and find that using pre-diagnosis data for training yields improved performance with many types of classifiers. Yujian Liu, Laura Biester, Rada Mihalcea Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22169 Fri, 02 Jun 2023 00:00:00 -0700 Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic https://ojs.aaai.org/index.php/ICWSM/article/view/22170 While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and the remote teams are influenced by and would respond to this and future shocks remain largely unknown. Software developers have relied on remote collaborations long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated to certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock. 
Xuan Lu, Wei Ai, Yixin Wang, Qiaozhu Mei Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22170 Fri, 02 Jun 2023 00:00:00 -0700 Contextualizing Online Conversational Networks https://ojs.aaai.org/index.php/ICWSM/article/view/22171 Online social connections occur within a specific conversational context. Prior work in network analysis of social media data attempts to contextualize data through filtering. We propose a method of contextualizing online conversational connections automatically and illustrate this method with Twitter data. Specifically, we detail a graph neural network model capable of representing tweets in a vector space based on their text, hashtags, URLs, and neighboring tweets. Once tweets are represented, clusters of tweets uncover conversational contexts. We apply our method to a dataset with 4.5 million tweets discussing the 2020 US election. We find that even filtered data contains many different conversational contexts, with users engaging in multiple conversations. However, the overlap between any pair of conversations tends to be only 30-40%, giving very different networks for different conversations. Even accounting for this variation, we show that the relative social status of users varies considerably across contexts, with tau=0.472 on average. Our findings imply that standard network analysis on social media data can be unreliable in the face of multiple conversational contexts. Thomas Magelinski, Kathleen M. 
Carley Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22171 Fri, 02 Jun 2023 00:00:00 -0700 Comfort Foods and Community Connectedness: Investigating Diet Change during COVID-19 Using YouTube Videos on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22172 Unprecedented lockdowns at the start of the COVID-19 pandemic have drastically changed the routines of millions of people, potentially impacting important health-related behaviors. In this study, we use YouTube videos embedded in tweets about diet, exercise and fitness posted before and during COVID-19 to investigate the influence of the pandemic lockdowns on diet and nutrition. In particular, we examine the nutritional profile of the foods mentioned in the transcript, description, and title of each video in terms of six macronutrients (protein, energy, fat, sodium, sugar, and saturated fat). These macronutrient values were further linked to demographics to assess if there are specific effects on those potentially having insufficient access to healthy sources of food. Interrupted time series analysis revealed a considerable shift in the aggregated macronutrient scores before and during COVID-19. In particular, whereas areas with lower incomes showed a decrease in energy, fat, and saturated fat, those with a higher percentage of African Americans showed an elevation in sodium. Word2Vec word similarities and odds ratio analysis suggested a shift from popular diets and lifestyle bloggers before the lockdowns to an interest in a variety of healthy foods, communal sharing of quick and easy recipes, as well as a new emphasis on comfort foods. To the best of our knowledge, this work is novel in terms of linking attention signals in tweets, content of videos, their nutrient profiles, and aggregate demographics of the users. 
The insights made possible by this combination of resources are important for monitoring the secondary health effects of social distancing, and informing social programs designed to alleviate these effects. Yelena Mejova, Lydia Manikonda Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22172 Fri, 02 Jun 2023 00:00:00 -0700 Authority without Care: Moral Values behind the Mask Mandate Response https://ojs.aaai.org/index.php/ICWSM/article/view/22173 Face masks are one of the cheapest and most effective non-pharmaceutical interventions available against airborne diseases such as COVID-19. Unfortunately, they have been met with resistance by a substantial fraction of the populace, especially in the U.S. In this study, we uncover the latent moral values that underpin the response to the mask mandate, and paint them against the country's political backdrop. We monitor the discussion about masks on Twitter, which involves almost 600k users in a time span of 7 months. By using a combination of graph mining, natural language processing, topic modeling, content analysis, and time series analysis, we characterize the responses to the mask mandate of both supporters and opponents. We base our analysis on the theoretical frameworks of Moral Foundation Theory and Hofstede's cultural dimensions. Our results show that, while the anti-mask stance is associated with a conservative political leaning, the moral values expressed by its adherents diverge from the ones typically used by conservatives. In particular, the expected emphasis on the values of authority and purity is accompanied by an atypical dearth of in-group loyalty. We find that after the mandate, both pro- and anti-mask sides decrease their emphasis on care about others, and increase their attention to authority and fairness, further politicizing the issue. 
In addition, the mask mandate reverses the expression of Individualism-Collectivism between the two sides, with an increase of individualism in the anti-mask narrative, and a decrease in the pro-mask one. We argue that monitoring the dynamics of moral positioning is crucial for designing effective public health campaigns that are sensitive to the underlying values of the target audience. Yelena Mejova, Kyriaki Kalimeri, Gianmarco De Francisci Morales Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22173 Fri, 02 Jun 2023 00:00:00 -0700 Bridging Nations: Quantifying the Role of Multilinguals in Communication on Social Media https://ojs.aaai.org/index.php/ICWSM/article/view/22174 Social media enables the rapid spread of many kinds of information, from pop culture memes to social movements. However, little is known about how information crosses linguistic boundaries. We apply causal inference techniques on the European Twitter network to quantify the structural role and communication influence of multilingual users in cross-lingual information exchange. Overall, multilinguals play an essential role; posting in multiple languages increases betweenness centrality by 13%, and having a multilingual network neighbor increases monolinguals’ odds of sharing domains and hashtags from another language 16-fold and 4-fold, respectively. We further show that multilinguals have a greater impact on diffusing information that is less accessible to their monolingual compatriots, such as information from far-away countries and content about regional politics, nascent social movements, and job opportunities. By highlighting information exchange across borders, this work sheds light on a crucial component of how information and ideas spread around the world. 
Julia Mendelsohn, Sayan Ghosh, David Jurgens, Ceren Budak Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22174 Fri, 02 Jun 2023 00:00:00 -0700 Information Operations in Turkey: Manufacturing Resilience with Free Twitter Accounts https://ojs.aaai.org/index.php/ICWSM/article/view/22175 Following the 2016 US elections, Twitter launched its Information Operations (IO) hub, where it archives account activity connected to state-linked information operations. In June 2020, Twitter took down and released a set of accounts linked to Turkey's ruling political party (AKP). We investigate these accounts in the aftermath of the takedown to explore whether AKP-linked operations are ongoing and to understand the strategies they use to remain resilient to disruption. We collect live accounts that appear to be part of the same network, ~30% of which have been suspended by Twitter since our collection. We create a BERT-based classifier that shows similarity between these two networks, develop a taxonomy to categorize these accounts, find direct sequel accounts between the Turkish takedown and the live accounts, and find evidence that Turkish IO actors deliberately construct their network to withstand large-scale shutdown by utilizing explicit and implicit signals of coordination. We compare our findings from the Turkish operation to Russian and Chinese IO on Twitter and find that Turkey's IO utilizes a unique group structure to remain resilient. Our work highlights the fundamental imbalance between IO actors quickly and easily creating free accounts and the social media platforms spending significant resources on detection and removal, and contributes novel findings about Turkish IO on Twitter. 
Maya Merhi, Sarah Rajtmajer, Dongwon Lee Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22175 Fri, 02 Jun 2023 00:00:00 -0700 "This Is Fake News": Characterizing the Spontaneous Debunking from Twitter Users to COVID-19 False Information https://ojs.aaai.org/index.php/ICWSM/article/view/22176 False information spreads on social media, and fact-checking is a potential countermeasure. However, there is a severe shortage of fact-checkers; an efficient way to scale fact-checking is desperately needed, especially in pandemics like COVID-19. In this study, we focus on spontaneous debunking by social media users, which has been overlooked in existing research despite its indicated usefulness for fact-checking and countering false information. Specifically, we characterize the tweets with false information, or fake tweets, that tend to be debunked and Twitter users who often debunk fake tweets. For this analysis, we create a comprehensive dataset of responses to fake tweets, annotate a subset of them, and build a classification model for detecting debunking behaviors. We find that most fake tweets are left undebunked, spontaneous debunking is slower than other forms of responses, and spontaneous debunking exhibits partisanship in political topics. These results provide actionable insights into utilizing spontaneous debunking to scale conventional fact-checking, thereby supplementing existing research from a new perspective. Kunihiro Miyazaki, Takayuki Uchiba, Kenji Tanaka, Jisun An, Haewoon Kwak, Kazutoshi Sasahara Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22176 Fri, 02 Jun 2023 00:00:00 -0700 Echo Tunnels: Polarized News Sharing Online Runs Narrow but Deep https://ojs.aaai.org/index.php/ICWSM/article/view/22177 Online social platforms afford users vast digital spaces to share and discuss current events. 
However, scholars have concerns both over their role in segregating information exchange into ideological echo chambers, and over evidence that these echo chambers are nonetheless overstated. In this work, we investigate news-sharing patterns across the entirety of Reddit and find that the platform appears polarized macroscopically, especially in politically right-leaning spaces. On closer examination, however, we observe that the majority of this effect originates from small, hyper-partisan segments of the platform accounting for a minority of news shared. We further map the temporal evolution of polarized news sharing and uncover evidence that, in addition to having grown drastically over time, polarization in hyper-partisan communities also began much earlier than 2016 and is resistant to Reddit's largest moderation event. Our results therefore suggest that polarized news sharing runs narrow but deep online. Rather than being guided by the general prevalence or absence of echo chambers, we argue that platform policies are better served by measuring and targeting the communities in which ideological segregation is strongest. Lillio Mok, Michael Inzlicht, Ashton Anderson Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22177 Fri, 02 Jun 2023 00:00:00 -0700 The Chance of Winning Election Impacts on Social Media Strategy https://ojs.aaai.org/index.php/ICWSM/article/view/22178 Social media has been a paramount arena for election campaigns for political actors. While many studies have been paying attention to the political campaigns related to partisanship, politicians can also conduct different campaigns according to their chances of winning. Leading candidates, for example, do not behave the same as fringe candidates in their elections, and vice versa. We, however, know little about this difference in social media political campaign strategies according to their odds in elections. 
We tackle this problem by analyzing candidates' tweets in terms of users, topics, and sentiment of replies. Our study finds that, as their chances of winning increase, candidates narrow the targets they communicate with, from people in general to their electoral districts and specific persons (verified accounts or accounts with many followers). Our study brings new insights into candidates' campaign strategies through an analysis based on the novel perspective of the candidate's electoral situation. Taichi Murayama, Akira Matsui, Kunihiro Miyazaki, Yasuko Matsubara, Yasushi Sakurai Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22178 Fri, 02 Jun 2023 00:00:00 -0700 BotBuster: Multi-Platform Bot Detection Using a Mixture of Experts https://ojs.aaai.org/index.php/ICWSM/article/view/22179 Despite rapid development, current bot detection models still face challenges in dealing with incomplete data and cross-platform applications. In this paper, we propose BotBuster, a social bot detector built with the concept of a mixture of experts approach. Each expert is trained to analyze a portion of account information, e.g. username, and the experts are combined to estimate the probability that the account is a bot. Experiments on 10 Twitter datasets show that BotBuster outperforms popular bot-detection baselines (avg F1=73.54 vs avg F1=45.12). This is accompanied by F1=60.04 on a Reddit dataset and F1=60.92 on an external evaluation set. Further analysis shows that only 36 posts are required for a stable bot classification. Investigation shows that bot post features have changed across the years and can be difficult to differentiate from human features, making bot detection a difficult and ongoing problem. Lynnette Hui Xian Ng, Kathleen M. 
Carley Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22179 Fri, 02 Jun 2023 00:00:00 -0700 "Dummy Grandpa, Do You Know Anything?": Identifying and Characterizing Ad Hominem Fallacy Usage in the Wild https://ojs.aaai.org/index.php/ICWSM/article/view/22180 Today, participating in discussions on online forums is extremely commonplace, and these discussions have started exerting a strong influence on the overall opinion of online users. Naturally, twisting the flow of the argument can have a strong impact on the minds of naive users, which in the long run might have socio-political ramifications, for example, winning an election or spreading targeted misinformation. Thus, these platforms are potentially highly vulnerable to malicious players who might act individually or as a cohort to breed fallacious arguments with a motive to sway public opinion. Ad hominem arguments are one of the most effective forms of such fallacies. Although a simple fallacy, it is effective enough to sway public debates in the offline world and can be used as a precursor to shutting down the voice of opposition by slander. In this work, we take a first step in shedding light on the usage of ad hominem fallacies in the wild. First, we build a powerful ad hominem detector based on a transformer architecture with high accuracy (F1 more than 83%, showing a significant improvement over prior work), even for datasets for which annotated instances constitute a very small fraction. We then used our detector on 265k arguments collected from the online debate forum – CreateDebate. Our crowdsourced surveys validate our in-the-wild predictions on CreateDebate data (94% match with manual annotation). Our analysis revealed that a surprising 31.23% of CreateDebate content contains ad hominem fallacies, and a cohort of highly active users post significantly more ad hominem arguments to suppress opposing views. 
Then, our temporal analysis revealed that ad hominem argument usage has increased significantly since the 2016 US Presidential election, not only for topics like Politics, but also for Science and Law. We conclude by discussing important implications of our work for detecting and defending against ad hominem fallacies. Utkarsh Patel, Animesh Mukherjee, Mainack Mondal Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22180 Fri, 02 Jun 2023 00:00:00 -0700 On the Relation between Opinion Change and Information Consumption on Reddit https://ojs.aaai.org/index.php/ICWSM/article/view/22181 While much attention has been devoted to the causes of opinion change, little is known about its consequences. Our study takes a first step in this direction by looking at Reddit, in particular at the subreddit r/ChangeMyView, a community dedicated to debating one’s own opinions on a wide array of topics. We analyze changes in online information consumption behavior that arise after a self-reported opinion change, by looking at participation in a set of sociopolitical communities. We find that people who self-report an opinion change are significantly more likely to change their future participation in a specific subset of those communities. Specifically, there is a significant association (Pearson r = 0.46) between using propaganda-like language in a community and the increase in chances of leaving it. Comparable results (Pearson r = 0.39) hold for the opposite direction, i.e., joining these same communities. In addition, the textual content of the post associated with opinion change is indicative of which communities will be joined or left: a predictive model based only on the text of this post can pinpoint these communities with an average precision@5 of 0.20. 
Our results establish a link between opinion change and information consumption, and highlight how online propagandistic communities act as a first gateway to internalize a shift in one’s sociopolitical opinion. Flavio Petruzzellis, Francesco Bonchi, Gianmarco De Francisci Morales, Corrado Monti Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22181 Fri, 02 Jun 2023 00:00:00 -0700 This Sample Seems to Be Good Enough! Assessing Coverage and Temporal Reliability of Twitter’s Academic API https://ojs.aaai.org/index.php/ICWSM/article/view/22182 Because of its willingness to share data with academia and industry, Twitter has been the primary social media platform for scientific research as well as for consulting businesses and governments in the last decade. In recent years, a series of publications have studied and criticized Twitter's APIs, and Twitter has partially adapted its existing data streams. The newest Twitter API for Academic Research allows researchers to "access Twitter's real-time and historical public data with additional features and functionality that support collecting more precise, complete, and unbiased datasets." The main new feature of this API is the possibility of accessing the full archive of all historic Tweets. In this article, we will take a closer look at the Academic API and will try to answer two questions. First, are the datasets collected with the Academic API complete? Secondly, since Twitter's Academic API delivers historic Tweets as represented on Twitter at the time of data collection, we need to understand how much data is lost over time due to Tweet and account removal from the platform. Our work shows evidence that Twitter's Academic API can indeed create (almost) complete samples of Twitter data based on a wide variety of search terms. We also provide evidence that Twitter's data endpoint v2 delivers better samples than the previously used endpoint v1.1. 
Furthermore, collecting Tweets with the Academic API at the time of studying a phenomenon, rather than creating local archives of stored Tweets, allows for a straightforward way of following Twitter's developer agreement. Finally, we will also discuss technical artifacts and implications of the Academic API. We hope that our work can add another layer of understanding of Twitter data collections, leading to more reliable studies of human behavior via social media data. Jürgen Pfeffer, Angelina Mooseder, Jana Lasser, Luca Hammer, Oliver Stritzel, David Garcia Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22182 Fri, 02 Jun 2023 00:00:00 -0700 The Geometry of Misinformation: Embedding Twitter Networks of Users Who Spread Fake News in Geometrical Opinion Spaces https://ojs.aaai.org/index.php/ICWSM/article/view/22183 To understand why internet users spread fake news online, many studies have focused on individual drivers, such as cognitive skills, media literacy, or demographics. Recent findings have also shown the role of complex socio-political dynamics, highlighting that political polarization and ideologies are closely linked to a propensity to participate in the dissemination of fake news. Most of the existing empirical studies have focused on the US example by exploiting the self-reported or solicited positioning of users on a dichotomous scale opposing liberals with conservatives. Yet, left-right polarization alone is insufficient to study socio-political dynamics when considering non-binary and multi-dimensional party systems, in which relevant ideological stances must be characterized in additional dimensions, relating for example to opposition to elites, government, political parties, or mainstream media. 
In this article, we leverage ideological embeddings of Twitter networks in France in multi-dimensional opinion spaces, where dimensions stand for attitudes towards different issues, and we trace the positions of users who shared articles that were rated as misinformation by fact-checkers. In multi-dimensional settings, and in contrast with the US, opinion dimensions capturing attitudes towards elites are more predictive of whether a user shares misinformation. Most users sharing misinformation hold salient anti-elite sentiments and, among them, more so those with radical left- and right-leaning stances. Our results reinforce the importance of enriching one-dimensional left-right analyses, showing that other ideological dimensions, such as anti-elite sentiment, are critical when characterizing users who spread fake news. This lends support to emerging accounts of social drivers of misinformation through political polarization, but also stresses the role of the entanglement between fake news, anti-elite polarization, and the role of scientific authorities in public debate. Pedro Ramaciotti Morales, Manon Berriche, Jean-Philippe Cointet Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22183 Fri, 02 Jun 2023 00:00:00 -0700 Spillover of Antisocial Behavior from Fringe Platforms: The Unintended Consequences of Community Banning https://ojs.aaai.org/index.php/ICWSM/article/view/22184 Online platforms face pressure to keep their communities civil and respectful. Thus, banning problematic online communities from mainstream platforms is often met with enthusiastic public reactions. However, this policy can lead users to migrate to alternative fringe platforms with lower moderation standards and may reinforce antisocial behaviors. As users of these communities often remain co-active across mainstream and fringe platforms, antisocial behaviors may spill over onto the mainstream platform. 
We study this possible spillover by analyzing 70,000 users from three banned communities that migrated to fringe platforms: r/The_Donald, r/GenderCritical, and r/Incels. Using a difference-in-differences design, we contrast co-active users with matched counterparts to estimate the causal effect of fringe platform participation on users' antisocial behavior on Reddit. Our results show that participating in the fringe communities increases users' toxicity on Reddit (as measured by Perspective API) and involvement with subreddits similar to the banned community, which often also breach platform norms. The effect intensifies with time and exposure to the fringe platform. In short, we find evidence for a spillover of antisocial behavior from fringe platforms onto Reddit via co-participation. Giuseppe Russo, Luca Verginer, Manoel Horta Ribeiro, Giona Casiraghi Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22184 Fri, 02 Jun 2023 00:00:00 -0700 Cross-Lingual and Cross-Domain Crisis Classification for Low-Resource Scenarios https://ojs.aaai.org/index.php/ICWSM/article/view/22185 Social media data has emerged as a useful source of timely information about real-world crisis events. One of the main tasks related to the use of social media for disaster management is the automatic identification of crisis-related messages. Most of the studies on this topic have focused on the analysis of data for a particular type of event in a specific language. This limits the possibility of generalizing existing approaches because models cannot be directly applied to new types of events or other languages. In this work, we study the task of automatically classifying messages that are related to crisis events by leveraging cross-language and cross-domain labeled data. 
Our goal is to make use of labeled data from high-resource languages to classify messages from other (low-resource) languages and/or of new (previously unseen) types of crisis situations. For our study we consolidated from the literature a large unified dataset containing multiple crisis events and languages. Our empirical findings show that it is indeed possible to leverage data from crisis events in English to classify the same type of event in other languages, such as Spanish and Italian (80.0% F1-score). Furthermore, we achieve good performance for the cross-domain task (80.0% F1-score) in a cross-lingual setting. Overall, our work contributes to alleviating the data scarcity problem that is so important for multilingual crisis classification. In particular, it helps mitigate cold-start situations in emergency events, when time is of the essence. Cinthia Sánchez, Hernan Sarmiento, Andres Abeliuk, Jorge Pérez, Barbara Poblete Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22185 Fri, 02 Jun 2023 00:00:00 -0700 How Much User Context Do We Need? Privacy by Design in Mental Health NLP Applications https://ojs.aaai.org/index.php/ICWSM/article/view/22186 Clinical NLP tasks, such as mental health assessment from text, must take social constraints into account: performance maximization must be constrained by the utmost importance of guaranteeing the privacy of user data. Consumer protection regulations, such as GDPR, generally handle privacy by restricting data availability, such as requiring user data to be limited to 'what is necessary' for a given purpose. In this work, we reason that providing stricter formal privacy guarantees, while increasing the volume of user data in the model, in most cases increases benefit for all parties involved, especially for the user. We demonstrate our arguments on two existing suicide risk assessment datasets of Twitter and Reddit posts. 
We present the first analysis juxtaposing user history length and differential privacy budgets and elaborate how modeling additional user context enables utility preservation while maintaining acceptable user privacy guarantees. Ramit Sawhney, Atula Neerkaje, Ivan Habernal, Lucie Flek Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22186 Fri, 02 Jun 2023 00:00:00 -0700 Effects of Algorithmic Trend Promotion: Evidence from Coordinated Campaigns in Twitter’s Trending Topics https://ojs.aaai.org/index.php/ICWSM/article/view/22187 In addition to more personalized content feeds, some leading social media platforms give a prominent role to content that is more widely popular. On Twitter, "trending topics" identify popular topics of conversation on the platform, thereby promoting popular content which users might not have otherwise seen through their network. Hence, "trending topics" potentially play important roles in influencing the topics users engage with on a particular day. Using two carefully constructed data sets from India and Turkey, we study the effects of a hashtag appearing on the trending topics page on the number of tweets produced with that hashtag. We specifically aim to answer the question: How many new tweets using a hashtag appear because the hashtag is labeled as trending? We distinguish the effects of the trending topics page from network exposure and find there is a statistically significant, but modest, return to a hashtag being featured on trending topics. Analysis of the types of users impacted by trending topics shows that the feature helps less popular and new users to discover and spread content outside their network, which they otherwise might not have been able to do. 
Joseph Schlessinger, Kiran Garimella, Maurice Jakesch, Dean Eckles Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22187 Fri, 02 Jun 2023 00:00:00 -0700 Detecting Anti-vaccine Users on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22188 Vaccine hesitancy, which has recently been driven by online narratives, significantly degrades the efficacy of vaccination strategies, such as those for COVID-19. Despite broad agreement in the medical community about the safety and efficacy of available vaccines, a large number of social media users continue to be inundated with false information about vaccines and are indecisive or unwilling to be vaccinated. The goal of this study is to better understand anti-vaccine sentiment by developing a system capable of automatically identifying the users responsible for spreading anti-vaccine narratives. We introduce a publicly available Python package capable of analyzing Twitter profiles to assess how likely that profile is to share anti-vaccine sentiment in the future. The software package is built using text embedding methods, neural networks, and automated dataset generation and is trained on several million tweets. We find this model can accurately detect anti-vaccine users up to a year before they tweet anti-vaccine hashtags or keywords. We also show examples of how text analysis helps us understand anti-vaccine discussions by detecting moral and emotional differences between anti-vaccine spreaders on Twitter and regular users. Our results will help researchers and policy-makers understand how users become anti-vaccine and what they discuss on Twitter. Policy-makers can utilize this information for better targeted campaigns that debunk harmful anti-vaccination myths. 
Matheus Schmitz, Goran Muric, Keith Burghardt Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22188 Fri, 02 Jun 2023 00:00:00 -0700 Cybersecurity Misinformation Detection on Social Media: Case Studies on Phishing Reports and Zoom’s Threat https://ojs.aaai.org/index.php/ICWSM/article/view/22189 Prior work has extensively studied misinformation related to news, politics, and health; however, misinformation can also be about technological topics. While less controversial, such misinformation can severely impact companies’ reputations and revenues, and users’ online experiences. Recently, social media has also been increasingly used as a novel knowledge base for extracting timely and relevant security threats, which are fed to threat intelligence systems for better performance. However, with possible campaigns spreading false security threats, these systems can become vulnerable to poisoning attacks. In this work, we proposed novel approaches for detecting misinformation about cybersecurity and privacy threats on social media, focusing on two topics with different types of misinformation: phishing websites and Zoom’s security & privacy threats. We developed a framework for detecting inaccurate phishing claims on Twitter. Using this framework, we could label about 9% of URLs and 22% of phishing reports as misinformation. We also proposed another framework for detecting misinformation related to Zoom’s security and privacy threats on multiple platforms. Our classifiers showed strong performance with more than 98% accuracy. Employing these classifiers on the posts from Facebook, Instagram, Reddit, and Twitter, we found respectively that about 18%, 3%, 4%, and 3% of posts were misinformation. In addition, we studied the characteristics of misinformation posts, their authors, and their timelines, which helped us identify campaigns. 
Mohit Singhal, Nihal Kumarswamy, Shreyasi Kinhekar, Shirin Nilizadeh Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22189 Fri, 02 Jun 2023 00:00:00 -0700 Characterizing and Identifying Socially Shared Self-Descriptions in Product Reviews https://ojs.aaai.org/index.php/ICWSM/article/view/22190 Online e-commerce product reviews can be highly influential in a customer's decision-making processes. Reviews often describe personal experiences with a product and provide candid opinions about a product's pros and cons. In some cases, reviewers choose to share information about themselves, just as they might do in social platforms. These descriptions are a valuable source of information about who finds a product most helpful. Customers benefit from key insights about a product from people with the same interests, and sellers might use the information to better serve their customers' needs. In this work, we present a comprehensive look into voluntary self-descriptive information found in public customer reviews. We analyzed what people share about themselves and how this contributes to their product opinions. We developed a taxonomy of types of self-descriptions, and a machine-learned classification model of reviews according to this taxonomy. We present new quantitative findings, and a thematic study of the perceived purpose of these descriptions in reviews. Lu Sun, F. Maxwell Harper, Chia-Jung Lee, Vanessa Murdock, Barbara Poblete Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22190 Fri, 02 Jun 2023 00:00:00 -0700 Social Influence-Maximizing Group Recommendation https://ojs.aaai.org/index.php/ICWSM/article/view/22191 In this paper, we revisit the group recommendation problem by taking into consideration information diffusion in a social network as one of the main criteria that must be maximised. 
While the well-known influence maximization problem has the objective of selecting k users (spread seeds) from a social network so that a piece of information can spread to the largest possible number of people in the network, in our setting the seeds are known (given as a group), and we must decide which k items (pieces of information) should be recommended to them. Therefore, the recommended items should at the same time be the best match for that group's preferences, and have the potential to spread as much as possible in an underlying diffusion network, to which the group members (the seeds) belong. This problem is directly motivated by group recommendation scenarios where social networking is an inherent dimension that must be taken into account when assessing the potential impact of a certain recommendation. We present the model and formulate the problem of influence-aware group recommendation as a multiple objective optimization problem. We then describe a greedy approach for this problem and we design an optimisation approach by adapting the top-k algorithms NRA and TA. We evaluate all these methods experimentally, in three different recommendation scenarios, for movie, micro-blog and book recommendations, based on real-world datasets from Flixster, Twitter, and Douban, respectively. Unsurprisingly, with the introduction of information diffusion as an optimization criterion for group recommendation, the recommendation problem becomes more complex. However, we show that our algorithms enable spread efficiency without loss of recommendation precision, under reasonable latency. Yangke Sun, Bogdan Cautis, Silviu Maniu Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22191 Fri, 02 Jun 2023 00:00:00 -0700 Top-Down Influence? 
Predicting CEO Personality and Risk Impact from Speech Transcripts https://ojs.aaai.org/index.php/ICWSM/article/view/22192 How much does a CEO’s personality impact the performance of their company? Management theory posits a great influence, but it is difficult to show empirically—there is a lack of publicly available self-reported personality data of top managers. Instead, we propose a text-based personality regressor based on crowd-sourced Myers–Briggs Type Indicator (MBTI) assessments. The ratings have a high internal and external validity and can be predicted with moderate to strong correlations for three out of four dimensions. Providing evidence for the upper echelons theory, we demonstrate that the predicted CEO personalities have explanatory power for financial risk. Kilian Theil, Dirk Hovy, Heiner Stuckenschmidt Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22192 Fri, 02 Jun 2023 00:00:00 -0700 Identifying Influential Brokers on Social Media from Social Network Structure https://ojs.aaai.org/index.php/ICWSM/article/view/22193 Identifying influencers in a given social network has become an important research problem for various applications, including accelerating the spread of information in viral marketing and preventing the spread of fake news and rumors. The literature contains a rich body of studies on identifying influential source spreaders who can spread their own messages to many other nodes. In contrast, the identification of influential brokers who can spread other nodes' messages to many nodes has not been fully explored. Theoretical and empirical studies suggest that involvement of both influential source spreaders and brokers is a key to facilitating large-scale information diffusion cascades. Therefore, this paper explores ways to identify influential brokers from a given social network. 
By using three social media datasets, we investigate the characteristics of influential brokers by comparing them with influential source spreaders and central nodes obtained from centrality measures. Our results show that (i) most of the influential source spreaders are not influential brokers (and vice versa) and (ii) the overlap between central nodes and influential brokers is small (less than 15%) in Twitter datasets. We also tackle the problem of identifying influential brokers from centrality measures and node embeddings, and we examine the effectiveness of social network features in the broker identification task. Our results show that (iii) although a single centrality measure cannot characterize influential brokers well, prediction models using node embedding features achieve F1 scores of 0.35--0.68, suggesting the effectiveness of social network features for identifying influential brokers. Sho Tsugawa, Kohei Watabe Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22193 Fri, 02 Jun 2023 00:00:00 -0700 A Multi-Task Model for Sentiment Aided Stance Detection of Climate Change Tweets https://ojs.aaai.org/index.php/ICWSM/article/view/22194 Climate change has become one of the biggest challenges of our time. Social media platforms such as Twitter play an important role in raising public awareness and spreading knowledge about the dangers of the current climate crisis. With the increasing number of campaigns and communication about climate change through social media, the information could create more awareness and reach the general public and policy makers. However, these Twitter communications lead to polarization of beliefs, opinion-dominated ideologies, and often a split into two communities of climate change deniers and believers. 
In this paper, we propose a framework that helps identify denier statements on Twitter and thus classifies the stance of the tweet into one of the two attitudes towards climate change (denier/believer). The sentimental aspects of Twitter data on climate change are deeply rooted in general public attitudes toward climate change. Therefore, our work focuses on learning two closely related tasks: Stance Detection and Sentiment Analysis of climate change tweets. We propose a multi-task framework that performs stance detection (primary task) and sentiment analysis (auxiliary task) simultaneously. The proposed model incorporates the feature-specific and shared-specific attention frameworks to fuse multiple features and learn the generalized features for both tasks. The experimental results show that the proposed framework increases the performance of the primary task, i.e., stance detection, by benefiting from the auxiliary task, i.e., sentiment analysis, compared to its uni-modal and single-task variants. Apoorva Upadhyaya, Marco Fisichella, Wolfgang Nejdl Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22194 Fri, 02 Jun 2023 00:00:00 -0700 An Open-Source Cultural Consensus Approach to Name-Based Gender Classification https://ojs.aaai.org/index.php/ICWSM/article/view/22195 Name-based gender classification has enabled hundreds of otherwise infeasible scientific studies of gender. Yet, the lack of standardization, reliance on paid services, understudied limitations, and conceptual debates cast a shadow over many applications. To address these problems we develop and evaluate an ensemble-based open-source method built on publicly available data of empirical name-gender associations. Our method integrates 36 distinct sources—spanning over 150 countries and more than a century—via a meta-learning algorithm inspired by Cultural Consensus Theory (CCT).
We also construct a taxonomy with which names themselves can be classified. We find that our method's performance is competitive with paid services and that our method, and others, approach the upper limits of performance; we show that conditioning estimates on additional metadata (e.g. cultural context), further combining methods, or collecting additional name-gender association data is unlikely to meaningfully improve performance. This work definitively shows that name-based gender classification can be a reliable part of scientific research and provides a pair of tools, a classification method and a taxonomy of names, that realize this potential. Ian Van Buskirk, Aaron Clauset, Daniel B. Larremore Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22195 Fri, 02 Jun 2023 00:00:00 -0700 Reddit in the Time of COVID https://ojs.aaai.org/index.php/ICWSM/article/view/22196 When the COVID-19 pandemic hit, much of life moved online. Platforms of all types reported surges of activity, and people remarked on the various important functions that online platforms suddenly fulfilled. However, researchers lack a rigorous understanding of the pandemic's impacts on social platforms---and whether they were temporary or long-lasting. We present a conceptual framework for studying the large-scale evolution of social platforms and apply it to the study of Reddit's history, with a particular focus on the COVID-19 pandemic. We study platform evolution through two key dimensions: structure vs. content and macro- vs. micro-level analysis. Structural signals help us quantify how much behavior changed, while content analysis clarifies exactly how it changed. Applying these at the macro-level illuminates platform-wide changes, while at the micro-level we study impacts on individual users. 
We illustrate the value of this approach by showing the extraordinary and ordinary changes Reddit went through during the pandemic. First, we show that typically when rapid growth occurs, it is driven by a few concentrated communities and within a narrow slice of language use. However, Reddit's growth throughout COVID-19 was spread across disparate communities and languages. Second, all groups were equally affected in their change of interest, but veteran users tended to invoke COVID-related language more than newer users. Third, the new wave of users that arrived following COVID-19 was fundamentally different from previous cohorts of new users in terms of interests, activity, and likelihood of staying active on the platform. These findings provide a more rigorous understanding of how an online platform changed during the global pandemic. Veniamin Veselovsky, Ashton Anderson Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22196 Fri, 02 Jun 2023 00:00:00 -0700 Identifying and Characterizing Behavioral Classes of Radicalization within the QAnon Conspiracy on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22197 Social media provide a fertile ground where conspiracy theories and radical ideas can flourish, reach broad audiences, and sometimes lead to hate or violence beyond the online world itself. QAnon represents a notable example of a political conspiracy that started out on social media but turned mainstream, in part due to public endorsement by influential political figures. Nowadays, QAnon conspiracies often appear in the news, are part of political rhetoric, and are espoused by significant swaths of people in the United States. It is therefore crucial to understand how such a conspiracy took root online, and what led so many social media users to adopt its ideas. 
In this work, we propose a framework that exploits both social interaction and content signals to uncover evidence of user radicalization or support for QAnon. Leveraging a large dataset of 240M tweets collected in the run-up to the 2020 US Presidential election, we define and validate a multivariate metric of radicalization. We use that to separate users in distinct, naturally-emerging, classes of behaviors associated with radicalization processes, from self-declared QAnon supporters to hyper-active conspiracy promoters. We also analyze the impact of Twitter's moderation policies on the interactions among different classes: we discover aspects of moderation that succeed, yielding a substantial reduction in the endorsement received by hyperactive QAnon accounts. But we also uncover where moderation fails, showing how QAnon content amplifiers are not deterred or affected by the Twitter intervention. Our findings refine our understanding of online radicalization processes, reveal effective and ineffective aspects of moderation, and call for the need to further investigate the role social media play in the spread of conspiracies. Emily L. Wang, Luca Luceri, Francesco Pierri, Emilio Ferrara Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22197 Fri, 02 Jun 2023 00:00:00 -0700 AnnoBERT: Effectively Representing Multiple Annotators’ Label Choices to Improve Hate Speech Detection https://ojs.aaai.org/index.php/ICWSM/article/view/22198 Supervised machine learning approaches often rely on a "ground truth" label. However, obtaining one label through majority voting ignores the important subjectivity information in tasks such hate speech detection. Existing neural network models principally regard labels as categorical variables, while ignoring the semantic information in diverse label texts. 
In this paper, we propose AnnoBERT, a first-of-its-kind architecture that integrates annotator characteristics and label text with a transformer-based model to detect hate speech. AnnoBERT builds unique representations based on each annotator's characteristics via Collaborative Topic Regression (CTR) and integrates label text to enrich textual representations. During training, the model associates annotators with their label choices given a piece of text; during evaluation, when label information is not available, the model predicts the aggregated label given by the participating annotators by utilising the learnt association. The proposed approach displayed an advantage in detecting hate speech, especially in the minority class and edge cases with annotator disagreement. Improvement in the overall performance is the largest when the dataset is more label-imbalanced, suggesting its practical value in identifying real-world hate speech, as the volume of hate speech in-the-wild is extremely small on social media, when compared with normal (non-hate) speech. Through ablation studies, we show the relative contributions of annotator embeddings and label text to the model performance, and test a range of alternative annotator embeddings and label text combinations. Wenjie Yin, Vibhor Agarwal, Aiqi Jiang, Arkaitz Zubiaga, Nishanth Sastry Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22198 Fri, 02 Jun 2023 00:00:00 -0700 Unique in What Sense? Heterogeneous Relationships between Multiple Types of Uniqueness and Popularity in Music https://ojs.aaai.org/index.php/ICWSM/article/view/22199 How does our society appreciate the uniqueness of cultural products? This fundamental puzzle has intrigued scholars in many fields, including psychology, sociology, anthropology, and marketing. It has been theorized that cultural products that balance familiarity and novelty are more likely to become popular.
However, a cultural product's novelty is typically multifaceted. This paper uses songs as a case study to examine the multiple facets of uniqueness and their relationship with success. We first unpack the multiple facets of a song's novelty or uniqueness and, next, measure its impact on a song's popularity. We employ a series of statistical models to study the relationship between a song's popularity and novelty associated with its lyrics, chord progressions, or audio properties. Our analyses performed on a dataset of over fifty thousand songs find a consistently negative association between all types of song novelty and popularity. Overall, we found a song's lyrics uniqueness to have the most significant association with its popularity. However, audio uniqueness was the strongest predictor of a song's popularity, conditional on the song's genre. We further found the theme and repetitiveness of a song's lyrics to mediate the relationship between the song's popularity and novelty. Broadly, our results contradict the "optimal distinctiveness theory'' (balance between novelty and familiarity) and call for an investigation into the multiple dimensions along which a cultural product's uniqueness could manifest. Yulin Yu, Pui Yin Cheung, Yong-Yeol Ahn, Paramveer S. Dhillon Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22199 Fri, 02 Jun 2023 00:00:00 -0700 Conversation Modeling to Predict Derailment https://ojs.aaai.org/index.php/ICWSM/article/view/22200 Conversations among online users sometimes derail, i.e., break down into personal attacks. Derailment interferes with the healthy growth of communities in cyberspace. The ability to predict whether an ongoing conversation will derail could provide valuable advance, even real-time, insight to both interlocutors and moderators.
Prior approaches predict conversation derailment retrospectively without the ability to forestall the derailment proactively. Some existing works attempt to make dynamic predictions as the conversation develops, but fail to incorporate multisource information, such as conversational structure and distance to derailment. We propose a hierarchical transformer-based framework that combines utterance-level and conversation-level information to capture fine-grained contextual semantics. We propose a domain-adaptive pretraining objective to unite conversational structure information and a multitask learning scheme to leverage the distance from each utterance to derailment. An evaluation of our framework on two conversation derailment datasets shows an improvement in F1 score for the prediction of derailment. These results demonstrate the effectiveness of incorporating multisource information for predicting the derailment of a conversation. Jiaqing Yuan, Munindar P. Singh Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22200 Fri, 02 Jun 2023 00:00:00 -0700 Minority Stress Experienced by LGBTQ Online Communities during the COVID-19 Pandemic https://ojs.aaai.org/index.php/ICWSM/article/view/22201 The COVID-19 pandemic has disproportionately impacted the lives of minorities, such as members of the LGBTQ community (lesbian, gay, bisexual, transgender, and queer) due to pre-existing social disadvantages and health disparities. Although extensive research has been carried out on the impact of the COVID-19 pandemic on different aspects of the general population's lives, few studies are focused on the LGBTQ population. 
In this paper, we develop and evaluate two sets of machine learning classifiers using a pre-pandemic and a during-pandemic dataset to identify Twitter posts exhibiting minority stress, which is a unique pressure faced by the members of the LGBTQ population due to their sexual and gender identities. We demonstrate that our best pre- and during-pandemic models show strong and stable performance for detecting posts that contain minority stress. We investigate the linguistic differences in minority stress posts across pre- and during-pandemic periods. We find that anger words are strongly associated with minority stress during the COVID-19 pandemic. We explore the impact of the pandemic on the emotional states of the LGBTQ population by adopting propensity score-based matching to perform a causal analysis. The results show that the LGBTQ population has a greater increase in the usage of cognitive words and a worsened pattern in the usage of positive emotion words than the group of the general population with similar pre-pandemic behavioral attributes. Our findings have implications for the public health domain and policy-makers to provide adequate support, especially with respect to mental health, to the LGBTQ population during future crises. Yunhao Yuan, Gaurav Verma, Barbara Keller, Talayeh Aledavood Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22201 Fri, 02 Jun 2023 00:00:00 -0700 How Circadian Rhythms Extracted from Social Media Relate to Physical Activity and Sleep https://ojs.aaai.org/index.php/ICWSM/article/view/22202 Circadian rhythm has been linked to both physical and mental health at an individual level in prior research. Such a link at population level has been long hypothesized but has never been tested, largely because of lack of data.
To help fill this gap in the literature, we need: a dataset on population-level circadian rhythms, a dataset on population-level health conditions, and strong associations between these two partly independent sets. Recent work has shown that affect on social media data relates to population-level circadian rhythms. Building upon that work, we extracted five circadian rhythm metrics from 6M Reddit posts across 18 major cities (for which the number of residents is highly correlated with the number of users), and paired them with three ground-truth health metrics (daily number of steps, sleep quantity, and sleep quality) extracted from 233K wearable users in these cities. We found that rhythms of online activity approximated sleeping patterns rather than, what the literature previously hypothesized, alertness levels. Despite that, we found that these rhythms, when computed in two specific times of the day (i.e., late at night and early morning), were still predictive of the three ground-truth health metrics: in general, healthier cities had morning spikes on social media, night dips, and expressions of positive affect. These results suggest that circadian rhythms on social media, if taken at two specific times of the day and operationalized with literature-driven metrics, can approximate the temporal evolution of people's shared underlying biological rhythm as it relates to physical activity (R2=0.492), sleep quantity (R2=0.765), and sleep quality (R2=0.624). Ke Zhou, Marios Constantinides, Daniele Quercia, Sanja Šćepanović Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22202 Fri, 02 Jun 2023 00:00:00 -0700 Who Is behind a Trend? Temporal Analysis of Interactions among Trend Participants on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22203 Trends are a fundamental component of today's fast-evolving media landscape.
Still, a lot of questions about who participates in such trends remain unanswered. Are trends driven by individual actors, or do interactions between actors reveal community structures? If so, do those structures change during the life cycle of a trend or between topically similar trends? In short: Who is behind a trend? This paper contributes to a better understanding of these questions and, in general, actor networks underlying trends on social media. As a case study, we leverage a large Twitter dataset from the EURO2020 soccer competition to detect and analyze topical trends. Our novel Gaussian fitting method allows separating trend life cycles into up- and down-trend components, as well as determining the duration of trends. An event-based evaluation proves good performance results. Given separate trend stages and topically similar trends at different points in time, we conduct a temporal analysis of the actor networks during trends. Our findings not only reveal a large overlap of participants between successive trends but also indicate large variations within a trend life cycle. Furthermore, actor networks seem to be centred around a small number of dominant users and communities. Those users also show large stability across similar trends over time. In contrast, temporally stable community structures are neither found within nor across topically similar trends. 
John Ziegler, Michael Gertz Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22203 Fri, 02 Jun 2023 00:00:00 -0700 Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis https://ojs.aaai.org/index.php/ICWSM/article/view/22204 The abundance of social media data in the Arab world, specifically on Twitter, enabled companies and entities to exploit such rich and beneficial data that could be mined and used to extract important information, including sentiments and opinions of people towards a topic or a merchandise. However, with this plenitude comes the issue of producing models that are able to deliver consistent outcomes when tested within various contexts. Although model generalization has been thoroughly investigated in many fields, it has not been heavily investigated in the Arabic context. To address this gap, we investigate the generalization of models and data in Arabic with application to sentiment analysis, by performing a battery of experiments and building different models that are tested on five independent test sets to understand their performance when presented with unseen data. In doing so, we detail different techniques that improve the generalization of machine learning models in Arabic sentiment analysis, and share a large versatile dataset consisting of approximately 1.64M Arabic tweets and their corresponding sentiment to be used for future research. Our experiments concluded that the most consistent model is trained using a dataset labelled by a cascaded approach of two models, one that labels neutral tweets and another that identifies positive/negative tweets based on the Arabic emoji lexicon after class balancing. 
Both the BERT and the SVM models trained using the refined data achieve average F1 scores of 0.62 and 0.60, with standard deviations of 0.06 and 0.04 respectively, when evaluated on five diverse test sets, outperforming other models by at least 17% relative gain in F1. Based on our experiments, we share recommendations to improve model generalization for classification tasks. Samir Abdaljalil, Shaimaa Hassanein, Hamdy Mubarak, Ahmed Abdelali Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22204 Fri, 02 Jun 2023 00:00:00 -0700 A Multi-Platform Collection of Social Media Posts about the 2022 U.S. Midterm Elections https://ojs.aaai.org/index.php/ICWSM/article/view/22205 Social media are utilized by millions of citizens to discuss important political issues. Politicians use these platforms to connect with the public and broadcast policy positions. Therefore, data from social media has enabled many studies of political discussion. While most analyses are limited to data from individual platforms, people are embedded in a larger information ecosystem spanning multiple social networks. Here we describe and provide access to the Indiana University 2022 U.S. Midterms Multi-Platform Social Media Dataset (MEIU22), a collection of social media posts from Twitter, Facebook, Instagram, Reddit, and 4chan. MEIU22 links to posts about the midterm elections based on a comprehensive list of keywords and tracks the social media accounts of 1,011 candidates from October 1 to December 25, 2022. We also publish the source code of our pipeline to enable similar multi-platform research projects. Rachith Aiyappa, Matthew R.
DeVerna, Manita Pote, Bao Tran Truong, Wanying Zhao, David Axelrod, Aria Pessianzadeh, Zoher Kachwala, Munjung Kim, Ozgur Can Seckin, Minsuk Kim, Sunny Gandhi, Amrutha Manikonda, Francesco Pierri, Filippo Menczer, Kai-Cheng Yang Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22205 Fri, 02 Jun 2023 00:00:00 -0700 Wiki-Based Communities of Interest: Demographics and Outliers https://ojs.aaai.org/index.php/ICWSM/article/view/22206 In this paper, we release data about demographic information and outliers of communities of interest. Identified from Wiki-based sources, mainly Wikidata, the data covers 7.5k communities, e.g., members of the White House Coronavirus Task Force, and 345k subjects, e.g., Deborah Birx. We describe the statistical inference methodology adopted to mine such data. We release subject-centric and group-centric datasets in JSON format, as well as a browsing interface. Finally, we foresee three areas where this dataset can be useful: in social sciences research, it provides a resource for demographic analyses; in web-scale collaborative encyclopedias, it serves as an edit recommender to fill knowledge gaps; and in web search, it offers lists of salient statements about queried subjects for higher user engagement. The dataset can be accessed at: https://doi.org/10.5281/zenodo.7410436 Hiba Arnaout, Simon Razniewski, Jeff Z. Pan Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22206 Fri, 02 Jun 2023 00:00:00 -0700 #RoeOverturned: Twitter Dataset on the Abortion Rights Controversy https://ojs.aaai.org/index.php/ICWSM/article/view/22207 On June 24, 2022, the United States Supreme Court overturned landmark rulings made in its 1973 verdict in Roe v. Wade. The justices, by way of a majority vote in Dobbs v.
Jackson Women's Health Organization, decided that abortion was not a constitutional right and returned the issue of abortion to the elected representatives. This decision triggered multiple protests and debates across the US, especially in the context of the midterm elections in November 2022. Given that many citizens use social media platforms to express their views and mobilize for collective action, and given that online debate has tangible effects on public opinion, political participation, news media coverage, and political decision-making, it is crucial to understand online discussions surrounding this topic. Toward this end, we present the first large-scale Twitter dataset collected on the abortion rights debate in the United States. We present a set of 74M tweets systematically collected over the course of one year from January 1, 2022 to January 6, 2023. Rong-Ching Chang, Ashwin Rao, Qiankun Zhong, Magdalena Wojcieszak, Kristina Lerman Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22207 Fri, 02 Jun 2023 00:00:00 -0700 Tweets in Time of Conflict: A Public Dataset Tracking the Twitter Discourse on the War between Ukraine and Russia https://ojs.aaai.org/index.php/ICWSM/article/view/22208 On February 24, 2022, Russia invaded Ukraine. In the days that followed, reports kept flooding in from laymen to news anchors of a conflict quickly escalating into war. Russia faced immediate backlash and condemnation from the world at large. While the war continues to contribute to an ongoing humanitarian and refugee crisis in Ukraine, a second battlefield has emerged in the online space, both in the use of social media to garner support for both sides of the conflict and also in the context of information warfare.
In this paper, we present a collection of nearly half a billion tweets, from February 22, 2022, through January 8, 2023, that we are publishing for the wider research community to use. This dataset can be found at https://github.com/echen102/ukraine-russia. Our preliminary analysis on a subset of our dataset already shows evidence of public engagement with Russian state-sponsored media and other domains that are known to push unreliable information towards the beginning of the war; the former saw a spike in activity on the day of the Russian invasion, while the latter saw spikes in engagement within the first month of the war. Our hope is that this public dataset can help the research community to further understand the ever-evolving role that social media plays in information dissemination, influence campaigns, grassroots mobilization, and much more, during a time of conflict. Emily Chen, Emilio Ferrara Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22208 Fri, 02 Jun 2023 00:00:00 -0700 HateMM: A Multi-Modal Dataset for Hate Video Classification https://ojs.aaai.org/index.php/ICWSM/article/view/22209 Hate speech has become one of the most significant issues in modern society, having implications in both the online and the offline world. Due to this, hate speech research has recently gained a lot of traction. However, most of the work has primarily focused on text media with relatively little work on images and even lesser on videos. Thus, early stage automated video moderation techniques are needed to handle the videos that are being uploaded to keep the platform safe and healthy. With a view to detect and remove hateful content from the video sharing platforms, our work focuses on hate video detection using multi-modalities.
To this end, we curate ~43 hours of videos from BitChute and manually annotate them as hate or non-hate, along with the frame spans which could explain the labelling decision. To collect the relevant videos we harnessed search keywords from hate lexicons. We observe various cues in images and audio of hateful videos. Further, we build deep learning multi-modal models to classify the hate videos and observe that using all the modalities of the videos improves the overall hate speech detection performance (accuracy=0.798, macro F1-score=0.790) by ~5.7% compared to the best uni-modal model in terms of macro F1 score. In summary, our work takes the first step toward understanding and modeling hateful videos on video hosting platforms such as BitChute. Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Manish Gupta, Animesh Mukherjee Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22209 Fri, 02 Jun 2023 00:00:00 -0700 HealthE: Recognizing Health Advice & Entities in Online Health Communities https://ojs.aaai.org/index.php/ICWSM/article/view/22210 The task of extracting and classifying entities is at the core of important Health-NLP systems such as misinformation detection, medical dialogue modeling, and patient-centric information tools. Granular knowledge of textual entities allows these systems to utilize knowledge bases, retrieve relevant information, and build graphical representations of texts. Unfortunately, most existing works on health entity recognition are trained on clinical notes, which are both lexically and semantically different from public health information found in online health resources or social media. In other words, existing health entity recognizers vastly under-represent the entities relevant to public health data, such as those provided by sites like WebMD. 
It is crucial that future Health-NLP systems be able to model such information, as people rely on online health advice for personal health management and clinically relevant decision making. In this work, we release a new annotated dataset, HealthE, which facilitates the large-scale analysis of online textual health advice. HealthE consists of 3,400 health advice statements with token-level entity annotations. Additionally, we release 2,256 health statements which are not health advice to facilitate health advice mining. HealthE is the first dataset with an entity-recognition label space designed for the modeling of online health advice. We motivate the need for HealthE by demonstrating the limitations of five widely-used health entity recognizers on HealthE, such as those offered by Google and Amazon. We additionally benchmark three pre-trained language models on our dataset as reference for future research. All data is made publicly available. Joseph Gatto, Parker Seegmiller, Garrett M Johnston, Madhusudan Basak, Sarah Masud Preum Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22210 Fri, 02 Jun 2023 00:00:00 -0700 Truth Social Dataset https://ojs.aaai.org/index.php/ICWSM/article/view/22211 Formally announced to the public following former President Donald Trump’s bans and suspensions from mainstream social networks in early 2022 following his role in the January 6 Capitol Riots, Truth Social was launched as an ``alternative'' social media platform that claims to be a refuge for free speech, offering a platform for those disaffected by the content moderation policies of then existing, mainstream social networks. The subsequent rise of Truth Social has been driven largely by hard-line supporters of the former president as well as those affected by the content moderation of other social networks. 
These distinct qualities, combined with its status as the main mouthpiece of the former president, position Truth Social as a particularly influential social media platform and give rise to several research questions. However, outside of a handful of news reports, little is known about the new social media platform, partly due to a lack of well-curated data. In the current work, we describe a dataset of over 823,000 posts to Truth Social and a social network with over 454,000 distinct users. In addition to the dataset itself, we also present some basic analysis of its content, certain temporal features, and its network. Patrick Gerard, Nicholas Botzer, Tim Weninger Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22211 Fri, 02 Jun 2023 00:00:00 -0700 Construction of Evaluation Datasets for Trend Forecasting Studies https://ojs.aaai.org/index.php/ICWSM/article/view/22212 In this study, we discuss issues in the traditional evaluation norms of trend forecasts, outline a suitable evaluation method, propose an evaluation dataset construction procedure, and publish Trend Dataset: the dataset we have created. As trend predictions often yield economic benefits, trend forecasting studies have been widely conducted. However, a consistent and systematic evaluation protocol has yet to be adopted. We consider that the desired evaluation method would address the performance of predicting which entity will trend, when a trend occurs, and how much it will trend, based on a reliable indicator of the general public's recognition as a gold standard. Accordingly, we propose a dataset construction method that includes annotations for trending status (trending or non-trending), degree of trending (how well it is recognized), and the trend period corresponding to a surge in recognition rate.
The proposed method uses questionnaire-based recognition rates interpolated using Internet search volume, enabling trend period annotation on a weekly timescale. The main novelty is that we survey when the respondents recognize the entities that are highly likely to have trended and those that have not. This procedure enables a balanced collection of both trending and non-trending entities. We constructed the dataset and verified its quality. We confirmed that interest in entities, estimated using Wikipedia information, enables the efficient collection of trending entities a priori. We also confirmed that the Internet search volume agrees with the public recognition rate among trending entities. Shogo Matsuno, Sakae Mizuki, Takeshi Sakaki Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22212 Fri, 02 Jun 2023 00:00:00 -0700 VaxxHesitancy: A Dataset for Studying Hesitancy towards COVID-19 Vaccination on Twitter https://ojs.aaai.org/index.php/ICWSM/article/view/22213 Vaccine hesitancy has been a common concern, probably since vaccines were created, and, with the popularisation of social media, people started to express their concerns about vaccines online alongside those posting pro- and anti-vaccine content. Predictably, since the first mentions of a COVID-19 vaccine, social media users have posted about their fears and concerns, or about their support and belief in the effectiveness of these rapidly developing vaccines. Identifying and understanding the reasons behind public hesitancy towards COVID-19 vaccines is important for policy makers who need to develop actions to better inform the population with the aim of increasing vaccine take-up. In the case of COVID-19, where the fast development of the vaccines was mirrored closely by growth in anti-vaxx disinformation, automatic means of detecting citizen attitudes towards vaccination became necessary.
This is an important computational social science task that requires data analysis in order to gain an in-depth understanding of the phenomena at hand. Annotated data is also necessary for training data-driven models for more nuanced analysis of attitudes towards vaccination. To this end, we created a new collection of over 3,101 tweets annotated with users' attitudes towards COVID-19 vaccination (stance). We also develop a domain-specific language model (VaxxBERT) that achieves the best predictive performance (73.0 accuracy and 69.3 F1-score) compared to a robust set of baselines. To the best of our knowledge, these are the first dataset and model to treat vaccine hesitancy as a category distinct from pro- and anti-vaccine stance. Yida Mu, Mali Jin, Charlie Grimshaw, Carolina Scarton, Kalina Bontcheva, Xingyi Song Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22213 Fri, 02 Jun 2023 00:00:00 -0700 Capturing the Aftermath of the Dobbs v. Jackson Women’s Health Organization Decision in Google Search Results across the U.S. https://ojs.aaai.org/index.php/ICWSM/article/view/22214 How do Google Search results change following an impactful real-world event, such as the U.S. Supreme Court decision on June 24, 2022 to overturn Roe v. Wade? And what do they tell us about the nature of event-driven content, generated by various participants in the online information environment? In this paper, we present a dataset of more than 1.74 million Google Search results pages collected between June 24 and July 17, 2022, intended to capture what Google Search surfaced in response to queries about this event of national importance. These search pages were collected for 65 locations in 13 U.S. states, a mix of red, blue, and purple states with respect to their voting patterns.
We describe the process of building a set of approximately 1,700 phrases used for searching Google, how we gathered the search results for each location, and how these results were parsed to extract information about the most frequently encountered web domains. We believe that this dataset, which comprises raw data (search results as HTML files) and processed data (extracted links organized as CSV files), can be used to answer research questions that are of interest to computational social scientists as well as communication and media studies scholars. Brooke Perreault, Lan Dau, Anya Wintner, Eni Mustafaraj Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22214 Fri, 02 Jun 2023 00:00:00 -0700 Just Another Day on Twitter: A Complete 24 Hours of Twitter Data https://ojs.aaai.org/index.php/ICWSM/article/view/22215 At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers.
Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change. Jürgen Pfeffer, Daniel Matter, Kokil Jaidka, Onur Varol, Afra Mashhadi, Jana Lasser, Dennis Assenmacher, Siqi Wu, Diyi Yang, Cornelia Brantner, Daniel M. Romero, Jahna Otterbacher, Carsten Schwemmer, Kenneth Joseph, David Garcia, Fred Morstatter Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22215 Fri, 02 Jun 2023 00:00:00 -0700 Codes, Patterns and Shapes of Contemporary Online Antisemitism and Conspiracy Narratives – an Annotation Guide and Labeled German-Language Dataset in the Context of COVID-19 https://ojs.aaai.org/index.php/ICWSM/article/view/22216 Over the course of the COVID-19 pandemic, existing conspiracy theories were refreshed and new ones were created, often interwoven with antisemitic narratives, stereotypes and codes. The sheer volume of antisemitic and conspiracy theory content on the Internet makes data-driven algorithmic approaches essential for anti-discrimination organizations and researchers alike. However, the manifestation and dissemination of these two interrelated phenomena are still under-researched in empirical studies of large text corpora. Algorithmic approaches for the detection and classification of specific contents usually require labeled datasets, annotated based on conceptually sound guidelines. While there is a growing number of datasets for the more general phenomenon of hate speech, the development of corpora and annotation guidelines for antisemitic and conspiracy content is still in its infancy, especially for languages other than English. To address this gap, we have developed an annotation guide for antisemitic and conspiracy theory online content in the context of the COVID-19 pandemic that includes working definitions, e.g.
of specific forms of antisemitism such as encoded and post-Holocaust antisemitism. We use the guide to annotate a German-language dataset consisting of ~3,700 Telegram messages sent between 03/2020 and 12/2021. Elisabeth Steffen, Helena Mihaljevic, Milena Pustet, Nyco Bischoff, Maria do Mar Castro Varela, Yener Bayramoglu, Bahar Oghalai Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22216 Fri, 02 Jun 2023 00:00:00 -0700 Invasion@Ukraine: Providing and Describing a Twitter Streaming Dataset That Captures the Outbreak of War between Russia and Ukraine in 2022 https://ojs.aaai.org/index.php/ICWSM/article/view/22217 Social media can be a mirror of human interaction, society, and historic disruptions. Their reach enables the global dissemination of information in the shortest possible time and, thus, the individual participation of people worldwide in global events in almost real-time. However, these platforms can be equally efficiently used in information warfare to manipulate human perception and opinion formation. Within this paper, we describe a dataset of raw tweets collected via the Twitter Streaming API in the context of the onset of the war, which Russia started in Ukraine on February 24, 2022. A distinctive feature of the dataset is that it covers the period from one week before to one week after Russia's invasion of Ukraine. This paper details the acquisition process and provides first insights into the content of the data stream. In addition, the data has been annotated with availability tags, resulting from rehydration attempts at two points in time: directly after data acquisition and shortly before manuscript submission. This may provide information on Twitter moderation policies. Further, we provide a detailed list of other published datasets covering the same topic.
On the content level, we can show that our dataset comprises several distinct topics related to the conflict and conspiracy narratives, topics that deserve deeper investigation. Therefore, the presented dataset is also made available to the community in an extended version with pseudonymized tweet content upon request. Janina Susanne Pohl, Simon Markmann, Dennis Assenmacher, Christian Grimme Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22217 Fri, 02 Jun 2023 00:00:00 -0700 YouNICon: YouTube’s CommuNIty of Conspiracy Videos https://ojs.aaai.org/index.php/ICWSM/article/view/22218 Conspiracy theories are widely propagated on social media. Among various social media services, YouTube is one of the most influential sources of news and entertainment. This paper seeks to develop a dataset, YOUNICON, to enable researchers to perform conspiracy theory detection as well as classification of videos with conspiracy theories into different topics. YOUNICON is a dataset with a large collection of videos from suspicious channels that were identified to contain conspiracy theories in a previous study. Overall, YOUNICON will enable researchers to study trends in conspiracy theories and understand how individuals can interact with the conspiracy theory producing community or channel. Our data is available at: https://doi.org/10.5281/zenodo.7466262. Shao Yi Liaw, Fan Huang, Fabricio Benevenuto, Haewoon Kwak, Jisun An Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22218 Fri, 02 Jun 2023 00:00:00 -0700 A Dataset of Coordinated Cryptocurrency-Related Social Media Campaigns https://ojs.aaai.org/index.php/ICWSM/article/view/22219 The rise in adoption of cryptoassets has brought many new and inexperienced investors into the cryptocurrency space.
These investors can be disproportionately influenced by information they receive online, and particularly from social media. This paper presents a dataset of crypto-related bounty events and the users that participate in them. These events coordinate social media campaigns to create artificial "hype" around a crypto project in order to influence the price of its token. The dataset consists of information about 15.8K cross-media bounty events, 185K participants, 10M forum comments and 82M social media URLs collected from the Bounties(Altcoins) subforum of the BitcoinTalk online forum from May 2014 to December 2022. We describe the data collection and the data processing methods employed and we present a basic characterization of the dataset. Furthermore, we discuss potential research opportunities afforded by the dataset across many disciplines and we highlight potential novel insights into how the cryptocurrency industry operates and how it interacts with its audience. Karolis Zilius, Tasos Spiliotopoulos, Aad van Moorsel Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22219 Fri, 02 Jun 2023 00:00:00 -0700 Divergences in Following Patterns between Influential Twitter Users and Their Audiences across Dimensions of Identity https://ojs.aaai.org/index.php/ICWSM/article/view/22220 Identity spans multiple dimensions; however, the relative salience of a dimension of identity can vary markedly from person to person. Furthermore, there is often a difference between one’s internal identity (how salient different aspects of one's identity are to oneself) and external identity (how salient different aspects are to the external world). We attempt to capture the internal and external saliences of different dimensions of identity for influential users (“influencers”) on Twitter using the follow graph.
We consider an influencer’s “ego-centric” profile, which is determined by their personal following patterns and is largely in their direct control, and their “audience-centric” profile, which is determined by the following patterns of their audience and is outside of their direct control. Using these following patterns, we calculate a corresponding salience metric that quantifies how important a certain dimension of identity is to an individual. We find that relative to their audiences, influencers exhibit more salience in race in their ego-centric profiles and less in religion and politics. One practical application of these findings is to identify "bridging" influencers that can connect their sizeable audiences to people from traditionally underheard communities. This could potentially increase the diversity of views audiences are exposed to through a trusted conduit (i.e., an influencer they already follow) and may lead to a greater voice for influencers from communities of color or women. Suyash Fulay, Nabeel Gillani, Deb Roy Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22220 Fri, 02 Jun 2023 00:00:00 -0700 Firearms on Twitter: A Novel Object Detection Pipeline https://ojs.aaai.org/index.php/ICWSM/article/view/22221 Social media is an important source of real-time imagery concerning world events. One subset of social media posts which may be of particular interest are those featuring firearms. These posts can give insight into weapon movements, troop activity and civilian safety. Object detection tools offer important opportunities for insight into these images. Unfortunately, these images can be visually complex, poorly lit and generally challenging for object detection models. We present an analysis of existing gun detection datasets and find that these datasets do not effectively address the challenge of gun detection on real-life images.
Following this, we present a novel object detection pipeline. We train our pipeline on a number of datasets, including one created for this investigation, composed of Twitter images from the Russo-Ukrainian War. We compare the performance of our model as trained on the different datasets to baseline numbers provided by the original authors, as well as to a YOLO v5 benchmark. We find that our model outperforms the state-of-the-art benchmarks on contextually rich, real-life-derived imagery of firearms. Ryan Harvey, Rémi Lebret, Stéphane Massonnet, Karl Aberer, Gianluca Demartini Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22221 Fri, 02 Jun 2023 00:00:00 -0700 Auditing Elon Musk’s Impact on Hate Speech and Bots https://ojs.aaai.org/index.php/ICWSM/article/view/22222 On October 27th, 2022, Elon Musk purchased Twitter, becoming its new CEO and firing many top executives in the process. Musk listed fewer restrictions on content moderation and removal of spam bots among his goals for the platform. Given findings of prior research on moderation and hate speech in online communities, the promise of less strict content moderation poses the concern that hate will rise on Twitter. We examine the levels of hate speech and prevalence of bots before and after Musk's acquisition of the platform. We find that hate speech rose dramatically upon Musk purchasing Twitter and the prevalence of most types of bots increased, while the prevalence of astroturf bots decreased. Daniel Hickey, Matheus Schmitz, Daniel Fessler, Paul E.
Smaldino, Goran Muric, Keith Burghardt Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22222 Fri, 02 Jun 2023 00:00:00 -0700 The Amplification Paradox in Recommender Systems https://ojs.aaai.org/index.php/ICWSM/article/view/22223 Automated audits of recommender systems found that blindly following recommendations leads users to increasingly partisan, conspiratorial, or false content. At the same time, studies using real user traces suggest that recommender systems are not the primary driver of attention toward extreme content; on the contrary, such content is mostly reached through other means, e.g., other websites. In this paper, we explain the following apparent paradox: if the recommendation algorithm favors extreme content, why is it not driving its consumption? With a simple agent-based model where users attribute different utilities to items in the recommender system, we show through simulations that the collaborative-filtering nature of recommender systems and the nicheness of extreme content can resolve the apparent paradox: although blindly following recommendations would indeed lead users to niche content, users rarely consume niche content when given the option because it is of low utility to them, which can lead the recommender system to deamplify such content. Our results call for a nuanced interpretation of "algorithmic amplification" and highlight the importance of modeling the utility of content to users when auditing recommender systems. Code available: https://github.com/epfl-dlab/amplification_paradox. 
Manoel Horta Ribeiro, Veniamin Veselovsky, Robert West Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22223 Fri, 02 Jun 2023 00:00:00 -0700 Host-Centric Social Connectedness of Migrants in Europe on Facebook https://ojs.aaai.org/index.php/ICWSM/article/view/22224 Extant literature has explored the social integration process of migrants settling in host communities. However, this literature typically takes a migrant-centric view, implicitly putting the burden of a successful integration on the migrant, and trying to identify the factors that lead to integration along various dimensions. In this paper, we flip this point of view by studying the attributes of natives that govern their propensity to form social ties with migrants. We do so by using anonymous and aggregate social network data provided by Facebook’s advertising platform. More specifically, we look at factors that influence the propensity for a likely-to-be non-Muslim Facebook user to have at least one social connection to a Facebook user who celebrates Ramadan. Given that, in the European context, following Islam is predominantly tied to a migration background, this gives us a lens into cross-cultural native-migrant connectivity. Our study considers demographic attributes of the host population, such as age, gender, and education level, as well as spatial variation across 30 European cities. Our findings suggest that young, educated, and male Facebook users are relatively more likely to build cross-cultural ties, compared to older, less educated, and female Facebook users. We also observe heterogeneity across the analyzed cities.
Aparup Khatua, Emilio Zagheni, Ingmar Weber Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22224 Fri, 02 Jun 2023 00:00:00 -0700 Characterizing Coin-Based Voting Governance in DPoS Blockchains https://ojs.aaai.org/index.php/ICWSM/article/view/22225 Delegated-Proof-of-Stake (DPoS) blockchains are governed by a committee of dozens of members elected via coin-based voting mechanisms. This paper presents a large-scale empirical study of two critical characteristics, personal impact and participation rate, of three leading DPoS blockchains. Our findings reveal the existence of decisive voters whose votes can alter election outcomes, as well as the fact that almost half of the coins have never been used in committee elections. Our research contributes to demystifying the actual use of coin-based voting governance and offers novel insights into the potential security risks of DPoS blockchains. Chao Li, Runhua Xu, Li Duan Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22225 Fri, 02 Jun 2023 00:00:00 -0700 Different Affordances on Facebook and SMS Text Messaging Do Not Impede Generalization of Language-Based Predictive Models https://ojs.aaai.org/index.php/ICWSM/article/view/22226 Adaptive mobile device-based health interventions often use machine learning models trained on non-mobile device data, such as social media text, due to the difficulty and high expense of collecting large text message (SMS) data. Therefore, understanding the differences and generalization of models between these platforms is crucial for proper deployment. We examined the psycho-linguistic differences between Facebook and text messages, and their impact on out-of-domain model performance, using a sample of 120 users who shared both. 
We found that users use Facebook for sharing experiences (e.g., leisure) and SMS for task-oriented and conversational purposes (e.g., plan confirmations), reflecting the differences in the affordances. To examine the downstream effects of these differences, we used pre-trained Facebook-based language models to estimate age, gender, depression, life satisfaction, and stress on both Facebook and SMS. We found no significant differences in correlations between the estimates and self-reports across 6 of 8 models. These results suggest using pre-trained Facebook language models to achieve better accuracy with just-in-time interventions. Tingting Liu, Salvatore Giorgi, Xiangyu Tao, Sharath Chandra Guntuku, Douglas Bellew, Brenda Curtis, Lyle Ungar Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22226 Fri, 02 Jun 2023 00:00:00 -0700 An Example of (Too Much) Hyper-Parameter Tuning In Suicide Ideation Detection https://ojs.aaai.org/index.php/ICWSM/article/view/22227 This work starts with the TWISCO baseline, a benchmark of suicide-related content from Twitter. We find that hyper-parameter tuning can improve this baseline by 9%. We examined 576 combinations of hyper-parameters: learning rate, batch size, epochs and date range of training data. Reasonable settings of learning rate and batch size produce better results than poor settings. Date range is less conclusive. Balancing the date range of the training data to match the benchmark ought to improve performance, but the differences are relatively small. Optimal settings of learning rate and batch size are much better than poor settings, but optimal settings of date range are not that different from poor settings of date range. Finally, we end with concerns about reproducibility. Of the 576 experiments, 10% produced F1 performance above baseline. 
It is common practice in the literature to run many experiments and report the best, but doing so may be risky, especially given the sensitive nature of suicide ideation detection. Annika Marie Schoene, John Ortega, Silvio Amir, Kenneth Church Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22227 Fri, 02 Jun 2023 00:00:00 -0700 The Half-Life of a Tweet https://ojs.aaai.org/index.php/ICWSM/article/view/22228 Twitter has started to share an impression count variable as part of the available public metrics for every Tweet collected with Twitter’s APIs. With the information about how often a particular Tweet has been shown to Twitter users at the time of data collection, we can learn important insights about the dissemination process of a Tweet by measuring its impression count repeatedly over time. With our preliminary analysis, we can show that the peak of impressions per second occurs, on average, 72 seconds after a Tweet was sent, and that after 24 hours, no relevant number of impressions can be observed for ∼95% of all Tweets. Finally, we estimate that the median half-life of a Tweet, i.e. the time it takes before half of all impressions are created, is about 80 minutes. Jürgen Pfeffer, Daniel Matter, Anahit Sargsyan Copyright (c) 2023 Association for the Advancement of Artificial Intelligence https://ojs.aaai.org/index.php/ICWSM/article/view/22228 Fri, 02 Jun 2023 00:00:00 -0700