The LGBTQ+ Minority Stress on Social Media (MiSSoM) Dataset: A Labeled Dataset for Natural Language Processing and Machine Learning

Authors

  • Cory J. Cascalheira, New Mexico State University
  • Santosh Chapagain, Utah State University
  • Ryan E. Flinn, University of North Dakota
  • Dannie Klooster, Oklahoma State University
  • Danica Laprade, Northern Arizona University
  • Yuxuan Zhao, New Mexico State University
  • Emily M. Lund, University of Alabama
  • Alejandra Gonzalez, Xavier University
  • Kelsey Corro, New Mexico State University
  • Rikki Wheatley, University of Oregon
  • Ana Gutierrez, New Mexico State University
  • Oziel Garcia Villanueva, New Mexico State University
  • Koustuv Saha, University of Illinois Urbana-Champaign
  • Munmun De Choudhury, Georgia Institute of Technology
  • Jillian R. Scheer, Syracuse University
  • Shah M. Hamdi, Utah State University

DOI

https://doi.org/10.1609/icwsm.v18i1.31433

Abstract

Minority stress is the leading theoretical construct for understanding LGBTQ+ health disparities. As such, there is an urgent need to develop innovative policies and technologies to reduce minority stress. To spur technological innovation, we created the largest labeled datasets on minority stress using natural language from subreddits related to sexual and gender minority people. A team of mental health clinicians, LGBTQ+ health experts, and computer scientists developed two datasets: (1) the publicly available LGBTQ+ Minority Stress on Social Media (MiSSoM) dataset and (2) the advanced request-only version of the dataset, LGBTQ+ MiSSoM+. Both datasets have seven labels related to minority stress, including an overall composite label and six sublabels. LGBTQ+ MiSSoM (N = 27,709) includes both human- and machine-annotated labels and comes preprocessed with features (e.g., topic models, psycholinguistic attributes, sentiment, clinical keywords, word embeddings, n-grams, lexicons). LGBTQ+ MiSSoM+ includes all the characteristics of the open-access dataset, but also includes the original Reddit text and sentence-level labeling for a subset of posts (N = 5,772). Benchmark supervised machine learning analyses revealed that features of the LGBTQ+ MiSSoM datasets can predict overall minority stress quite well (F1 = 0.869). Benchmark performance in predicting the other labels was excellent: prejudiced events (F1 = 0.942), expected rejection (F1 = 0.964), internalized stigma (F1 = 0.952), identity concealment (F1 = 0.971), gender dysphoria (F1 = 0.947), and minority coping (F1 = 0.917). Descriptive analyses, ethical considerations, limitations, and possible use cases are provided.

Published

2024-05-28

How to Cite

Cascalheira, C. J., Chapagain, S., Flinn, R. E., Klooster, D., Laprade, D., Zhao, Y., Lund, E. M., Gonzalez, A., Corro, K., Wheatley, R., Gutierrez, A., Garcia Villanueva, O., Saha, K., De Choudhury, M., Scheer, J. R., & Hamdi, S. M. (2024). The LGBTQ+ Minority Stress on Social Media (MiSSoM) Dataset: A Labeled Dataset for Natural Language Processing and Machine Learning. Proceedings of the International AAAI Conference on Web and Social Media, 18(1), 1888-1899. https://doi.org/10.1609/icwsm.v18i1.31433