Leveraging Large Language Models for Review Classification and Rating Estimation of Mental Health Applications

Authors

  • Qile Wang, University of Delaware, USA
  • Moath Erqsous, University of Delaware, USA
  • Prerana Khatiwada, University of Delaware, USA
  • Abhishek Karwankar, University of Delaware, USA
  • Fatimah Mohammad Alhassan, University of Delaware, USA
  • Aishwarya Chandrasekaran, University of Delaware, USA
  • Benita Abraham, University of Delaware, USA
  • Faith Lovell, University of Delaware, USA
  • Andrew Anh Ngo, University of Delaware, USA
  • Matthew Louis Mauriello, University of Delaware, USA

DOI:

https://doi.org/10.1609/icwsm.v19i1.35916

Abstract

Large Language Models (LLMs) can analyze large datasets semantically. However, research on applying LLMs to mental health text classification is still new and developing. Existing methods often use supervised, deep, and reinforcement learning, which rely heavily on fine-tuning and reward models. To investigate whether LLMs can assist in recommending mental health apps based on user reviews, we collected approximately 200k user reviews from 73 mental health mobile applications. We instructed selected LLMs to classify individual reviews into 1-5 star ratings, then averaged these results to derive an overall rating for each app that reflects current user feedback. The best supervised learning method in our experiments achieved an F1-Score of 0.79 but required significantly more human effort, whereas GPT-4 and Gemini 1.5 Pro delivered strong ‘out-of-the-box’ performance with an overall F1-Score of 0.76. We provide further statistical comparisons and discussion of these models' performance on the text classification task. Using a crowdsourcing platform to determine agreement levels, we observed that human ratings align closely with GPT ratings. In addition, we analyze specific features and concerns highlighted in mental health app reviews. Alongside our analysis, we make our data available for further experimentation and benchmarking.
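
As an illustration of the pipeline described in the abstract, the following minimal Python sketch averages per-review star predictions into one rating per app and scores predictions against reference ratings. It is not the authors' implementation: the function names `estimate_app_ratings` and `macro_f1` are hypothetical, the input format is assumed, and macro averaging is an assumption because the abstract does not state how the overall F1-Score was averaged.

```python
from collections import defaultdict


def estimate_app_ratings(predictions):
    """Average per-review star predictions into one rating per app.

    predictions: iterable of (app_name, stars) pairs, stars in 1..5.
    The pair format is an assumption made for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # app -> [sum of stars, review count]
    for app, stars in predictions:
        if not 1 <= stars <= 5:
            raise ValueError(f"star rating out of range: {stars}")
        totals[app][0] += stars
        totals[app][1] += 1
    return {app: total / count for app, (total, count) in totals.items()}


def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that never appears in truth or prediction scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Calling `estimate_app_ratings` on the LLM-predicted stars for every review gives the per-app estimate, while `macro_f1` compares those predicted stars with the ratings users actually left.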

Published

2025-06-07

How to Cite

Wang, Q., Erqsous, M., Khatiwada, P., Karwankar, A., Alhassan, F. M., Chandrasekaran, A., … Mauriello, M. L. (2025). Leveraging Large Language Models for Review Classification and Rating Estimation of Mental Health Applications. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2017–2029. https://doi.org/10.1609/icwsm.v19i1.35916