Leveraging Large Language Models for Review Classification and Rating Estimation of Mental Health Applications

Qile Wang; Moath Erqsous; Prerana Khatiwada; Abhishek Karwankar; Fatimah Mohammad Alhassan; Aishwarya Chandrasekaran; Benita Abraham; Faith Lovell; Andrew Anh Ngo; Matthew Louis Mauriello

doi:10.1609/icwsm.v19i1.35916

Authors

Qile Wang, University of Delaware, USA
Moath Erqsous, University of Delaware, USA
Prerana Khatiwada, University of Delaware, USA
Abhishek Karwankar, University of Delaware, USA
Fatimah Mohammad Alhassan, University of Delaware, USA
Aishwarya Chandrasekaran, University of Delaware, USA
Benita Abraham, University of Delaware, USA
Faith Lovell, University of Delaware, USA
Andrew Anh Ngo, University of Delaware, USA
Matthew Louis Mauriello, University of Delaware, USA

Abstract

Large Language Models (LLMs) can analyze large datasets semantically. However, research on applying LLMs to mental health text classification is relatively new and still developing. Existing methods often use supervised, deep, and reinforcement learning, which rely heavily on fine-tuning and reward models. To investigate whether LLMs can assist in recommending mental health apps based on user reviews, we collected approximately 200k user reviews from 73 mental health mobile applications. We instructed selected LLMs to classify individual reviews into 1-5 star ratings, then averaged these results to derive an overall rating for each app that reflects current user feedback. The best supervised learning method in our experiments achieved an F1-Score of 0.79 but required significantly more human effort, whereas GPT-4 and Gemini 1.5 Pro delivered strong ‘out-of-the-box’ performance with an overall F1-Score of 0.76. We provide further statistical comparisons and discuss the performance of these models on the text classification task. Using a crowdsourcing platform to determine agreement levels, we observed that human ratings align closely with GPT ratings. In addition, we analyze specific features and concerns highlighted in mental health app reviews. Alongside our analysis, we make our data available for further experimentation and benchmarking.
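
The rating-estimation step described in the abstract (per-review star predictions averaged into one rating per app, then scored with F1) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the (app_id, stars) input format, and the choice of macro-averaged F1 are all assumptions, since the abstract does not state how the overall F1-Score is averaged.

```python
from collections import defaultdict


def estimate_app_ratings(predicted):
    """Average per-review star predictions into one rating per app.

    `predicted` is an iterable of (app_id, stars) pairs with stars in 1..5.
    The input format is a hypothetical choice made for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # app_id -> [sum of stars, review count]
    for app_id, stars in predicted:
        if not 1 <= stars <= 5:
            raise ValueError(f"star rating out of range: {stars}")
        totals[app_id][0] += stars
        totals[app_id][1] += 1
    return {app: s / n for app, (s, n) in totals.items()}


def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Here `estimate_app_ratings` would take the labels returned by an LLM for each review, and `macro_f1` would compare those labels against the star ratings users actually gave.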
