Leveraging Large Language Models for Review Classification and Rating Estimation of Mental Health Applications
DOI:
https://doi.org/10.1609/icwsm.v19i1.35916
Abstract
Large Language Models (LLMs) can analyze large datasets semantically. However, research on applying LLMs to mental health text classification is still new and developing. Existing methods often use supervised, deep, and reinforcement learning, which rely heavily on fine-tuning and reward models. To investigate whether LLMs can assist in recommending mental health apps based on user reviews, we collected approximately 200k user reviews from 73 mental health mobile applications. We instructed selected LLMs to classify individual reviews into 1-5 star ratings, then averaged these results to derive an overall rating for each app that reflects current user feedback. The best supervised learning method in our experiments achieved an F1-Score of 0.79 but required significantly more human effort, whereas GPT-4 and Gemini 1.5 Pro delivered strong ‘out-of-the-box’ performance with an overall F1-Score of 0.76. We provide further statistical comparisons and discussion of these models' performance on the text classification task. Using a crowdsourcing platform to determine agreement levels, we observed that human ratings align closely with GPT ratings. In addition, we analyze specific features and concerns highlighted in mental health app reviews. Alongside our analysis, we make our data available for further experimentation and benchmarking.
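The rating-estimation step described in the abstract (classify each review into a 1-5 star label, then average per app) can be illustrated with a minimal sketch. This is not the authors' code: the function name `estimate_app_ratings`, the `(app_id, stars)` input format, and the sample data are all hypothetical, and the classifier that produces the star labels is assumed to exist elsewhere.

```python
from collections import defaultdict


def estimate_app_ratings(predictions):
    """Average per-review star predictions into one rating per app.

    `predictions` is an iterable of (app_id, stars) pairs, where `stars`
    is the 1-5 label a classifier assigned to a single review.
    (Hypothetical input format, chosen for illustration only.)
    """
    totals = defaultdict(lambda: [0, 0])  # app_id -> [sum of stars, review count]
    for app_id, stars in predictions:
        if not 1 <= stars <= 5:
            raise ValueError(f"star rating out of range: {stars}")
        totals[app_id][0] += stars
        totals[app_id][1] += 1
    # Plain arithmetic mean of the predicted labels for each app.
    return {app_id: total / count for app_id, (total, count) in totals.items()}


# Made-up predictions for two made-up apps.
sample = [("app_a", 5), ("app_a", 4), ("app_a", 3), ("app_b", 1), ("app_b", 2)]
ratings = estimate_app_ratings(sample)
```

Here `ratings` maps each app identifier to the mean of its predicted star labels. An unweighted mean is assumed; the abstract does not say whether the study weights reviews in any other way.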
Published
2025-06-07
How to Cite
Wang, Q., Erqsous, M., Khatiwada, P., Karwankar, A., Alhassan, F. M., Chandrasekaran, A., … Mauriello, M. L. (2025). Leveraging Large Language Models for Review Classification and Rating Estimation of Mental Health Applications. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2017–2029. https://doi.org/10.1609/icwsm.v19i1.35916
Section
Full Papers