Filtering Noisy Web Data by Identifying and Leveraging Users' Contributions

Authors

  • Alina Stoica (EDF)

DOI:

https://doi.org/10.1609/icwsm.v6i1.14295

Keywords:

noise filtering, bipartite graphs

Abstract

In this paper we present several methods for collecting textual Web content and filtering out noisy data. We show that knowing which user publishes which content can help detect noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (text of the question and answers, user IDs, date). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoid redundancy in further analysis. This also yields clusters of tweets that can be used in the same way as the forum discussions: they can be modeled by bipartite graphs. The analysis of the nodes of the resulting graphs shows that network structure and content type (noisy or relevant) are not independent, so studying the network can help filter noise.
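
The pipeline described in the abstract (grouping near-duplicate tweets, then linking users to the clusters they contributed to) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure or threshold, so word-set Jaccard similarity with a cutoff of 0.8, the greedy clustering strategy, the function names, and the sample tweets are all hypothetical choices, not the author's method.

```python
from collections import defaultdict

def tokens(text):
    """Lowercased word set of a text (a deliberately crude tokenizer)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_near_duplicates(tweets, threshold=0.8):
    """Greedy single-pass clustering (an assumed strategy, not the paper's).

    tweets: list of (user_id, text) pairs.
    Each tweet joins the first cluster whose first member is similar
    enough, otherwise it starts a new cluster.
    Returns a list of clusters, each a list of tweet indices.
    """
    clusters = []
    for i, (_, text) in enumerate(tweets):
        tok = tokens(text)
        for cluster in clusters:
            representative = tokens(tweets[cluster[0]][1])
            if jaccard(tok, representative) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def bipartite_edges(tweets, clusters):
    """Bipartite graph as a mapping: cluster id -> set of contributing users."""
    graph = defaultdict(set)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            graph[cluster_id].add(tweets[i][0])
    return dict(graph)

# Hypothetical sample data, for illustration only.
tweets = [
    ("u1", "power outage in the north district tonight"),
    ("u2", "Power outage in the north district tonight"),
    ("u3", "win a free phone click here"),
    ("u1", "win a free phone click here"),
]
clusters = cluster_near_duplicates(tweets)
graph = bipartite_edges(tweets, clusters)
```

In the resulting structure, one side of the graph holds users and the other holds tweet clusters (or forum discussions), so per-node properties such as the number of clusters a user contributes to can be examined alongside content labels.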

Published

2021-08-03

How to Cite

Stoica, A. (2021). Filtering Noisy Web Data by Identifying and Leveraging Users’ Contributions. Proceedings of the International AAAI Conference on Web and Social Media, 6(1), 583-586. https://doi.org/10.1609/icwsm.v6i1.14295