Filtering Noisy Web Data by Identifying and Leveraging Users' Contributions
Keywords:noise filtering, bipartite graphs
In this paper we present several methods for collecting Web textual contents and filtering noisy data. We show that knowing which user publishes which contents can contribute to detecting noise. We begin by collecting data from two forums and from Twitter. For the forums, we extract the meaningful information from each discussion (texts of question and answers, IDs of users, date). For the Twitter dataset, we first detect tweets with very similar texts, which helps avoiding redundancy in further analysis. Also, this leads us to clusters of tweets that can be used in the same way as the forum discussions: they can be modeled by bipartite graphs. The analysis of nodes of the resulting graphs shows that network structure and content type (noisy or relevant) are not independent, so network studying can help in filtering noise.