Efficient Filtering on Hidden Document Streams

Authors

  • Eduardo Ruiz University California Riverside
  • Vagelis Hristidis University California Riverside
  • Panagiotis Ipeirotis New York Univesity

DOI:

https://doi.org/10.1609/icwsm.v8i1.14537

Keywords:

Hidden Stream, Social Network Data Adquisition, Adaptive Document Selection

Abstract

Many online services like Twitter and GNIP offer streaming programming interfaces that allow real-time information filtering based on keyword or other conditions. However, all these services specify strict access constraints, or charge a cost based on the usage. We refer to such streams as ``hidden streams'' to draw a parallel to the well-studied hidden Web, which similarly restricts access to the contents of a database through a querying interface. At the same time, the users' interest is often captured by complex classification models that, implicitly or explicitly, specify hundreds of keyword-based rules, along with the rules' accuracies. In this paper, we study how to best utilize a constrained streaming access interface to maximize the number of retrieved relevant items, with respect to a classifier, expressed as a set of rules. We consider two problem variants. The static version assumes that the popularity of the keywords is known and constant across time. The dynamic version lifts this assumption, and can be viewed as an exploration-vs.-exploitation problem. We show that both problems are NP-hard, and propose exact and bounded approximation algorithms for various settings, including various access constraint types. We experimentally evaluate our algorithms on real Twitter data.

Downloads

Published

2014-05-16

How to Cite

Ruiz, E., Hristidis, V., & Ipeirotis, P. (2014). Efficient Filtering on Hidden Document Streams. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 446-455. https://doi.org/10.1609/icwsm.v8i1.14537