Active Learning with Query Generation for Cost-Effective Text Classification
Labeling a text document is usually time consuming because it requires the annotator to read the whole document and check its relevance with each possible class label. It thus becomes rather expensive to train an effective model for text classification when it involves a large dataset of long documents. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one for query, the proposed method automatically generates the most informative examples based on the classification model, and thus can be applied to tasks with large scale or even infinite unlabeled data. Furthermore, we propose to approximate the generated example with a few summary words by sparse reconstruction, which allows the annotators to easily assign the class label by reading a few words rather than the long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve the classification performance while significantly reduce the annotation cost.