Fast Nonparametric Estimation of Class Proportions in the Positive-Unlabeled Classification Setting
Estimating class proportions has emerged as an important direction in positive-unlabeled learning. Well-estimated class priors are key to accurate approximation of posterior distributions and are necessary for the recovery of true classification performance. While significant progress has been made in the past decade, there remains a need for accurate strategies that scale to big data. Motivated by this need, we propose an intuitive and fast nonparametric algorithm to estimate class proportions. Unlike any of the previous methods, our algorithm uses a sampling strategy to repeatedly (1) draw an example from the set of positives, (2) record the minimum distance to any of the unlabeled examples, and (3) remove the nearest unlabeled example. We show that the point of sharp increase in the recorded distances corresponds to the desired proportion of positives in the unlabeled set and train a deep neural network to identify that point. Our distance-based algorithm is evaluated on forty datasets and compared to all currently available methods. We provide evidence that this new approach results in the most accurate performance and can be readily used on large datasets.