A Framework for Multistream Regression With Direct Density Ratio Estimation
Keywords:data stream, regression, density ratio estimation
Regression over a stream of data is challenging due to unbounded data size and non-stationary distribution over time. Typically, a traditional supervised regression model over a data stream is trained on data instances occurring within a short time period by assuming a stationary distribution. This model is later used to predict value of response-variable in future instances. Over time, the model may degrade in performance due to changes in data distribution among incoming data instances. Updating the model for change adaptation requires true value for every recent data instances, which is scarce in practice. To overcome this issue, recent studies have employed techniques that sample fewer instances to be used for model retraining. Yet, this may introduce sampling bias that adversely affects the model performance. In this paper, we study the regression problem over data streams in a novel setting. We consider two independent, yet related, non-stationary data streams, which are referred to as the source and the target stream. The target stream continuously generates data instances whose value of response variable is unknown. The source stream, however, continuously generates data instances along with corresponding value for the response-variable, and has a biased data distribution with respect to the target stream. We refer to the problem of using a model trained on the biased source stream to predict the response-variable’s value in data instances occurring on the target stream as Multistream Regression. In this paper, we describe a framework for multistream regression that simultaneously overcomes distribution bias and detects change in data distribution represented by the two streams over time using a Gaussian kernel model. We analyze the theoretical properties of the proposed approach and empirically evaluate it on both real-world and synthetic data sets. Importantly, our results indicate superior performance by the framework compared to other baseline regression methods.