A Framework for Multistream Regression With Direct Density Ratio Estimation

Authors

  • Ahsanul Haque University of Texas at Dallas
  • Hemeng Tao University of Texas at Dallas
  • Swarup Chandra University of Texas at Dallas
  • Jie Liu University of Texas at Dallas
  • Latifur Khan University of Texas at Dallas

Keywords:

data stream, regression, density ratio estimation

Abstract

Regression over a data stream is challenging because the data is unbounded in size and its distribution is non-stationary over time. Typically, a traditional supervised regression model over a data stream is trained on data instances occurring within a short time period, assuming a stationary distribution. This model is later used to predict the value of the response variable in future instances. Over time, the model's performance may degrade due to changes in the data distribution among incoming instances. Updating the model to adapt to such changes requires the true value of the response variable for every recent data instance, and such labeled data is scarce in practice. To overcome this issue, recent studies have employed techniques that sample fewer instances for model retraining. Yet, this may introduce sampling bias that adversely affects model performance. In this paper, we study the regression problem over data streams in a novel setting. We consider two independent, yet related, non-stationary data streams, referred to as the source stream and the target stream. The target stream continuously generates data instances whose response-variable value is unknown. The source stream, however, continuously generates data instances along with the corresponding value of the response variable, and has a biased data distribution with respect to the target stream. We refer to the problem of using a model trained on the biased source stream to predict the response variable’s value in data instances occurring on the target stream as Multistream Regression. We describe a framework for multistream regression that, using a Gaussian kernel model, simultaneously overcomes distribution bias and detects changes over time in the data distributions represented by the two streams. We analyze the theoretical properties of the proposed approach and empirically evaluate it on both real-world and synthetic data sets. Importantly, our results indicate superior performance by the framework compared to baseline regression methods.
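
To make the idea of correcting source-to-target bias with a Gaussian kernel model concrete, the following is a minimal sketch and not the paper's implementation. It assumes a least-squares density ratio estimator (uLSIF) as a stand-in, since the abstract does not name the estimator used, and it pairs the estimated weights with a plain importance-weighted ridge regression. All function names and parameters (fit_density_ratio, weighted_ridge, sigma, lam, n_centers) are hypothetical and chosen for illustration only. Change detection is not covered.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel values between each row of x and each center."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_density_ratio(source_x, target_x, sigma=1.0, lam=0.1,
                      n_centers=50, seed=0):
    """Estimate w(x) = p_target(x) / p_source(x) as a Gaussian kernel model.

    Assumption: uLSIF-style least-squares fitting, used here only as an
    illustrative direct density ratio estimator.
    """
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(target_x))
    idx = rng.choice(len(target_x), size=n_centers, replace=False)
    centers = target_x[idx]  # kernel centers are drawn from target data

    phi_s = gaussian_kernel(source_x, centers, sigma)
    phi_t = gaussian_kernel(target_x, centers, sigma)

    # Regularized least-squares solution for the kernel coefficients.
    H = phi_s.T @ phi_s / len(source_x)
    h = phi_t.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(n_centers), h)
    alpha = np.maximum(alpha, 0.0)  # clip so the estimated ratio is non-negative

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return ratio


def weighted_ridge(x, y, weights, lam=1e-3):
    """Importance-weighted ridge regression; returns [coefficients..., intercept]."""
    X = np.hstack([x, np.ones((len(x), 1))])
    XtW = X.T * weights  # scales each sample's column by its weight
    return np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ y)
```

In use, one would call fit_density_ratio on a window of unlabeled target instances and labeled source instances, evaluate the returned function on the source instances to obtain weights, and pass those weights to weighted_ridge so that source instances resembling the target stream count more during training.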

Published

2018-04-29

How to Cite

Haque, A., Tao, H., Chandra, S., Liu, J., & Khan, L. (2018). A Framework for Multistream Regression With Direct Density Ratio Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/11820