Learning Student Networks with Few Data
Recently, the teacher-student learning paradigm has drawn much attention in compressing neural networks on low-end edge devices, such as mobile phones and wearable watches. Current algorithms mainly assume the complete dataset for the teacher network is also available for the training of the student network. However, for real-world scenarios, users may only have access to part of training examples due to commercial profits or data privacy, and severe over-fitting issues would happen as a result. In this paper, we tackle the challenge of learning student networks with few data by investigating the ground-truth data-generating distribution underlying these few data. Taking Wasserstein distance as the measurement, we assume this ideal data distribution lies in a neighborhood of the discrete empirical distribution induced by the training examples. Thus we propose to safely optimize the worst-case cost within this neighborhood to boost the generalization. Furthermore, with theoretical analysis, we derive a novel and easy-to-implement loss for training the student network in an end-to-end fashion. Experimental results on benchmark datasets validate the effectiveness of our proposed method.