Do Not Have Enough Data? Deep Learning to the Rescue!


  • Ateret Anaby-Tavor IBM Research
  • Boaz Carmeli IBM Research
  • Esther Goldbraich IBM Research
  • Amir Kantor IBM Research
  • George Kour IBM Research
  • Segev Shlomov IBM Research
  • Naama Tepper IBM Research
  • Naama Zwerdling IBM Research



Based on recent advances in natural language modeling and text generation, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning, focusing mainly on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), fine-tunes a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Given a class label, the fine-tuned model then generates new sentences for that class, which our process filters using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon state-of-the-art data augmentation techniques, specifically those applicable to text classification tasks with little data.
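The pipeline the abstract describes — over-generate labeled sentences with a fine-tuned language model, then keep only candidates the baseline classifier scores highly — can be sketched as follows. This is a minimal illustration, not the authors' code: the generator and classifier here are stand-in stubs for the fine-tuned language model and the classifier trained on the original data.

```python
# Minimal sketch of the LAMBADA augmentation loop.
# Both functions below are illustrative stubs, NOT the paper's models.

def generate_sentences(label, n):
    # Stub for the fine-tuned language generator: given a class label,
    # emit n candidate sentences conditioned on that label.
    return [f"{label} candidate {i}" for i in range(n)]

def classifier_confidence(sentence, label):
    # Stub for the classifier trained on the original labeled data:
    # return the predicted probability that `sentence` belongs to `label`.
    return 1.0 / (1 + len(sentence) % 5)  # placeholder score

def lambada_augment(labels, n_candidates=10, n_keep=3):
    """For each class, over-generate candidates with the generator,
    then keep only the n_keep sentences the classifier ranks highest."""
    augmented = {}
    for label in labels:
        candidates = generate_sentences(label, n_candidates)
        ranked = sorted(candidates,
                        key=lambda s: classifier_confidence(s, label),
                        reverse=True)
        augmented[label] = ranked[:n_keep]
    return augmented

synthetic = lambada_augment(["flight", "hotel"])
```

The key design point is the filtering step: because the generator may produce off-class or low-quality sentences, only the candidates the original-data classifier is most confident about are added to the training set.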




How to Cite

Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., Tepper, N., & Zwerdling, N. (2020). Do Not Have Enough Data? Deep Learning to the Rescue!. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7383-7390.



AAAI Technical Track: Natural Language Processing