Android Malware Detection with Weak Ground Truth Data

Authors

  • Jordan DeLoach, Kansas State University
  • Doina Caragea, Kansas State University
  • Xinming Ou, University of South Florida

DOI:

https://doi.org/10.1609/aaai.v31i1.11106

Keywords:

Android Malware, Malware Detection, Semi-supervised Learning

Abstract

For Android malware detection, precise ground truth is a rare commodity. As security knowledge evolves, what is considered ground truth at one moment may change, and apps once considered benign may turn out to be malicious. The inevitable noise in data labels makes it difficult to learn effective machine learning classifiers. Our work focuses on approaches for learning classifiers for Android malware detection in a manner that is methodologically sound with regard to the uncertain and ever-changing ground truth in the problem space. We leverage the fact that, although data labels are unavoidably noisy, a malware label is much more precise than a benign label: one can be confident that an app labeled malicious is malicious, but one can never be certain whether an app labeled benign is truly benign or merely undetected malware. Based on this insight, we use a modified Logistic Regression classifier that learns from only positive and unlabeled data, without making any assumptions about benign labels. We find that Label Regularized Logistic Regression performs well on noisy app datasets, as well as on datasets with a limited amount of positive labeled data, both of which are representative of real-world situations.
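
The abstract does not spell out the training objective, so the following is only a minimal sketch of one way to learn from positive and unlabeled data with a label-regularized logistic regression. It assumes an expectation-regularization style objective: the log-likelihood of the known-malware examples, an L2 penalty, and a KL-divergence term that pulls the mean predicted malware probability on the unlabeled apps toward an assumed prior proportion. The names `fit_label_regularized_lr`, `p_tilde`, `lam_u`, `lam_l2`, and the synthetic two-dimensional features are all hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_label_regularized_lr(X_pos, X_unl, p_tilde, lam_u=10.0,
                             lam_l2=0.01, lr=0.05, n_iter=3000):
    """Gradient descent on an assumed positive-and-unlabeled objective:

        -mean(log p(malware | x_pos))
        + lam_u * KL(p_tilde || mean p(malware | x_unl))
        + 0.5 * lam_l2 * ||w||^2
    """
    w = np.zeros(X_pos.shape[1])
    b = 0.0
    eps = 1e-9
    for _ in range(n_iter):
        # Known-malware term: gradient of -mean(log sigmoid(z)) w.r.t. z.
        s_pos = sigmoid(X_pos @ w + b)
        g_pos = -(1.0 - s_pos) / len(X_pos)

        # Unlabeled term: compare the mean prediction with the assumed prior.
        s_unl = sigmoid(X_unl @ w + b)
        p_hat = np.clip(s_unl.mean(), eps, 1.0 - eps)
        # d/dp_hat of KL(p_tilde || p_hat) for a Bernoulli pair.
        dkl = -p_tilde / p_hat + (1.0 - p_tilde) / (1.0 - p_hat)
        g_unl = lam_u * dkl * s_unl * (1.0 - s_unl) / len(X_unl)

        grad_w = X_pos.T @ g_pos + X_unl.T @ g_unl + lam_l2 * w
        grad_b = g_pos.sum() + g_unl.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic stand-in for app feature vectors (purely illustrative).
rng = np.random.default_rng(0)
n_hidden_pos, n_neg = 40, 160
X_pos = rng.normal(2.0, 1.0, size=(50, 2))            # labeled malware
X_unl = np.vstack([
    rng.normal(2.0, 1.0, size=(n_hidden_pos, 2)),     # undetected malware
    rng.normal(-2.0, 1.0, size=(n_neg, 2)),           # truly benign apps
])
p_tilde = 0.2  # assumed malware share among the unlabeled apps

w, b = fit_label_regularized_lr(X_pos, X_unl, p_tilde)
p_hat_unl = sigmoid(X_unl @ w + b).mean()
print("mean predicted malware rate on unlabeled apps:", round(p_hat_unl, 3))
```

Note that no benign label enters the objective: the apps marked benign are treated as unlabeled, and only the assumed proportion `p_tilde` constrains them. In practice that proportion would have to be estimated or tuned, which this sketch does not attempt.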

Published

2017-02-12

How to Cite

DeLoach, J., Caragea, D., & Ou, X. (2017). Android Malware Detection with Weak Ground Truth Data. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). https://doi.org/10.1609/aaai.v31i1.11106