Android Malware Detection with Weak Ground Truth Data

Jordan DeLoach; Doina Caragea; Xinming Ou

doi:10.1609/aaai.v31i1.11106

Authors

Jordan DeLoach, Kansas State University
Doina Caragea, Kansas State University
Xinming Ou, University of South Florida

Keywords:

Android Malware, Malware Detection, Semi-supervised Learning

Abstract

For Android malware detection, precise ground truth is a rare commodity. As security knowledge evolves, what is considered ground truth at one moment may change, and apps once considered benign may turn out to be malicious. The inevitable noise in data labels makes it difficult to learn effective machine learning classifiers. Our work focuses on approaches for learning classifiers for Android malware detection in a manner that is methodologically sound with regard to the uncertain and ever-changing ground truth in the problem space. We exploit the fact that although data labels are unavoidably noisy, a malware label is much more precise than a benign label: one can be confident that an app labeled malicious is malicious, but one can never be certain whether an app labeled benign is really benign or is just undetected malware. Based on this insight, we use a modified Logistic Regression classifier that allows us to learn from only positive and unlabeled data, without making any assumptions about benign labels. We find that Label Regularized Logistic Regression performs well on noisy app datasets, as well as on datasets with a limited amount of positive labeled data, both of which are representative of real-world situations.
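
The idea of learning from only positive and unlabeled data can be illustrated with a minimal sketch. The abstract does not spell out the objective, so everything below is an assumption: it takes the modified classifier to be Label Regularized Logistic Regression and uses one common formulation of label regularization, in which a logistic regression maximizes the likelihood of the known malware while a KL-divergence term keeps the average predicted malware rate on unlabeled apps close to an assumed prior. The function names, the toy two-dimensional features, the prior of 0.2, and all hyperparameters are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Estimated probability that feature vector x belongs to malware."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_pu_label_regularized(positives, unlabeled, prior,
                               lam_u=10.0, lam_l2=0.01, lr=0.1, epochs=500):
    """Gradient descent on an assumed label-regularized objective:
    -mean log p(malware | x) over positives
    + lam_u * KL(prior || mean prediction on unlabeled)
    + (lam_l2 / 2) * ||w||^2.
    No example is ever treated as a confirmed benign app."""
    dim = len(positives[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        # (1) Likelihood term on known malware: d(-log p)/dz = p - 1.
        for x in positives:
            err = predict(w, b, x) - 1.0
            gb += err / len(positives)
            for j in range(dim):
                gw[j] += err * x[j] / len(positives)
        # (2) Label regularization: pull the mean prediction q on the
        #     unlabeled set toward the assumed malware prior.
        probs = [predict(w, b, x) for x in unlabeled]
        q = min(max(sum(probs) / len(probs), 1e-6), 1.0 - 1e-6)
        dkl_dq = -prior / q + (1.0 - prior) / (1.0 - q)
        for x, p in zip(unlabeled, probs):
            coef = lam_u * dkl_dq * p * (1.0 - p) / len(unlabeled)
            gb += coef
            for j in range(dim):
                gw[j] += coef * x[j]
        # (3) L2 penalty on the weights, then one descent step.
        for j in range(dim):
            w[j] -= lr * (gw[j] + lam_l2 * w[j])
        b -= lr * gb
    return w, b


def make_toy_data(seed=0):
    """Hypothetical 2-D features: malware near (2, 2), benign near (-2, -2)."""
    rng = random.Random(seed)

    def cloud(center, n):
        return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
                for _ in range(n)]

    positives = cloud(2.0, 30)                    # apps labeled malware
    unlabeled = cloud(2.0, 40) + cloud(-2.0, 160)  # 20% hidden malware
    return positives, unlabeled


positives, unlabeled = make_toy_data()
w, b = train_pu_label_regularized(positives, unlabeled, prior=0.2)
print(predict(w, b, [2.0, 2.0]), predict(w, b, [-2.0, -2.0]))
```

In this sketch the unlabeled pool stands in for apps that carry a benign label: none of them is used as a negative example, and the only information the model receives about them is the assumed overall malware rate. In practice that prior would have to be estimated or tuned, which is itself an assumption of this illustration.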
