TY - JOUR
T1 - KronoDroid
T2 - Time-based hybrid-featured dataset for effective android malware detection and characterization
AU - Guerra-Manzanares, Alejandro
AU - Bahsi, Hayretdin
AU - Nõmm, Sven
N1 - Publisher Copyright: © 2021 Elsevier Ltd
PY - 2021/11
Y1 - 2021/11
N2 - Available data sets have neglected the evolution of Android malware, providing only a static snapshot of a non-stationary phenomenon. Android malware research has not given the time variable the attention it deserves, overlooking its degrading effect on the performance of machine learning-based classifiers (i.e., concept drift). The sources of dynamic data (i.e., real devices and emulators) and their particularities have also been overlooked. Both are critical factors to take into account when building more effective, robust, and long-lasting Android malware detection systems. In this research, different sources of benign and malware data are merged to generate a data set encompassing a larger time frame, and 489 static and dynamic features are collected. The particularities of the source of the dynamic features (i.e., system calls) are addressed by using both an emulator and a real device, thus generating two equally featured sub-datasets. The main outcome of this research is a novel, labeled, and hybrid-featured Android dataset that provides timestamps for each data sample, covers all years of Android history from 2008 to 2020, and accounts for the distinct dynamic data sources. The emulator data set is composed of 28,745 malicious apps from 209 malware families and 35,246 benign samples. The real device data set contains 41,382 malware samples belonging to 240 malware families and 36,755 benign apps. Made publicly available in a structured format as KronoDroid, it is the largest hybrid-featured Android dataset and the only one that provides timestamped data, considers the particularities of the dynamic data sources, and includes samples from over 209 Android malware families.
AB - Available data sets have neglected the evolution of Android malware, providing only a static snapshot of a non-stationary phenomenon. Android malware research has not given the time variable the attention it deserves, overlooking its degrading effect on the performance of machine learning-based classifiers (i.e., concept drift). The sources of dynamic data (i.e., real devices and emulators) and their particularities have also been overlooked. Both are critical factors to take into account when building more effective, robust, and long-lasting Android malware detection systems. In this research, different sources of benign and malware data are merged to generate a data set encompassing a larger time frame, and 489 static and dynamic features are collected. The particularities of the source of the dynamic features (i.e., system calls) are addressed by using both an emulator and a real device, thus generating two equally featured sub-datasets. The main outcome of this research is a novel, labeled, and hybrid-featured Android dataset that provides timestamps for each data sample, covers all years of Android history from 2008 to 2020, and accounts for the distinct dynamic data sources. The emulator data set is composed of 28,745 malicious apps from 209 malware families and 35,246 benign samples. The real device data set contains 41,382 malware samples belonging to 240 malware families and 36,755 benign apps. Made publicly available in a structured format as KronoDroid, it is the largest hybrid-featured Android dataset and the only one that provides timestamped data, considers the particularities of the dynamic data sources, and includes samples from over 209 Android malware families.
KW - Android malware
KW - Dataset
KW - Malware analysis
KW - Malware detection
KW - Mobile malware
UR - http://www.scopus.com/inward/record.url?scp=85111254062&partnerID=8YFLogxK
U2 - 10.1016/j.cose.2021.102399
DO - 10.1016/j.cose.2021.102399
M3 - Article
AN - SCOPUS:85111254062
SN - 0167-4048
VL - 110
JO - Computers & Security
JF - Computers & Security
M1 - 102399
ER -