The growth of the Internet and of network applications has driven the development of encrypted communication technology. Malicious traffic, however, also uses encryption to evade traditional security protection and detection, which cannot accurately identify encrypted malicious traffic. In recent years, advances in artificial intelligence have made it possible to detect encrypted malicious traffic with machine learning and deep learning methods, without decryption and with high accuracy. Current research on malicious encrypted traffic detection focuses mainly on analyzing the features of encrypted traffic and on selecting machine learning algorithms. This paper proposes a detection method that combines natural language processing with machine learning: a detection model built on TF-IDF. During data preprocessing, the method applies the TF-IDF model, a natural language processing technique, to extract information from the data, weight the importance of keywords, and then reconstruct the data features. The TF-IDF-based detection method does not need to analyze each field of the data set individually. Experimental results show that, compared with the usual machine learning preprocessing approach of encoding the data, preprocessing with natural language processing effectively improves detection accuracy. The models are then built with a gradient boosting classifier, a random forest classifier, an AdaBoost classifier, and an ensemble of these three classifiers. A convolutional neural network (CNN) is also trained, since a CNN can effectively extract information from the data. When the classifiers and the neural network are given the same input data, a comparison of the methods shows that the one-dimensional convolutional network achieves slightly higher accuracy than the machine learning classifiers.
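To make the TF-IDF preprocessing step concrete, the following is a minimal sketch of how flow records might be turned into TF-IDF feature vectors. It is illustrative only: the abstract does not state how records are tokenized or which TF-IDF variant is used, so the sample records, the token names, and the plain `log(N / df)` weighting are all assumptions.

```python
import math
from collections import Counter

# Hypothetical flow records (not from the paper): each record's field values
# are rendered as tokens, so one record plays the role of one "document".
records = [
    "tls1.2 port443 cipher_c02f selfsigned",
    "tls1.2 port443 cipher_c02f ca_signed",
    "tls1.0 port8443 cipher_0005 selfsigned",
]

def tf_idf(docs):
    """Return (vocabulary, matrix) of TF-IDF weights for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of records that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumed variant: idf = log(N / df). Libraries often use smoothed forms.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        # Term frequency is the token count normalized by record length.
        matrix.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, matrix

vocab, features = tf_idf(records)
```

Each row of `features` is a fixed-length numeric vector that could be passed to a classifier or a one-dimensional CNN. Tokens that appear in few records receive higher weights than tokens shared by most records, which is what lets the representation highlight distinctive values without per-field analysis.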
ASJC Scopus subject areas
- Information Systems
- Computer Networks and Communications