
"Descendants of Al-Farghani" electronic scientific journal of the Fergana branch of TATU named after Muhammad al-Khorazmi. ISSN 2181-4252 Vol: 1 | Iss: 2 | 2024

MACHINE LEARNING ALGORITHMS ANALYSIS FOR NETWORK TRAFFIC CLASSIFICATION

Tojieva Feruza,

Tashkent University of Information Technologies named after Muhammad ibn Musa al-Khwarizmi, PhD student

[email protected]

Khamdamov Utkir,

Tashkent University of Information Technologies named after Muhammad ibn Musa al-Khwarizmi, professor

[email protected]

Abstract. The rapid growth of Internet services has increased the demand for network traffic classification. Several methods of network traffic analysis are available today, one of which is machine learning, which can also be applied to encrypted traffic. This article analyzes four machine learning algorithms for classifying different types of Internet traffic and compares them in terms of classification accuracy, recall, precision and training time. The Bayes Network algorithm gave better performance in terms of classification accuracy and training time than the other machine learning algorithms.

Keywords: network classification, traffic, machine learning, algorithms.

Introduction. Today, the variety of services provided over the Internet creates new opportunities for businesses and organizations around the world to move to a new stage of development. The need for Internet services is increasing day by day. From exchanging information on social networks to searching for information on global networks, almost every activity depends on the Internet. Applications such as YouTube, Netflix and MegaUpload have completely changed the structure of the Internet. Accordingly, the number of Internet users is also rising sharply. According to ITU (International Telecommunication Union) statistics for 2023, the number of Internet users is 5.4 billion, which is 67% of the world's population. This figure increased by 4.7% compared to 2022, and by 3.5% from 2021 to 2022 [1].

The rapid increase in the number of Internet users makes it harder for network administrators and Internet service providers to manage and monitor the ever-growing volume of Internet traffic. At the same time, as the range of services offered over the Internet expands, the problems that arise in ensuring network security and effective network management can be addressed by classifying and filtering network traffic. Traffic classification and filtering is therefore a critical factor in improving network performance.

Analysis of prior work shows that many methods, tools and methodologies have been developed to improve network performance and to classify and filter the traffic passing through a network effectively. Based on that research, the main problems of network traffic classification and filtering are as follows.

First, the ever-increasing requirements for user data encryption and privacy have led to a sharp increase in the volume of encrypted traffic on the modern Internet. Encryption converts the original information into a seemingly arbitrary format in order to make unauthorized decryption difficult. As a result, encrypted information lacks the patterns normally used to identify network traffic. Accurate classification and filtering of encrypted traffic has therefore become a real challenge in modern networks.

Second, many proposed approaches to network traffic classification, such as machine learning, statistics-based methods and payload inspection, require experts to extract patterns or features manually. This process is error-prone and costs a great deal of time and money.

Finally, many ISPs block P2P file-sharing applications because of their high bandwidth consumption and copyright issues. To get around this, these applications use protocol injection and obfuscation to bypass traffic management systems. Identifying such applications is one of the most complex problems in network traffic classification and filtering.

Methods. The port-based technique was one of the first methods used to classify network traffic. Default port numbers are assigned to network protocols by the Internet Assigned Numbers Authority (IANA). The port-based approach identifies popular applications or IANA-registered protocols by their port numbers. In summary, the main advantages of the port-based approach are its simplicity, its low demand for computing resources, and its ability to classify and filter network traffic at high speed. Its disadvantage is low accuracy in classifying and filtering network traffic, a result of the widespread use of constantly changing dynamic ports.
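The port lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the table WELL_KNOWN_PORTS and the function classify_by_port are hypothetical names, and the table holds only a handful of registered ports mapped to the class labels used later in the article.

```python
# Hypothetical lookup table: a few IANA-registered ports chosen for illustration.
WELL_KNOWN_PORTS = {
    20: "FTP-DATA",     # FTP data channel
    21: "FTP-CONTROL",  # FTP control channel
    25: "MAIL",         # SMTP
    110: "MAIL",        # POP3
    80: "WWW",          # HTTP
    443: "WWW",         # HTTPS
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application class of a flow, or "UNKNOWN" if no port matches."""
    # The destination port is checked first, because a client usually connects
    # from an ephemeral source port to a registered server port.
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get(port)
        if label is not None:
            return label
    return "UNKNOWN"


print(classify_by_port(51234, 80))    # WWW
print(classify_by_port(51234, 6881))  # UNKNOWN: dynamic port, not in the table
```

The second call shows the weakness named in the text: an application that picks a dynamic port falls through the table and cannot be identified.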

The next technique is the payload-based technique, also called Deep Packet Inspection (DPI) [6]. In this technique, the contents of packets are examined for characteristic signatures of network applications. It was the first alternative to the port-based method and was proposed especially for peer-to-peer (P2P) applications. However, this technique has several problems. First, it needs very expensive hardware for pattern searching in the payload. Second, it does not work on encrypted application traffic. Finally, it requires the signature patterns to be continuously updated for new applications [2].
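A minimal sketch of signature matching on a packet payload is given below. The SIGNATURES list and the function classify_by_payload are hypothetical, and the byte prefixes are a small illustrative set for plaintext protocols; production DPI engines use far larger rule sets and faster multi-pattern matching.

```python
# Hypothetical signatures: byte prefixes of a few plaintext protocols.
SIGNATURES = [
    ("WWW", (b"GET ", b"POST ", b"HTTP/1.")),
    ("MAIL", (b"EHLO ", b"HELO ", b"MAIL FROM:")),
    # The BitTorrent handshake starts with the length byte 19 (0x13)
    # followed by the string "BitTorrent protocol".
    ("P2P", (b"\x13BitTorrent protocol",)),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the first class whose signature prefixes the payload."""
    for label, prefixes in SIGNATURES:
        # bytes.startswith accepts a tuple of alternative prefixes.
        if payload.startswith(prefixes):
            return label
    return "UNKNOWN"


print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))  # WWW
print(classify_by_payload(b"\x16\x03\x01\x02\x00"))          # UNKNOWN
```

The second payload is the start of a TLS record: once the application data is encrypted, no plaintext signature is visible, which is the second limitation listed above.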

Today, the growing volume of encrypted traffic on the Internet, the constant appearance of new applications and increasing data rates have changed the requirements for data processing. Machine learning (ML) methods have been applied to address these problems. ML uses different algorithms to analyze data, learn from it, and make decisions and predictions based on what has been learned. These algorithms can be divided into three categories: supervised, unsupervised and semi-supervised learning. This article discusses supervised ML algorithms.

Today, a large number of studies use supervised machine learning algorithms to classify network traffic. This article considers some of them.

The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. The algorithm treats the presence or absence of each feature as independent of the others, which enables it to perform well on a small number of instances [3].

A Bayesian network represents variables and their relationships as a probabilistic graphical model. The network consists of nodes, which are continuous or discrete variables, and edges, which represent the connections between these nodes [4].
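The independence assumption means that the score of a class is its prior probability multiplied by one conditional probability per feature. The sketch below shows this for discrete features with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not Weka's NaiveBayes: the function names and the toy flows are hypothetical, and the smoothing choice is an assumption.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy flows: (packet size bucket, duration bucket).
rows = [("small", "short"), ("small", "short"), ("large", "long"), ("large", "short")]
labels = ["MAIL", "MAIL", "FTP-DATA", "FTP-DATA"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("large", "long")))  # FTP-DATA
```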

Random Forest (RF) applies an ensemble approach to building the classification model. In contrast to a single decision tree, RF builds multiple classifiers for the classification problem and combines several weak individual classifiers into one strong classifier.
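The combination step of such an ensemble can be illustrated with majority voting. The sketch below does not train a Random Forest; it only shows how several weak predictors are merged into one decision. The three rules, their thresholds and the flow field names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical weak rules, each looking at a single flow feature.
def rule_port(flow):
    return "WWW" if flow["dst_port"] in (80, 443) else "P2P"


def rule_size(flow):
    return "WWW" if flow["mean_packet_size"] < 800 else "P2P"


def rule_duration(flow):
    return "WWW" if flow["duration_s"] < 30 else "P2P"


def ensemble_predict(flow, rules=(rule_port, rule_size, rule_duration)):
    """Ask every weak rule for a label and keep the most common answer."""
    return majority_vote([rule(flow) for rule in rules])


# The port rule alone would say P2P, but the other two rules outvote it.
flow = {"dst_port": 6881, "mean_packet_size": 400, "duration_s": 5}
print(ensemble_predict(flow))  # WWW
```

In an actual Random Forest the voters are decision trees, each grown on a bootstrap sample of the training data with a random subset of features considered at each split.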

C4.5 is a decision tree learning algorithm that builds a tree top-down by iteratively dividing the training dataset. A node in the tree denotes a feature, a branch denotes a possible value of that feature, and a leaf represents a class label [3].
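At each division, C4.5 selects the feature to split on by its gain ratio, that is, the information gain normalized by the split information. A minimal sketch for discrete features follows; the function names and toy data are illustrative, and the handling of continuous features and pruning in C4.5 is not shown.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the discrete feature `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["WWW", "WWW", "MAIL", "MAIL"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # feature separates the classes
print(gain_ratio(["a", "b", "a", "b"], labels))  # feature carries no information
```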

Results. This article uses real-time Internet traffic datasets taken from the Computer Laboratory of the University of Cambridge [7]. The dataset contains a variety of features for identifying flows. These include simple statistics about packet length and inter-packet timings, and information derived from the transport protocol, such as SYN and ACK counts. This information is provided for all packets (both directions) and for each direction individually (server → client and client → server).


Many packet statistics are obtained by counting packets and packet header sizes. A significant number of features are derived from the TCP headers.
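The kind of per-flow feature described here can be sketched as follows. This is a hedged illustration rather than the tool that produced the dataset: the packet tuple layout, the direction strings and the feature names are assumptions, and the packets are assumed to be sorted by timestamp.

```python
from statistics import mean, pvariance


def flow_features(packets):
    """Compute simple flow statistics.

    packets: list of (timestamp_s, direction, length_bytes, tcp_flags) tuples,
    sorted by timestamp; tcp_flags is a string such as "S", "SA" or "PA".
    """
    features = {}
    for name in ("all", "client->server", "server->client"):
        selected = [p for p in packets if name == "all" or p[1] == name]
        lengths = [p[2] for p in selected]
        times = [p[0] for p in selected]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[name] = {
            "packets": len(selected),
            "min_len": min(lengths, default=0),
            "max_len": max(lengths, default=0),
            "mean_len": mean(lengths) if lengths else 0.0,
            "var_len": pvariance(lengths) if lengths else 0.0,
            "mean_gap": mean(gaps) if gaps else 0.0,  # mean inter-packet time
            "syn_count": sum("S" in p[3] for p in selected),
            "ack_count": sum("A" in p[3] for p in selected),
        }
    return features


# Hypothetical five-packet flow: TCP handshake followed by a request and a reply.
packets = [
    (0.00, "client->server", 60, "S"),
    (0.01, "server->client", 60, "SA"),
    (0.02, "client->server", 52, "A"),
    (0.05, "client->server", 500, "PA"),
    (0.09, "server->client", 1500, "A"),
]
stats = flow_features(packets)
print(stats["all"]["syn_count"], stats["all"]["ack_count"])
```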

The main problem with Dataset 1, a high training (model building) time, is solved by reducing the number of features that characterize each Internet application. To build the reduced-feature datasets, the Correlation-based Feature Selection algorithm and the Consistency-based Feature Selection algorithm of the Weka tool were used. The resulting reduced-feature datasets are named Dataset2Correlation and Dataset2Consistency, respectively.

In all these datasets, twelve Internet application classes are taken into account: HTTP, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, INTERACTIVE and GAMES.

The datasets contain 24863 data samples. In the full-feature dataset, each application sample is characterized by 248 features, which mainly consist of minimum, maximum, mean, variance and total values of packets, average packets per second, packet size, duration, and conversations for the Ethernet and TCP protocols. From this dataset, reduced-feature datasets are also developed using CFS (Correlation-based Feature Selection) and CON (Consistency-based Feature Selection) [5].
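Correlation-based feature selection scores a candidate subset of k features with a merit that rewards correlation with the class and penalizes correlation among the features themselves: merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below evaluates only this formula; the correlation values are hypothetical inputs, and the correlation measure and subset search used by the Weka tool are not reproduced.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset under correlation-based feature selection.

    feature_class_corr: one correlation with the class per selected feature.
    feature_feature_corr: pairwise correlations among the selected features.
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    r_cf = sum(abs(r) for r in feature_class_corr) / k
    r_ff = (sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Two equally predictive features: uncorrelated versus fully redundant.
print(cfs_merit([0.8, 0.8], [0.0]))
print(cfs_merit([0.8, 0.8], [1.0]))
```

With fully redundant features the merit falls back to that of a single feature, which is why such a subset selection can shrink 248 attributes to a handful without discarding much predictive information.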

Experimentation of machine learning traffic classification. This article uses the Weka tool to implement IP traffic classification with four ML algorithms: C4.5, Bayes Net, Naive Bayes and Random Forest.

The algorithms used in this study are simple to implement and have either few or no parameters to be tuned. They also produce classification models that can be more easily interpreted.

Three Internet traffic datasets, namely Dataset 1, Dataset2Correlation and Dataset2Consistency, each consisting of 24863 data samples, are used in this article. Each dataset is divided into two sets: data samples for training and data samples for testing.

This article studies the following classification performance parameters: classification accuracy, recall, precision and training time.
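These parameters can be computed from a confusion matrix, as sketched below. The sketch assumes that the reported recall and precision are averages over classes weighted by class frequency; the article does not state how they are aggregated, so this is an assumption. The function name and the toy matrix are hypothetical.

```python
def summarize(confusion):
    """Accuracy and frequency-weighted precision and recall.

    confusion[actual][predicted] holds the number of test samples of class
    `actual` that the classifier labeled as `predicted`.
    """
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c].get(c, 0) for c in classes)
    weighted_precision = 0.0
    weighted_recall = 0.0
    for c in classes:
        tp = confusion[c].get(c, 0)
        actual = sum(confusion[c].values())
        predicted = sum(confusion[a].get(c, 0) for a in classes)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        weighted_precision += precision * actual / total
        weighted_recall += recall * actual / total
    return {
        "accuracy": correct / total,
        "precision": weighted_precision,
        "recall": weighted_recall,
    }


# Hypothetical two-class confusion matrix.
confusion = {
    "WWW": {"WWW": 8, "MAIL": 2},
    "MAIL": {"WWW": 1, "MAIL": 9},
}
result = summarize(confusion)
print(result)
```

Under this weighting the recall always equals the accuracy expressed as a fraction, which is consistent with the Recall and Accuracy rows of Table 1.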

ML Classifiers | Bayes Net | C4.5 | Random Forest | Naïve Bayes
Accuracy (%) | 94.1359 | 99.6742 | 98.8779 | 57.9053
Training Time (Seconds) | 7.37 | 11.13 | 2.42 | 0.55
Recall | 0.941 | 0.997 | 0.989 | 0.579
Precision | 0.985 | 0.997 | 0.989 | 0.957

Table 1. Classification accuracy, training time, recall and precision of four ML classifiers for Dataset 1

Figure 1. Classification accuracy of the ML classifiers for Dataset 1 (legend: BayesNet, C4.5, Random Tree, NaiveBayes)

First, we used the full-feature Dataset 1, which consists of 248 attributes. Then we reduced the attributes using the CFS and CON feature reduction algorithms. The function of a feature reduction algorithm is to select the most relevant attributes in the dataset. Using the datasets produced by CFS and CON, we implemented IP traffic classification with the different ML algorithms.

The detailed accuracy by class for each algorithm is given below.


TP Rate: 0.995, 0.953, 0.546, 0.651, 0.582, 0.867, 0.575, 0.556, 0.755, 0.995, 0.000
FP Rate: 0.014, 0.001, 0.000, 0.000, 0.002, 0.001, 0.001, 0.000, 0.001, 0.000, 0.000, 0.000
Precision: 0.995, 0.553, 0.522, 0.757, 0.559, 0.899, 0.535, 0.555, 0.325, 0.986, 0.000
Recall: 0.995, 0.553, 0.546, 0.651, 0.582, 0.867, 0.575, 0.556, 0.755, 0.995, 0.000
F-Measure: 0.995, 0.553, 0.534, 0.700, 0.570, 0.883, 0.557, 0.555, 0.750, 0.990, 0.000
MCC: 0.981, 0.551, 0.533, 0.702, 0.568, 0.881, 0.556, 0.555, 0.750, 0.990, -0.000
ROC Area: 0.990, 0.556, 0.573, 0.825, 0.789, 0.936, 0.587, 0.553, 0.863, 0.998, 0.500
PRC Area: 0.993, 0.587, 0.872, 0.453, 0.333, 0.782, 0.516, 0.551, 0.640, 0.981, 0.000
Class: WWW, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, INTERACTIVE, GAMES

Figure 2. Detailed accuracy by class of Dataset 1 for the Random Tree algorithm

TP Rate: 0.933, 0.576, 0.580, 0.977, 0.803, 0.770, 0.583, 0.586, 0.977, 0.550, 0.667
FP Rate: 0.005, 0.001, 0.003, 0.008, 0.034, 0.003, 0.000, 0.000, 0.005, 0.000, 0.003, 0.000
Precision: 0.998, 0.557, 0.645, 0.176, 0.103, 0.801, 0.583, 0.555, 0.405, 0.545, 0.024
Recall: 0.933, 0.576, 0.580, 0.977, 0.803, 0.770, 0.583, 0.586, 0.977, 0.550, 0.667
F-Measure: 0.964, 0.586, 0.781, 0.299, 0.183, 0.785, 0.583, 0.553, 0.572, 0.565, 0.047
MCC: 0.884, 0.534, 0.756, 0.413, 0.280, 0.782, 0.583, 0.552, 0.627, 0.565, 0.127
ROC Area: 0.998, 0.555, 0.555, 0.997, 0.543, 0.986, 1.000, 1.000, 0.998, 1.000, 0.965
PRC Area: 0.999, 0.557, 0.505, 0.260, 0.183, 0.784, 0.555, 0.555, 0.570, 0.566, 0.056
Class: WWW, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, INTERACTIVE, GAMES

Figure 3. Detailed accuracy by class of Dataset 1 for the Naive Bayes algorithm

TP Rate: 0.533, 0.576, 0.580, 0.577, 0.803, 0.770, 0.583, 0.586, 0.577, 0.550, 0.667
FP Rate: 0.005, 0.001, 0.003, 0.008, 0.034, 0.003, 0.000, 0.000, 0.005, 0.000, 0.003, 0.000
Precision: 0.553, 0.557, 0.645, 0.176, 0.103, 0.801, 0.583, 0.555, 0.405, 0.545, 0.024
Recall: 0.533, 0.576, 0.580, 0.577, 0.803, 0.770, 0.583, 0.586, 0.577, 0.550, 0.667
F-Measure: 0.564, 0.536, 0.781, 0.255, 0.183, 0.785, 0.583, 0.553, 0.572, 0.565, 0.047
MCC: 0.334, 0.534, 0.756, 0.413, 0.230, 0.782, 0.533, 0.552, 0.627, 0.565, 0.127
ROC Area: 0.558, 0.555, 0.555, 0.557, 0.548, 0.586, 1.000, 1.000, 0.558, 1.000, 0.565
PRC Area: 0.555, 0.557, 0.505, 0.260, 0.183, 0.784, 0.555, 0.555, 0.570, 0.566, 0.056
Class: WWW, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, INTERACTIVE, GAMES

Figure 4. Detailed accuracy by class of Dataset 1 for the Bayes Net algorithm
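The per-class columns in Figures 2 to 4 (TP Rate, FP Rate, Precision, Recall, F-Measure and MCC) follow from the one-vs-rest counts of true and false positives and negatives for each class. The sketch below uses the standard definitions of these measures; the function name and the example counts are hypothetical and are not taken from the article's data.

```python
import math


def per_class_metrics(tp, fp, fn, tn):
    """One-vs-rest metrics for a single class from its four outcome counts."""
    tp_rate = tp / (tp + fn) if tp + fn else 0.0  # identical to recall
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if precision + tp_rate else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical class with 100 test samples among 1000 flows.
metrics = per_class_metrics(tp=90, fp=10, fn=10, tn=890)
print(metrics)
```

Because TP Rate and Recall are the same quantity, the two columns should agree row by row in each figure.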

№ Feature Name

1. The total number of ack packets seen carrying TCP SACK blocks. (server→client)

2. The count of all the packets seen with the PUSH bit set in the TCP header. (server→client)

3. The Maximum Segment Size (MSS) requested as a TCP option in the SYN packet opening the connection. (client→server)

4. The total number of bytes sent in the initial window, i.e., the number of bytes seen in the initial flight of data before receiving the first ack packet from the other endpoint. Note that the ack packet from the other endpoint is the first ack acknowledging some data, and any retransmitted packets in this stage are excluded. (server→client)

5. The missed data, calculated as the difference between the ttl stream length and unique bytes sent. If the connection was not complete, this calculation is invalid and an "NA" (Not Available) is printed. (client→server)

6. The maximum number of retransmissions seen for any segment during the lifetime of the connection. (client→server)

7. Minimum of control bytes in packet

Table 2. List of CFS features

The classification accuracy of Bayes Net improves with the CFS reduced-feature dataset compared to the full-feature dataset. The training time for Bayes Net also falls from 12.91 to only 0.23 seconds. The accuracy of the C4.5 classifier is slightly reduced, and its training time is more than twice that of Bayes Net. It is therefore clear that the performance of Bayes Net is improved by reducing the features with the CFS algorithm. The performance of Naïve Bayes also improves with the CFS algorithm.

Next, the original feature set is subjected to the CON feature reduction method, and 7 features are obtained, as listed in Table 3.

Classification accuracy, training time, recall and precision values were also obtained for the machine learning classifiers with the CON feature reduction algorithm. In terms of classification accuracy, CON performs no better than either CFS or the full-feature dataset. Training time also increases compared to CFS. The Naïve Bayes algorithm gives a very poor classification accuracy of 11.2858%.

From this analysis, it is evident that Bayes Net is a very good classifier for various Internet applications. The Correlation-based Feature Selection algorithm also gives higher classification accuracy and lower training time than both the full-feature dataset and the Consistency-based Feature Selection algorithm. Naïve Bayes has the shortest training time, but its classification accuracy is very poor with the full-feature dataset and with the Consistency-based Feature Selection algorithm.


№ Feature Name

1. First quartile inter-arrival time

2. Variance in packet inter-arrival time

3. Mean of total bytes in IP packet

4. The total number of packets seen. (server→client)

5. The minimum window advertisement seen. This is the minimum window-scaled advertisement seen if both sides negotiated window scaling. (client→server)

6. The minimum RTT sample seen. (client→server)

7. The average value of RTT found, calculated straightforwardly as the sum of all the RTT values found divided by the total number of RTT samples. (client→server)

Table 3. List of CON features

Conclusions. Traffic management will remain an important problem. The Internet will support an ever greater variety of applications, and different applications require different treatment by the network. The intelligence needed to recognize different kinds of traffic and their requirements implies a need for sophisticated traffic management.

In this work, the supervised machine learning traffic classification algorithms Bayes Network, Naive Bayes, Random Forest and C4.5 were analyzed in the Weka tool. The dataset used consists of Internet traffic data samples for various Internet applications (HTTP, MAIL, FTP-CONTROL, FTP-PASV, ATTACK, P2P, DATABASE, FTP-DATA, MULTIMEDIA, SERVICES, INTERACTIVE and GAMES). We analyzed the following classification performance parameters: classification accuracy, recall, precision and training time. The Bayes Network algorithm gave better performance in terms of classification accuracy and training time than the other machine learning algorithms. Good performance results were obtained with the Correlation-based Feature Selection algorithm. Bayes Net is an efficient machine learning technique for the classification of Internet traffic.

References

1. A. Madhukar, C. Williamson, "A longitudinal study of P2P traffic classification," in: 14th IEEE International Symposium on Modeling, and Simulation, 2006, pp. 179-188. doi: 10.1109/MASCOTS.2006.

2. Nahlah Abdulrahman Alkhalidi, Fouad A. Yaseen, "FDPHI: Fast Deep Packet Header Inspection for Data Traffic Classification and Management," International Journal of Intelligent Engineering & Systems, 2021.

3. Muhammad Shafiq, Xiangzhan Yu, Asif Ali Laghari, "Network Traffic Classification techniques and comparative analysis using Machine Learning algorithms," in: 2016 2nd IEEE International Conference on Computer and Communications (ICCC). doi: 10.1109/CompComm.2016.7925139.

4. Ahmad Azab, Mahmoud Khasavneh, Saed Alrabaee, Kim-Kwang Raymond Choo, Maysa Sarsour, "Network traffic classification: Techniques, datasets, and challenges," Digital Communications and Networks (2022). doi: https://doi.org/10.1016/j.dcan.2022.09.009.

5. Muhammad Sameer Sheikh, Yinqiao Peng, "Procedures, Criteria, and Machine Learning Techniques for Network Traffic Classification: A Survey," IEEE Access, 2022, pp. 61135-61158.

6. P. Khandait, N. Hubbali, B. Mazumdar, "Efficient keyword matching for deep packet inspection based network traffic classification," in: 2020 International Conference on Communication Systems & Networks (COMSNETS), IEEE, 2020, pp. 567-570.

7. https://www.cl.cam.ac.uk/research/srg/netos/projects/archive/nprobe/data/papers/index.html

8. N. Hubbali, M. Swarnkar, M. Conti, "Bitprob: probabilistic bit signatures for accurate application identification," IEEE Transactions on Network and Service Management 17 (3) (2020).
