Abstract: With the tremendous growth of network-based services and sensitive information on networks, network security is getting more and more importance than ever. Intrusion poses a serious security risk in a network environment. The ever growing new intrusion types have a serious problem for their detection. In this paper, we apply one of the efficient data mining algorithms called Very Fast Decision Tree (VFDT) for anomaly based network intrusion detection. Experimental results on the Knowledge Discovery Data Mining KDDcup’99 data set show that our approach is highly effective in detecting network intrusion. It is observed that the proposed technique performs better in terms of detection rate, false alarm rate, and computational time compared with other techniques.
Keywords: Very Fast Decision Tree Algorithm, Classifier, Network Security, Intrusion Detection System, KDD CUP99 dataset.
In our society, information is becoming increasingly dependent on rapid access and interactive processing. As this demand increases, more information is being stored on computers. The proliferation of inexpensive computers and computer networks has worsened the problem of unauthorized access and tampering with data. With an increased understanding of how systems work, intruders have become highly skilled at diagnosing weaknesses in systems, and exploiting them to obtain such privileges that they can do anything on the system. They also use patterns that are difficult to trace and identify. Computer systems are therefore not likely to remain safe in the nearest future because of the intruder’s acts. Therefore, we must have measures in place to detect security breaches, i.e., identify intruders and their methods of intrusion .
An Intrusion Detection System (IDS) inspects the activities in a system for suspicious behavior or patterns that may indicate system attack or misuse. There are two main categories of intrusion detection techniques; Anomaly detection and Misuse detection . The former analyses the information gathered and compares it to a defined baseline of what is seen as “normal” service behavior, so it has the ability to learn how to detect network attacks that are currently unknown. Misuse Detection is based on signatures for known attacks, so it is only as good as the database of
attack signatures that it uses for comparison. Misuse detection has low false alarm rate, but cannot detect novel attacks. However, anomaly detection can detect unknown attacks, but has a bit higher false alarm rate. In this paper, we review the performance of classifiers when trained to identify signatures of specific attacks.
The simulated attacks are classified as follows :
Denial of Service Attack (DoS): this is an attack which makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
Probing Attack (prob): this is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
User to Root Attack (U2R): this is a class of exploit in which the attacker starts out with the access to a normal user’s account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
Remote to Local Attack (R2L): it occurs when an attacker has the ability to send packets to a machine over a network, but does not have an account on that machine to exploit some vulnerability to gain local access as a user of that machine.
In , Audit Data Analysis and Mining (ADAM) is an intrusion detector system built to detect intrusions using data mining techniques. It first absorbs training data known to be free of attacks. Next, it uses an algorithm to group attacks, unknown behavior, and false alarms. ADAM has several useful capabilities, namely;
- Classifying an item as a known attack
- Classifying an item as a normal event,
- Classifying an item as an unknown attack,
The Intrusion Detection using Data Mining Technique (IDDM)  is a real-time network IDS (NIDS) for misuse and anomaly detection. It applies association rules, Meta rules, and characteristic rules. It employs data mining to produce description of network data and uses this information for deviation analysis. Mining Audit Data for Automated Models for Intrusion Detection (MADAM ID)  is one of the best known data mining projects in intrusion detection. It is an off-line IDS to produce anomaly and misuse intrusion detection models. Association rules and frequent episodes are applied in MADAM ID to replace hand-coded intrusion patterns and profiles with the learned rules.
In , the author proposes a method of intrusion detection using an evolving fuzzy neural network. This type of learning algorithm combines Artificial Neural Network (ANN) and Fuzzy Inference Systems (FIS), as well as evolutionary algorithms. They create an algorithm that uses fuzzy rules and allow new neurons to be created in order to accomplish this. They use Snort to gather data for training the algorithm and then compare their technique with that of an augmented neural network.
In , a statistical neural network classifier for anomaly detection is developed, which can identify UDP flood attacks. Comparing different neural network classifiers, the Back Propagation Neural Network (BPN) has shown to be more efficient in developing IDS. In , the author uses the back propagation method by Sample Query and Attribute Query for the Intrusion Detection, whereby analyzing and identifying the most important components of training data. It could reduce processing time, storage requirement, etc.
In , the authors combine multiple host-based detectors using decision tree. They use conventional measures for intrusion detection and modeling methods appropriate to each measure. System calls, resource usage and file access events are used to measure user’s behavior and Hidden Markov model. Statistical Rule-base method is used to model these measures which are combined with diction tree. The proposed detection method is expected to have a better performance because it can model normal behaviors from various perspectives. The decision tree used here C4.5 algorithm.
This paper gives a comparative study of several anomaly detection schemes for identifying novel network intrusion detections. We present experimental results on Knowledge Discovery Data Mining KDDCup’99 data set. The experimental results have demonstrated that using the VFDT classifier model is much more efficient in the detection of network intrusions, compared with other classification techniques.
The remainder of this paper is organized as follow. Section II presents the proposed Network-based Anomaly Detection System (NADS), and also gives general information about IDS. Section III presents the experimental results. Finally, section IV concludes the paper.
II. THE PROPOSED NADS SYSTEM
The proposed NADS system has the possibility of working in either online or offline modes. This depends on the type of features used to train and test the system. The online mode works on the basic features of an extraction using the header of packets only. (e.g. protocol type, service, flag, src_bytes, dst_bytes). These features are organized as connection records to enter into the model directly to determine the nature of the connection as soon as possible, and give the appropriate response. The offline mode works on the basic statistical features extracted from the header and the content of the packets. The statistical features are determined when the session is terminated, and they are organized as connection records. A large number of packets are needed to identify accurate connections, and then store them in the file in the form of connection records in order to be entered into the classified model to test their classes as normal. This mode gives accurate results, but at the expense of time. Both modes requires adequate connection information to train the proposed model. Therefore the system working update gets information at any time to obtain a sufficient number of connections to build a decision tree for a classification of attacks. VFDT is a high-performance data mining algorithm based on Hoeffding trees. Many classification learning methods have been proposed, of which the decision tree learning method is commonly used, because it is fast and the description of classifiers that it derives is easily understood. As data arrives, this data stream grows gradually while the data is classified . VFDT allows the use of either information gain or the Gini index as attribute evaluation measure. It includes a number of refinements to the algorithm . VFDT does not accumulate examples in main memory, because it can gradually grow without waiting for the arrival of all the examples .
The architecture of the Network-based Anomaly Detection System (NADS) is shown in the Figure 1. Although different NADS approaches exist, but all of them has these basic modules or stages :
Parameterization: The representation of the observed instances in the target system as a pre-established from.
Training stage: The process of educating the system and grouping the connections as normal or abnormal, and the building of the corresponding model.
Detection stage: A comparison between the models for the system built with the parameterized or observed traffic. If any deviation found exceeds a given threshold, an alarm will be announced.
Fig.1. Network-based NADS Anomaly Detection System
III. EXPERIMENTS AND RESULTS
In this section, we summarize our experimental results in detecting network intrusion using the VFDT algorithm over KDDCup’99 data set. We first describe the data set used in the experiments and then discuss the results obtained. Finally, we evaluate our approach and compare the results with the results obtained by other researchers.
We carried out four experiments on KDD CUP99 dataset in order to get the best results. The conditions of these experiments are as follows:
Experiment 1: We used the whole KDD CUP99 dataset for training, but we selected 20 features only.
Experiment 2: We used the whole KDD CUP99 dataset for training, but we choose 15 features.
Experiment 3: We used 10% of the KDD CUP99 dataset for training, and we selected 15 or 20 features.
Experiment 4: We used the whole KDD CUP99 dataset for training, but we choose only the basic features.
The Confusion Matrix (CM) shown in Table 1 is used to measure of performances of IDS systems and it has the following entries:
True Positive (TP): Number of connections that were correctly classified as attacks.
True Negative (TN): Number of connections that were correctly classified as normal.
False Positive (FP): Number of normal connections that were classified as attacks.
False Negative (FN): Number of attack connections that were classified as normal.
Table 1: Confusion Matrix (CM) .
The performance metrics used are: the detection rate (DR), the false alarm rate (FAR), the total training time (TT), and the true positive percentage (TP%).
The DR is the percentage of the truly (correctly) classified attacks as shown in equation (1). The FAR is the percentage of wrongly classified normal connections as shown in equation (2). The TT is the total time used for training of the data set.
For the attacks DoS, probe, U2R, and R2L, the True Positive percentage (TP %) is defined as the number of the related attack TP’s over the total number of TP's.
The results of these experiments can be seen in Table 2.
The first experimental result indicated that the proposed NADS system achieved a DR percent of 93.83 % for attacks by using the VFDT algorithm, whereas the achieved FAR rate is 0.608%. In addition, the TP% for DoS, Probe, U2R, and R2L is 93.10%, 79.04%, 22.84%, and 36.64% respectively.
The result of the second experiment is close to the first experiment, although the detection rate for U2R becomes zero.
The third experimental result indicated that the proposed NADS system achieved a DR percent of 95.53 % and a FAR percent of 0.993%. In addition, the TP% for DoS, Probe, U2R, and R2L is 96.4%, 24.2%, 75.6%, and 81.6% respectively.
The fourth experimental result indicated that the proposed NADS system achieved a DR percent of 90.34 % and a FAR rate of 0.78%. In addition, the TP% for DoS, Probe, U2R, and R2L is 92.67%, 75.4%, 0%, and 31.32% respectively.
Table 2: Performance metrics for the four experiments
The following parameters are discussed to clarify our results precisely in terms of system accuracy, classification speed, and memory allocation.
A. System Accuracy Results Receiver Operating Characteristic (ROC) which is a procedure, derived from statistical decision theory that was developed in the context of electronic signal detection. It is used to evaluate the predictive ability of classifiers. ROCs are plotted in coordinates which are spanned by the rates of DR, and FAR classification.
The proposed NADS system has a high accuracy performance when compared with other systems. The ROC curve of the system is shown in Figure 2. The goal is to detect many attacks while minimizing the generation of FAR. It is clear from Figure 2 that our system is able to detect most of the attacks with a DR rate of 93.83%, and at a low FAR rate of 0.608%.
Fig.2.The ROC curve for the NADS system
B. Classification Speed Results Figure 3 shows the training time versus number of connections. The NADS system uses 100,000 connections at time 4.09s, whereas it uses1000, 000 connections at time 38.97s.
C. Memory Allocation Results Figure 4 shows the number of nodes which are created by using the VFDT algorithm. The unexpected increasing at the beginning of the chart represents the beginning of the construction of a tree. The sufficient number of connections is collected to be able to classify all types of attacks in the same tree. It continues in increasing gradually, while the number of nodes is almost stable.
Fig.3. Training time versus number of connections
Fig.4. Number of created nodes versus number of connections
Finally, table 2. Shows the comparison of the proposed system with other systems in terms of DR and TT. The results indicated that the proposed NADS system achieved a high classification accuracy rate of 93% by using the VFDT algorithm. It is the highest performance compared with all other algorithm except the Genetic Programming (GP) algorithm where it has a higher DR rate of 98%. The speed of training (building and testing) didn't exceed 39.88 seconds for the proposed system, whereas it is in terms of hours for others systems.
In this paper, we have proposed a Network-based Anomaly Detection System (NADS) which helps to take insidious attacks under control. This technique works in two modes, on-line and offline modes. The proposed NADS system uses the Very Fast Decision Tree algorithm (VFDT), and it classifies the connections as normal or abnormal. It is worth mentioning that the proposed NADS system outperforms other schemes in terms of training time per example since it has the lowest time. The proposed system was able to detect most of the attacks with a DR rate of 93.83%, and at a low FAR rate of 0.608%.
As a future work, this study can be extended in different aspects such as:
The proposed NADS system used the VFDT algorithm for detection. Hence, it can be analyzed using other detection algorithms as well.
As the proposed NADS system uses “anomaly detection technique”, it can be modified to use the misuse technique as well.
 Sandeep k., “classification and detection of computer intrusions”, PhD Thesis, Purdue University, August 1995.
 Ali A., Wei L and Mahbod T., “Network Intrusion detection and prevention concepts and Techniques”, Springer Media, New York LLC, 2010.
 Terry E., “intrusion detection: network security Beyond the Firewall,” John Wiley and sons, New York, USA, 1998.
 Andrew G. Blank, “TCP/IP Jumpstart: internet protocol Basics”,2nd Edition, john wiley and Sons, 2002.
 Hamdan o. Alanazi, Rafidah Md. Noor, B, b. Zaidan and A. A. Zaidan,”intrusion detection system: Overview”, Journal of computing, Vol. 2, Issue 2, pp.32-48, February 2010.
 Tasuya M. Ayahiko N., “ detection of Fraud use of credit card by Extended VFDT ,” IEEE Transactions on knowledge and data Engineering(TKDE), Vol. 21, No. 11, pp. 1505-1514, 2011.
 Zhi-Hua Z., Hang L. and Qiang Y., “Advances is Knowledge Discovery and Data Mining,” 11th Pacific-Asia Conference, PAKDD, Springer, China, Vol. 4426, 2007.
 Wenke L., Salvatore J. Stolfo and Kui W. Mok, “A Data Mining Framework for Building intrusion detection Models,“ IEEE Symposium on Security and Privacy, pp. 120-132, 1999.
 Pang-Ning T., Michael S. and Vipin K., “introduction to data mining”, 2nd Edition, Addison Wesley, March 2006.
 Sang-Jun H., sung-Bae C., “combining Multiple Host-Based Detectors using decision tree,” 16th Australian conference on Artificial intelligence, Springer, Vol. 2903, pp.208-220, 2003.
Muhammed SALAMAH received the B.S., M.S., and PhD. degrees in Electrical and Electronics Engineering from Middle East Technical University in 1988, 1990, and 1995, respectively. From 1988 to 1996, he worked as a Research Assistant and Lecturer in METU. In 1995, he worked as a visiting researcher in the Networks Department of Ecole Nationale Superieure des Telecommunications in Paris, France. Since 1996, he has been with Department of Computer Engineering at Eastern Mediterranean University, TRNC, Turkey, where he is an Assoc. Professor. He served as the Assistant Chairman of the Computer Engineering Department of EMU from 1997 to 2004; the chair of the Student Project and Design Center in EMU from 2005 to 2008; Rectorate Coordinator for Student Affairs and Informatics from 2008 to 2010; head of the Computer Engineering Department of EMU from 2010 to 2013, and since 2013 he has been serving as the Rector's Advisor for Middle East in EMU. His current research interests are computer networks and wireless networks with emphasis on resource allocation and management, performance evaluation, multiprocessing, and hardware-oriented algorithms.
Naseer Alwan Hussein received the MS degree in Computer Engineering from Eastern Mediterranean University, TRNC, Turkey, in 2014. His current research interests are in wireless. Communications, molecular communications, bio inspired communications, and nano networks.