Intrusion Detection Systems

Abstract

Cyber-attacks are becoming more sophisticated and thereby presenting increasing challenges in accurately detecting intrusions. Failure to prevent the intrusions could degrade the credibility of security services, e.g. data confidentiality, integrity, and availability. Numerous intrusion detection methods have been proposed in the literature to tackle computer security threats, which can be broadly classified into Signature-based Intrusion Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS). This survey paper presents a taxonomy of contemporary IDS, a comprehensive review of notable recent works, and an overview of the datasets commonly used for evaluation purposes. It also presents evasion techniques used by attackers to avoid detection and discusses future research challenges to counter such techniques so as to make computer systems more secure.

Source: Ansam Khraisat, Iqbal Gondal, Peter Vamplew, and Joarder Kamruzzaman, https://link.springer.com/article/10.1186/s42400-019-0038-7
This work is licensed under a Creative Commons Attribution 4.0 License.

Introduction

The evolution of malicious software (malware) poses a critical challenge to the design of intrusion detection systems (IDS). Malicious attacks have become more sophisticated and the foremost challenge is to identify unknown and obfuscated malware, as the malware authors use different evasion techniques for information concealing to prevent detection by an IDS. In addition, there has been an increase in security threats such as zero-day attacks designed to target internet users. Therefore, computer security has become essential as the use of information technology has become part of our daily lives. As a result, various countries such as Australia and the US have been significantly impacted by the zero-day attacks. According to the 2017 Symantec Internet Security Threat Report, more than three billion zero-day attacks were reported in 2016, and the volume and intensity of the zero-day attacks were substantially greater than previously (Symantec, 2017). As highlighted in the Data Breach Statistics in 2017, approximately nine billion data records were lost or stolen by hackers since 2013 (Breach_LeveL_Index, 2017). A Symantec report found that the number of security breach incidents is on the rise. In the past, cybercriminals primarily focused on bank customers, robbing bank accounts, or stealing credit cards (Symantec, 2017). However, the new generation of malware has become more ambitious and is targeting the banks themselves, sometimes trying to take millions of dollars in one attack (Symantec, 2017). For that reason, the detection of zero-day attacks has become the highest priority.

High profile incidents of cybercrime have demonstrated the ease with which cyber threats can spread internationally, as a simple compromise can disrupt a business' essential services or facilities. There are a large number of cybercriminals around the world motivated to steal information, illegitimately receive revenues, and find new targets. Malware is intentionally created to compromise computer systems and take advantage of any weakness in intrusion detection systems. In 2017, the Australian Cyber Security Centre (ACSC) critically examined the different levels of sophistication employed by the attackers (Australian, 2017). So there is a need to develop an efficient IDS to detect novel, sophisticated malware. The aim of an IDS is to identify different kinds of malware as early as possible, which cannot be achieved by a traditional firewall. With the increasing volume of computer malware, the development of improved IDSs has become extremely important.

In the last few decades, machine learning has been used to improve intrusion detection, and currently, there is a need for an up-to-date, thorough taxonomy and survey of this recent work. There are a large number of related studies using either the KDD-Cup 99 or DARPA 1999 dataset to validate the development of IDSs; however, there is no clear answer to the question of which data mining techniques are more effective. Secondly, the time taken for building IDS is not considered in the evaluation of some IDSs techniques, despite being a critical factor for the effectiveness of ‘on-line' IDSs.

This paper provides an up to date taxonomy, together with a review of the significant research works on IDSs up to the present time; and a classification of the proposed systems according to the taxonomy. It provides a structured and comprehensive overview of the existing IDSs so that a researcher can become quickly familiar with the key aspects of anomaly detection. This paper also provides a survey of data-mining techniques applied to design intrusion detection systems. The signature-based and anomaly-based methods (i.e., SIDS and AIDS) are described, along with several techniques used in each method. The complexity of different AIDS methods and their evaluation techniques are discussed, followed by a set of suggestions identifying the best methods, depending on the nature of the intrusion. Challenges for the current IDSs are also discussed. Compared to previous survey publications (Patel et al., 2013; Liao et al., 2013a), this paper presents a discussion on IDS dataset problems which are of main concern to the research community in the area of network intrusion detection systems (NIDS). Prior studies such as (Sadotra & Sharma, 2016; Buczak & Guven, 2016) have not completely reviewed IDSs in terms of the datasets, challenges, and techniques. In this paper, we provide a structured and contemporary, wide-ranging study on intrusion detection systems in terms of techniques and datasets; and also highlight challenges of the techniques, and then make recommendations.

During the last few years, a number of surveys on intrusion detection have been published. Table 1 shows the IDS techniques and datasets covered by this survey and previous survey papers. The survey on intrusion detection systems and taxonomy by Axelsson (Axelsson, 2000) classified intrusion detection systems based on the detection methods. The highly cited survey by Debar et al. (Debar et al., 2000) surveyed detection methods based on the behaviour and knowledge profiles of the attacks. A taxonomy of intrusion systems by Liao et al. (Liao et al., 2013a), has presented a classification of five subclasses with an in-depth perspective on their characteristics: Statistics-based, Pattern-based, Rule-based, State-based, and Heuristic-based. On the other hand, our work focuses on the signature detection principle, anomaly detection, taxonomy, and datasets.

Table 1 Comparison of this survey and similar surveys:

(✔: Topic is covered, ✖ the topic is not covered)

Survey	# of citation (as of 6/1/2019)	Intrusion Detection System Techniques						Dataset issue
		SIDS	AIDS				Hybrid IDS
		SIDS	Supervised learning	Unsupervised	Semi-supervised learning	Ensemble methods	Hybrid IDS
Lunt (1988)	219	✔	✖	✖	✖	✖	✖	✖
Axelsson (2000)	1039	✔	✔	✖	✖	✖	✖	✖
Liao, et al. (2013b)	505	✔	✔	✔	✖	✖	✔	✖
Agrawal and Agrawal (2015)	108	✔	✔	✔	✔	✔	✔	✖
Buczak and Guven (2016)	338	✔	✔	✔	✖	✔	✔	✔
Ahmed, et al. (2016)	181	✖	✔	✔	✖	✖	✖	✔
This survey		✔	✔	✔	✔	✔	✔	✔

Existing review articles (e.g., such as (Buczak & Guven, 2016; Axelsson, 2000; Ahmed et al., 2016; Lunt, 1988; Agrawal & Agrawal, 2015)) focus on intrusion detection techniques or dataset issue or type of computer attack and IDS evasion. No articles comprehensively reviewed intrusion detection, dataset problems, evasion techniques, and different kinds of attack altogether. In addition, the development of intrusion-detection systems has been such that several different systems have been proposed in the meantime, and so there is a need for an up-to-date. The updated survey of the taxonomy of intrusion-detection discipline is presented in this paper further enhances taxonomies given in (Liao et al., 2013a; Ahmed et al., 2016).

In view of the discussion on prior surveys, this article focuses on the following:

Classifying various kinds of IDS with the major types of attacks based on intrusion methods.
Presenting a classification of network anomaly IDS evaluation metrics and discussion on the importance of the feature selection.
Evaluation of available IDS datasets discussing the challenges of evasion techniques.

Intrusion detection systems

Intrusion can be defined as any kind of unauthorised activities that cause damage to an information system. This means any attack that could pose a possible threat to the information confidentiality, integrity or availability will be considered an intrusion. For example, activities that would make the computer services unresponsive to legitimate users are considered an intrusion. An IDS is a software or hardware system that identifies malicious actions on computer systems in order to allow for system security to be maintained (Liao et al., 2013a). The goal of an IDS is to identify different kinds of malicious network traffic and computer usage, which cannot be identified by a traditional firewall. This is vital to achieving high protection against actions that compromise the availability, integrity, or confidentiality of computer systems. IDS systems can be broadly categorized into two groups: Signature-based Intrusion Detection System (SIDS) and Anomaly-based Intrusion Detection System (AIDS).

Signature-based intrusion detection systems (SIDS)

Signature intrusion detection systems (SIDS) are based on pattern matching techniques to find a known attack; these are also known as Knowledge-based Detection or Misuse Detection (Khraisat et al., 2018). In SIDS, matching methods are used to find a previous intrusion. In other words, when an intrusion signature matches with the signature of a previous intrusion that already exists in the signature database, an alarm signal is triggered. For SIDS, host's logs are inspected to find sequences of commands or actions which have previously been identified as malware. SIDS have also been labelled in the literature as Knowledge-Based Detection or Misuse Detection (Modi et al., 2013).

Figure 1 demonstrates the conceptual working of SIDS approaches. The main idea is to build a database of intrusion signatures and to compare the current set of activities against the existing signatures and raise an alarm if a match is found. For example, a rule in the form of "if: antecedent -then: consequent" may lead to "if (source IP address=destination IP address) then label as an attack ".

Figure 1 Intrusion Patterns

SIDS usually gives an excellent detection accuracy for previously known intrusions (Kreibich & Crowcroft, 2004). However, SIDS has difficulty in detecting zero-day attacks for the reason that no matching signature exists in the database until the signature of the new attack is extracted and stored. SIDS are employed in numerous common tools, for instance, Snort (Roesch, 1999) and NetSTAT (Vigna & Kemmerer, 1999).

Traditional approaches to SIDS examine network packets and try matching against a database of signatures. But these techniques are unable to identify attacks that span several packets. As modern malware is more sophisticated it may be necessary to extract signature information over multiple packets. This requires the IDS to recall the contents of earlier packets. With regards to creating a signature for SIDS, generally, there have been a number of methods where signatures are created as state machines (Meiners et al., 2010), formal language string patterns or semantic conditions (Lin et al., 2011).

The increasing rate of zero-day attacks (Symantec, 2017) has rendered SIDS techniques progressively less effective because no prior signature exists for any such attacks. Polymorphic variants of the malware and the rising amount of targeted attacks can further undermine the adequacy of this traditional paradigm. A potential solution to this problem would be to use AIDS techniques, which operate by profiling what is an acceptable behavior rather than what is anomalous, as described in the next section.

Anomaly-based intrusion detection system (AIDS)

Signature intrusion detection systems (SIDS) are based on pattern matching techniques to find a known attack; these are also known as Knowledge-based Detection or Misuse Detection (Khraisat et al., 2018). In SIDS, matching methods are used to find a previous intrusion. In other words, when an intrusion signature matches with the signature of a previous intrusion that already exists in the signature database, an alarm signal is triggered. For SIDS, host's logs are inspected to find sequences of commands or actions which have previously been identified as malware. SIDS have also been labelled in the literature as Knowledge-Based Detection or Misuse Detection (Modi et al., 2013).

Figure 1 demonstrates the conceptual working of SIDS approaches. The main idea is to build a database of intrusion signatures and to compare the current set of activities against the existing signatures and raise an alarm if a match is found. For example, a rule in the form of "if: antecedent -then: consequent" may lead to "if (source IP address=destination IP address) then label as an attack ".

Figure 1 Intrusion Patterns

SIDS usually gives an excellent detection accuracy for previously known intrusions (Kreibich & Crowcroft, 2004). However, SIDS has difficulty in detecting zero-day attacks for the reason that no matching signature exists in the database until the signature of the new attack is extracted and stored. SIDS are employed in numerous common tools, for instance, Snort (Roesch, 1999) and NetSTAT (Vigna & Kemmerer, 1999).

Traditional approaches to SIDS examine network packets and try matching against a database of signatures. But these techniques are unable to identify attacks that span several packets. As modern malware is more sophisticated it may be necessary to extract signature information over multiple packets. This requires the IDS to recall the contents of earlier packets. With regards to creating a signature for SIDS, generally, there have been a number of methods where signatures are created as state machines (Meiners et al., 2010), formal language string patterns or semantic conditions (Lin et al., 2011).

The increasing rate of zero-day attacks (Symantec, 2017) has rendered SIDS techniques progressively less effective because no prior signature exists for any such attacks. Polymorphic variants of the malware and the rising amount of targeted attacks can further undermine the adequacy of this traditional paradigm. A potential solution to this problem would be to use AIDS techniques, which operate by profiling what is an acceptable behavior rather than what is anomalous, as described in the next section.

Intrusion data sources

The previous two sections categorised IDS on the basis of the methods used to identify intrusions. IDS can also be classified based on the input data sources used to detect abnormal activities. In terms of data sources, there are generally two types of IDS technologies, namely Host-based IDS (HIDS) and Network-based IDS (NIDS). HIDS inspect data that originates from the host system and audit sources, such as operating system, window server logs, firewalls logs, application system audits, or database logs. HIDS can detect insider attacks that do not involve network traffic (Creech & Hu, 2014a).

NIDS monitors the network traffic that is extracted from a network through packet capture, NetFlow, and other network data sources. Network-based IDS can be used to monitor many computers that are joined to a network. NIDS is able to monitor the external malicious activities that could be initiated from an external threat at an earlier phase, before the threats spread to another computer system. On the other hand, NIDSs have limited ability to inspect all data in a high bandwidth network because of the volume of data passing through modern high-speed communication networks (Bhuyan et al., 2014). NIDS deployed at a number of positions within a particular network topology, together with HIDS and firewalls, can provide a concrete, resilient, and multi-tier protection against both external and insider attacks.

Table 4 shows a summary of comparisons between HIDS and NIDS.

Table 4 Comparison of IDS technology types based on their positioning within the computer system

		Advantages	Disadvantages	Data source
Technology	HIDS	• HIDS can check end-to-end encrypted communications behaviour. • No extra hardware required. • Detects intrusions by checking hosts file system, system calls or network events. • Every packet is reassembled • Looks at the entire item, not streams only	• Delays in reporting attacks • Consumes host resources • Needs to be installed on each host. • It can monitor attacks only on the machine where it is installed.	• Audits records, log files, Application Program Interface (API), rule patterns, system calls.
Technology	NIDS	•Detects attacks by checking network packets. •Not required to install on each host. •Can check various hosts at the same period. •Capable of detecting the broadest ranges of network protocols	•Challenge is to identify attacks from encrypted traffic. •Dedicated hardware is required. •It supports only identification of network attacks. •Difficult to analysis high-speed network. •The most serious threat is the insider attack.	•Simple Network Management Protocol (SNMP) •Network packets (TCP/UDP/ICMP), •Management Information Base (MIB) •Router NetFlow records

Creech et al. proposed a HIDS methodology applying discontinuous system call patterns, with the aim to raise detection rates while decreasing false alarm rates (Creech, 2014). The main idea is to use a semantic structure to kernel level system calls to understand anomalous program behaviour.

As shown in Table 5 a number of AIDS systems have also been applied in Network Intrusion Detection System (NIDS) and Host Intrusion Detection System (HIDS) to increase the detection performance with the use of machine learning, knowledge-based and statistical schemes. Table 5 also provides examples of current intrusion detection approaches, where types of attacks are presented in the detection capability field. Data source comprises system calls, application programme interfaces, log files, data packets obtained from well-known attacks. These data sources can be beneficial to classify intrusion behaviors from abnormal actions.

Table 5 Comparisons of IDS technology types, using examples from the literature. "P" indicates pre-defined attacks and "Z" indicates zero-day attacks

Detection Source			HIDS	NIDS	Capability
Detection methods	SIDS		Wagner and Soto (2002)	Hubballi and Suryanarayanan (2014)	P
	AIDS	Statistics based	Ara, Louzada & Diniz (2017)	Tan, et al. (2014); Camacho, et al. (2016)	Z
		Knowledge-based	Mitchell and Chen (2015) Creech and Hu (2014b)	Hendry and Yang (2008) Shakshuki, et al. (2013) Zargar, et al. (2013)
		Machine learning	Du, et al. (2014) Wang, et al. (2010)	Elhag, et al. (2015); Kim, et al. (2014); Hu, et al. (2014)
	SIDS+ AIDS		Alazab, et al. (2014); Stavroulakis and Stamp (2010); Liu, et al. (2015)		P + Z

Techniques for implementing AIDS

This section presents an overview of AIDS approaches proposed in recent years for improving detection accuracy and reducing false alarms.

AIDS methods can be categorized into three main groups: Statistics-based (Chao et al., 2015), knowledge-based (Elhag et al., 2015; Can & Sahingoz, 2015), and machine learning-based (Buczak & Guven, 2016; Meshram & Haas, 2017). The statistics-based approach involves collecting and examining every data record in a set of items and building a statistical model of normal user behavior. On the other hand, knowledge-based tries to identify the requested actions from existing system data such as protocol specifications and network traffic instances, while machine-learning methods acquire complex pattern-matching capabilities from training data. These three classes along with examples of their subclasses are shown in Fig. 2.

Fig. Classification of AIDS methods

Classification of AIDS methods

Statistics-based techniques

A statistics-based IDS builds a distribution model for normal behaviour profile, then detects low probability events and flags them as potential intrusions. Statistical AIDS essentially takes into account the statistical metrics such as the median, mean, mode, and standard deviation of packets. In other words, rather than inspecting data traffic, each packet is monitored, which signifies the fingerprint of the flow. Statistical AIDS are employed to identify any type of differences in the present behavior from normal behavior. Statistical IDS normally use one of the following models.

Univariate: "Uni” means "one”, so it means the data has only one variable. This technique is used when a statistical normal profile is created for only one measure of behaviours in computer systems. Univariate IDS look for abnormalities in each individual metric (Ye et al., 2002).

Multivariate: It is based on relationships among two or more measures in order to understand the relationships between variables. This model would be valuable if experimental data show that better classification can be achieved from combinations of correlated measures rather than analysing them separately. Ye et al. examine a multivariate quality control method to identify intrusions by building a long-term profile of normal activities (Ye et al., 2002). The main challenge for multivariate statistical IDs is that it is difficult to estimate distributions for high-dimensional data.

Time series model: A time series is a series of observations made over a certain time interval. A new observation is abnormal if its probability of occurring at that time is too low. Viinikka et al. used time series for processing intrusion detection alert aggregates (Viinikka et al., 2009). Qingtao et al. presented a method for detecting network abnormalities by examining the abrupt variation found in time series data (Qingtao & Zhiqing, 2005). The feasibility of this technique was validated through simulated experiments.

Knowledge-based techniques

This group of techniques is also referred toas an expert system method. This approach requires creating a knowledge base which reflects the legitimate traffic profile. Actions which differ from this standard profile are treated as an intrusion. Unlike the other classes of AIDS, the standard profile model is normally created based on human knowledge, in terms of a set of rules that try to define normal system activity.

The main benefit of knowledge-based techniques is the capability to reduce false-positive alarms since the system has knowledge about all the normal behaviors. However, in a dynamically changing computing environment, this kind of IDS needs a regular update on knowledge for the expected normal behavior which is a time-consuming task as gathering information about all normal behaviors is very difficult.

Finite state machine (FSM): FSM is a computation model used to represent and control execution flow. This model could be applied in intrusion detection to produce an intrusion detection system model. Typically, the model is represented in the form of states, transitions, and activities. A state checks the history data. For instance, any variations in the input are noted, and based on the detected variation transition happens (Walkinshaw et al., 2016). An FSM can represent legitimate system behaviour, and any observed deviation from this FSM is regarded as an attack.

Description Language: Description language defines the syntax of rules which can be used to specify the characteristics of a defined attack. Rules could be built by description languages such as N-grammars and UML (Studnia et al., 2018).

Expert System: An expert system comprises a number of rules that define attacks. In an expert system, the rules are usually manually defined by a knowledge engineer working in collaboration with a domain expert (Kim et al., 2014).

Signature analysis: it is the earliest technique applied in IDS. It relies on the simple idea of string matching. In string matching, an incoming packet is inspected, word by word, with a distinct signature. If a signature is matched, an alert is raised. If not, the information in the traffic is then matched to the following signature on the signature database (Kenkre et al., 2015b).

AIDS based on machine learning techniques

Machine learning is the process of extracting knowledge from large quantities of data. Machine learning models comprise of a set of rules, methods, or complex "transfer functions" that can be applied to find interesting data patterns, or to recognise or predict behaviour (Dua & Du, 2016).

Machine learning techniques have been applied extensively in the area of AIDS. Several algorithms and techniques such as clustering, neural networks, association rules, decision trees, genetic algorithms, and nearest neighbour methods, have been applied for discovering the knowledge from intrusion datasets (Kshetri & Voas, 2017; Xiao et al, 2018).

Some prior research has examined the use of different techniques to build AIDSs. Chebrolu et al. examined the performance of two feature selection algorithms involving Bayesian networks (BN) and Classification Regression Trees (CRC) and combined these methods for higher accuracy (Chebrolu et al., 2005).

Bajaj et al. proposed a technique for feature selection using a combination of feature selection algorithms such as Information Gain (IG) and Correlation Attribute evaluation. They tested the performance of the selected features by applying different classification algorithms such as C4.5, naïve Bayes, NB-Tree and Multi-Layer Perceptron (Khraisat et al., 2018; Bajaj & Arora, 2013). A genetic-fuzzy rule mining method has been used to evaluate the importance of IDS features (Elhag et al., 2015). Thaseen et al. proposed NIDS by using Random Tree model to improve the accuracy and reduce the false alarm rate (Thaseen & Kumar, 2013). Subramanian et al. proposed classifying NSL-KDD dataset using decision tree algorithms to construct a model with respect to their metric data and studying the performance of decision tree algorithms (Subramanian et al., 2012).

Various AIDSs have been created based on machine learning techniques as shown in Fig. 3. The objective of using machine learning techniques is to create IDS with improved accuracy and less requirement for human knowledge. In the last few years, the quantity of AIDS which have used machine learning methods has been increasing. A key focus of IDS based on machine learning research is to detect patterns and build intrusion detection system based on the dataset. Generally, there are two kinds of machine learning methods, supervised and unsupervised.

Conceptual working of AIDS approaches based on machine learning

Supervised learning in intrusion detection system

This section presents various supervised learning techniques for IDS. Each technique is presented in detail, and references to important research publications are presented.

Supervised learning-based IDS techniques detect intrusions by using labeled training data. A supervised learning approach usually consists of two stages, namely training and testing. In the training stage, relevant features and classes are identified and then the algorithm learns from these data samples. In supervised learning IDS, each record is a pair, containing a network or host data source and an associated output value (i.e., label), namely intrusion or normal. Next, feature selection can be applied for eliminating unnecessary features. Using the training data for selected features, a supervised learning technique is then used to train a classifier to learn the inherent relationship that exists between the input data and the labelled output value. A wide variety of supervised learning techniques have been explored in the literature, each with its advantages and disadvantages. In the testing stage, the trained model is used to classify the unknown data into intrusion or normal class. The resultant classifier then becomes a model that, given a set of feature values, predicts the class to which the input data might belong. Figure 4 shows a general approach for applying classification techniques. The performance of a classifier in its ability to predict the correct class is measured in terms of a number of metrics is discussed in Section 4.

Fig 4 Classification as the task

There are many classification methods such as decision trees, rule-based systems, neural networks, support vector machines, naïve Bayes and nearest-neighbor. Each technique uses a learning method to build a classification model. However, a suitable classification approach should not only handle the training data, but it should also identify accurately the class of records it has not ever seen before. Creating classification models with reliable generalization ability is an important task of the learning algorithm.

Decision trees: A decision tree comprises of three basic components. The first component is a decision node, which is used to identify a test attribute. The second is a branch, where each branch represents a possible decision based on the value of the test attribute. The third is a leaf that comprises the class to which the instance belongs (Rutkowski et al., 2014). There are many different decision trees algorithms including ID3 (Quinlan, 1986), C4.5 (Quinlan, 2014), and CART (Breiman, 1996).

Naïve Bayes: This approach is based on applying Bayes' principle with robust independence assumptions among the attributes. Naïve Bayes answers questions such as "what is the probability that a particular kind of attack is occurring, given the observed system activities?" by applying conditional probability formulae. Naïve Bayes relies on the features that have different probabilities of occurring in attacks and in normal behavior. Naïve Bayes classification model is one of the most prevalent models in IDS due to its ease of use and calculation efficiency, both of which are taken from its conditional independence assumption property (Yang & Tian, 2012). However, the system does not operate well if this independence assumption is not valid, as was demonstrated on the KDD'99 intrusion detection dataset which has complex attribute dependencies (Koc et al., 2012). The results also reveal that the Naïve Bayes model has reduced accuracy for large datasets. A further study showed that the more sophisticated Hidden Naïve Bayes (HNB) model can be applied to IDS tasks that involve high dimensionality, extremely interrelated attributes, and high-speed networks (Koc et al., 2012).

Genetic algorithms (GA): Genetic algorithms are a heuristic approach to optimization, based on the principles of evolution. Each possible solution is represented as a series of bits (genes) or chromosome, and the quality of the solutions improves over time by the application of selection and reproduction operators, biased to favour fitter solutions. In applying a genetic algorithm to the intrusion classification problem, there are typically two types of chromosome encoding: one is according to clustering to generate binary chromosome coding method; another is specifying the cluster center (clustering prototype matrix) by an integer coding chromosome. Murray et al., has used GA to evolve simple rules for network traffic (Murray et al., 2014). Every rule is represented by a genome and the primary population of genomes is a number of random rules. Each genome is comprised of different genes that correspond to characteristics such as IP source, IP destination, port source, port destination and 1 protocol type (Hoque & Bikas, 2012).

Artificial Neural Network (ANN): ANN is one of the most broadly applied machine-learning methods and has been shown to be successful in detecting different malware. The most frequent learning technique employed for supervised learning is the backpropagation (BP) algorithm. The BP algorithm assesses the gradient of the network's error with respect to its modifiable weights. However, for ANN-based IDS, detection precision, particularly for less frequent attacks, and detection accuracy still need to be improved. The training dataset for less-frequent attacks is small compared to that of more-frequent attacks and this makes it difficult for the ANN to learn the properties of these attacks correctly. As a result, detection accuracy is lower for less frequent attacks. In the information security area, huge damage can occur if low-frequency attacks are not detected. For instance, if the User to Root (U2R) attacks evade detection, a cybercriminal can gain the authorization privileges of the root user and thereby carry out malicious activities on the victim's computer systems. In addition, the less common attacks are often outliers (Wang et al., 2010). ANNs often suffer from local minima and thus learning can become very time-consuming. The strength of ANN is that, with one or more hidden layers, it is able to produce highly nonlinear models that capture complex relationships between input attributes and classification labels. With the development of many variants such as recurrent and convolutional NNs, ANNs are powerful tools in many classification tasks including IDS.

Fuzzy logic: This technique is based on the degrees of uncertainty rather than the typical true or false Boolean logic on which the contemporary PCs are created. Therefore, it presents a straightforward way of arriving at a final conclusion based upon unclear, ambiguous, noisy, inaccurate or missing input data. With a fuzzy domain, fuzzy logic permits an instance to belong, possibly partially, to multiple classes at the same time. Therefore, fuzzy logic is a good classifier for IDS problems as the security itself includes vagueness, and the borderline between the normal and abnormal states is not well identified. In addition, the intrusion detection problem contains various numeric features in the collected data and several derived statistical metrics. Building IDSs based on numeric data with hard thresholds produces high false alarms. An activity that deviates only slightly from a model could not be recognized or a minor change in normal activity could produce false alarms. With fuzzy logic, it is possible to model this minor abnormality to keep the false rates low. Elhag et al. showed that with fuzzy logic, the false alarm rate in determining intrusive actions could be decreased. They outlined a group of fuzzy rules to describe the normal and abnormal activities in a computer system, and a fuzzy inference engine to define intrusions (Elhag et al., 2015).

Support Vector Machines (SVM): SVM is a discriminative classifier defined by a splitting hyperplane. SVMs use a kernel function to map the training data into a higher-dimensioned space so that intrusion is linearly classified. SVMs are well known for their generalization capability and are mainly valuable when the number of attributes is large and the number of data points is small. Different types of separating hyperplanes can be achieved by applying a kernel, such as linear, polynomial, Gaussian Radial Basis Function (RBF), or hyperbolic tangent. In IDS datasets, many features are redundant or less influential in separating data points into correct classes. Therefore, feature selection should be considered during SVM training. SVM can also be used for classification into multiple classes. In the work by Li et al., an SVM classifier with an RBF kernel was applied to classify the KDD 1999 dataset into predefined classes (Li et al., 2012). From a total of 41 attributes, a subset of features was carefully chosen by using the feature selection method.

Hidden Markov Model (HMM): HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unseen data. Prior research has shown that HMM analysis can be applied to identify particular kinds of malware (Annachhatre et al., 2015). In this technique, a Hidden Markov Model is trained against known malware features (e.g., operation code sequence) and once the training stage is completed, the trained model is applied to score the incoming traffic. The score is then contrasted to a predefined threshold, and a score greater than the threshold indicates malware. Likewise, if the score is less than the threshold, the traffic is identified as normal.

K-Nearest Neighbors (KNN) classifier: The k-Nearest Neighbor (k-NN) techniques is a typical non-parametric classifier applied in machine learning (Lin et al., 2015). The idea of these techniques is to name an unlabelled data sample to the class of its k nearest neighbors (where k is an integer defining the number of neighbours to be considered). Figure 5 illustrates a K-Nearest Neighbors classifier where k = 5. The point X represents an instance of unlabelled date which needs to be classified. Amongst the five nearest neighbours of X there are three similar patterns from the class Intrusion and two from the class Normal. Taking a majority vote enables the assignment of X to the Intrusion class.

Fig. 5 An example of classification by K-nearest Neighbor for k=5

k-NN can be appropriately applied as a benchmark for all the other classifiers because it provides a good classification performance in most IDSs (Lin et al., 2015).

Unsupervised learning in intrusion detection system

Unsupervised learning is a form of machine learning technique used to obtain interesting information from input datasets without class labels. The input data points are normally treated as a set of random variables. A joint density model is then created for the data set. In supervised learning, the output labels are given and used to train the machine to get the required results for an unseen data point, while in unsupervised learning, no labels are given, and instead, the data is grouped automatically into various classes through the learning process. In the context of developing an IDS, unsupervised learning means, use of a mechanism to identify intrusions by using unlabelled data to train the model.

As shown in Fig. 6, once records are clustered, all of the cases that appear in small clusters are labelled as an intrusion because the normal occurrences should produce sizable clusters compared to the anomalies. In addition, malicious intrusions and normal instances are dissimilar, thus they do not fall into the identical cluster. Fig 6 Using Clustering for Intrusion Detection

K-means: The K-means techniques are one of the most prevalent techniques of clustering analysis that aims to separate 'n' data objects into 'k' clusters in which each data object is selected in the cluster with the nearest mean. It is a distance-based clustering technique and it does not need to compute the distances between all combinations of records. It applies a Euclidean metric as a similarity measure. The number of clusters is determined by the user in advance. Typically several solutions will be tested before accepting the most appropriate one. Annachhatre et.al. used the K-means clustering algorithm to identify different host behaviour profiles (Annachhatre et al., 2015). They have proposed new distance metrics which can be used in the k-means algorithm to closely relate the clusters. They have clustered data into several clusters and associated them with known behavior for evaluation. Their outcomes have revealed that k-means clustering is a better approach to classify the data using unsupervised methods for intrusion detection when several kinds of datasets are available. Clustering could be used in IDS for reducing intrusion signatures, generate a high-quality signature, or group similar intrusion.

Hierarchical Clustering: This is a clustering technique that aims to create a hierarchy of clusters. Approaches for hierarchical clustering are normally classified into two categories:

Agglomerative- bottom-up clustering techniques where clusters have sub-clusters, which in turn have sub-clusters, and pairs of clusters are combined as one moves up the hierarchy.
Divisive - hierarchical clustering algorithms where iteratively the cluster with the largest diameter in feature space is selected and separated into binary sub-clusters with a lower range.

A lot of work has been done in the area of the cyber-physical control system (CPCS) with attack detection and reactive attack mitigation by using unsupervised learning. For example, a redundancy-based resilience approach was proposed by Alcara (Alcaraz, 2018). He proposed a dedicated network sublayer that has the capability to handle the context by regularly collecting consensual information from the driver nodes controlled in the control network itself, and discriminating view differences through data mining techniques such as k-means and k-nearest neighbour. Chao Shen et al. proposed Hybrid-Augmented device fingerprinting for IDS in Industrial Control System Networks. They used different machine learning techniques to analyse network packets to filter anomaly traffic to detect in the intrusions in ICS networks (Shen et al., 2018).

Semi-supervised learning

Semi-supervised learning falls between supervised learning (with totally labelled training data) and unsupervised learning (without any categorized training data). Researchers have shown that semi-supervised learning could be used in conjunction with a small amount of labelled data classifier's performance for the IDSs with less time and costs needed. This is valuable as for many IDS issues, labelled data can be rare or occasional (Ashfaq et al., 2017).

A number of different techniques for semi-supervised learning have been proposed, such as the Expectation Maximization (EM) based algorithms (Goldstein, 2012), self-training (Blount et al., 2011; Lyngdoh et al., 2018), co-training (Rath et al., 2017), Semi-Supervised SVM (Ashfaq et al., 2017), graph-based methods (Sadreazami et al., 2018), and boosting based semi-supervised learning methods (Yuan et al., 2016).

Rana et al. propose a novel fuzzy-based semi-supervised learning approach by applying unlabelled samples aided with a supervised learning algorithm to enhance the classifier's performance for the IDSs. A single hidden layer feed-forward neural network (SLFN) is trained to output a fuzzy membership vector, and the sample categorization (low, mid, and high fuzziness categories) on unlabelled samples is performed using the fuzzy quantity (Ashfaq et al., 2017). The classifier is retrained after incorporating each category separately into the original training set. Their experimental results using this semi-supervised of intrusion detection on the NSL-KDD dataset show that unlabelled samples belonging to low and high fuzziness groups cause foremost contributions to enhance the accuracy of IDS contrasted to traditional.

Performance metrics for IDS

There are many classification metrics for IDS, some of which are known by multiple names. Table 6 shows the confusion matrix for a two-class classifier which can be used for evaluating the performance of an IDS. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

IDS are typically evaluated based on the following standard performance measures:

True Positive Rate (TPR): It is calculated as the ratio between the number of correctly predicted attacks and the total number of attacks. If all intrusions are detected then the TPR is 1 which is extremely rare for an IDS. TPR is also called a Detection Rate (DR) or the Sensitivity. The TPR can be expressed mathematically as

\(T P R=\dfrac{T P}{T P+F N}\)
False Positive Rate (FPR): It is calculated as the ratio between the number of normal instances incorrectly classified as an attack and the total number of normal instances.

\(F P R=\dfrac{F P}{F P+T N}\)
False Negative Rate (FNR): False negative means when a detector fails to identify an anomaly and classifies it as normal. The FNR can be expressed mathematically as:

\(F N R=\dfrac{F N}{F N+T P}\)
Classification rate (CR) or Accuracy: The CR measures how accurate the IDS is in detecting normal or anomalous traffic behavior. It is described as the percentage of all those correctly predicted instances to all instances:

\(Accuracy=\dfrac{T P+T N}{T P+T N+F P+F N}\)

Receiver Operating Characteristic (ROC) curve: ROC has FPR on the x-axis and TPR on the y-axis. In ROC curve the TPR is plotted as a function of the FPR for different cut-off points. Each point on the ROC curve represents a FPR and TPR pair corresponding to a certain decision threshold. As the threshold for classification is varied, a different point on the ROC is selected with different False Alarm Rate (FAR) and different TPR. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). The ROC Curve is shown in Fig. 7.

Fig 7 ROC Curve

Ensemble methods

Multiple machine learning algorithms can be used to obtain better predictive performance than any of the constituent learning algorithms alone. A number of different ensemble methods have been proposed, such as Boosting, Bagging, and Stacking.

Boosting refers to a family of algorithms that are able to transform weak learners to strong learners. Bagging means training the same classifier on different subsets of same dataset. Stacking combines various classification via a meta-classifier (Aburomman & Reaz, 2016). The base-level models are built based on a whole training set, then the meta-model is trained on the outputs of the base level model as attributes.

Jabbar et al. proposed an ensemble classifier that is built using Random Forest and also the Average One-Dependence Estimator (AODE which solves the attribute dependency problem in Naïve Bayes classifier. Random Forest (RF) enhances precision and reduces false alarms (Jabbar et al., 2017). Combining both approaches in an ensemble results in improved accuracy over either technique applied independently.

Hybrid based techniques

Traditional IDSs have limitations: that they cannot be easily modified, inability to identify new malicious attacks, low accuracy, and high false alarms. Where AIDS has a limitation such as high false-positive rate. Hybrid IDS is based on the combination of SIDS and AIDS. A Hybrid IDS overcomes the disadvantage of SIDS and AIDS. Farid et al. (Farid et al., 2010) proposed hybrid IDS by using Naive Bayes and decision tree based and achieved detection rate of 99.63% on the KDD’99 dataset.

Intrusion detection datasets

The evaluation datasets play a vital role in the validation of any IDS approach, by allowing us to assess the proposed method’s capability in detecting intrusive behavior. The datasets used for network packet analysis in commercial products are not easily available due to privacy issues. However, there are a few publicly available datasets such as DARPA, KDD, NSL-KDD, and ADFA-LD, and they are widely used as benchmarks. Existing datasets that are used for building and comparative evaluation of IDS are discussed in this section along with their features and limitations.

DARPA / KDD Cup99

The earliest effort to create an IDS dataset was made by DARPA (Defence Advanced Research Project Agency) in 1998 and they created the KDD98 (Knowledge Discovery and Data Mining (KDD)) dataset. In 1998, DARPA introduced a programme at the MIT Lincoln Labs to provide a comprehensive and realistic IDS benchmarking environment (MIT Lincoln Laboratory, 1999). Although this dataset was an important contribution to the research on IDS, its accuracy and capability to consider real-life conditions have been widely criticized (Creech & Hu, 2014b).

These datasets were collected using multiple computers connected to the Internet to model a small US Air Force base of restricted personnel. Network packets and host log files were collected. Lincoln Labs built an experimental testbed to obtain 2 months of TCP packets dump for a Local Area Network (LAN), modelling a usual US Air Force LAN. They modelled the LAN as if it were a true Air Force environment, but interlaced it with several simulated intrusions.

The collected network packets were around four gigabytes containing about 4,900,000 records. The test data of 2 weeks had around 2 million connection records, each of which had 41 features and was categorized as normal or abnormal.

The extracted data is a series of TCP sessions starting and ending at well-defined times, between which data flows to and from a source IP address to a target IP address, which contains a large variety of attacks simulated in a military network environment. The 1998 DARPA Dataset was used as the basis to derive the KDD Cup99 dataset which has been used in Third International Knowledge Discovery and Data Mining Tools Competition (KDD, 1999). The 41 features of the KDD Cup99 dataset are presented in Table 7.

These datasets are out-of-date as they do not contain records of recent malware attacks. For example, attackers' behaviors are different in different network topologies, operating systems, and software and crime toolkits. Nevertheless, KDD99 remains in use as a benchmark within IDS research community and is still presently being used by researchers (Alazab et al., 2014; Duque & Omar, 2015; Ji et al., 2016).

CAIDA

This dataset contains network traffic traces from Distributed Denial-of-Service (DDoS) attacks, and was collected in 2007 (Hick et al., 2007). This type of denial-of-service attack attempts to interrupt normal traffic of a targeted computer, or network by overwhelming the target with a flood of network packets, preventing regular traffic from reaching its legitimate destination computer. One disadvantage of the CAIDA dataset is that it does not contain a diversity of the attacks. In addition, the gathered data does not contain features from the whole network which makes it difficult to distinguish between abnormal and normal traffic flows.

NSL-KDD

NSL-KDD is a public dataset, which has been developed from the earlier KDD cup99 dataset (Tavallaee et al., 2009). A statistical analysis performed on the cup99 dataset raised important issues that heavily influence the intrusion detection accuracy and results in a misleading evaluation of AIDS (Tavallaee et al., 2009).

The main problem in the KDD data set is the huge amount of duplicate packets. Tavallaee et al. analyzed KDD training and test sets and revealed that approximately 78% and 75% of the network packets are duplicated in both the training and testing dataset (Tavallaee et al., 2009). This huge quantity of duplicate instances in the training set would influence machine-learning methods to be biased towards normal instances and thus prevent them from learning irregular instances that are typically more damaging to the computer system. Tavallaee et al. built the NSL-KDD dataset in 2009 from the KDD Cup'99 dataset to resolve the matters stated above by eliminating duplicated records (Tavallaee et al., 2009). The NSL-KDD train dataset consists of 125,973 records and the test dataset contains 22,544 records. The size of the NSL-KDD dataset is sufficient to make it practical to use the whole NSL-KDD dataset without the necessity to sample randomly. This has produced consistent and comparable results from various research works. The NSL_KDD dataset comprises 22 training intrusion attacks and 41 attributes (i.e., features). In this dataset, 21 attributes refer to the connection itself and 19 attributes describe the nature of connections within the same host (Tavallaee et al., 2009).

ISCX 2012

In this dataset, real network traffic traces were analyzed to identify normal behaviour for computers from real traffic of HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols (Shiravi et al., 2012). This dataset is based on realistic network traffic, which is labeled and contains diverse attacks scenarios.

ADFA-LD and ADFA-WD

Researchers at the Australian Defence Force Academy created two datasets (ADFA-LD and ADFA-WD) as public datasets that represent the structure and methodology of the modern attacks (Creech, 2014). The datasets contain records from both Linux and Windows operating systems; they are created from the evaluation of system-call-based HIDS. Ubuntu Linux version 11.04 was used as the host operating system to build ADFA-LD (Creech & Hu, 2014b). Some of the attack instances in ADFA-LD were derived from new zero-day malware, making this dataset suitable for highlighting differences between SIDS and AIDS approaches to intrusion detection. It comprises three dissimilar data categories, each group of data containing raw system call traces. Each training dataset was gathered from the host for normal activities, with user behaviors ranging from web browsing to LATEX document preparation. Table 8 shows some of the ADFA-LD features with the type and the description for each feature.

Table 8 Features of ADFA-LD dataset (Creech, 2014)

Name	Type	Description
srcip	nominal	Source IP address
sport	integer	Source port number
dstip	nominal	Destination IP address
dsport	integer	Destination port number
proto	nominal	Transaction protocol
state	nominal	Indicates to the state and its dependent protocol
dur	Float	Record total duration
sbytes	Integer	Source to destination transaction bytes
dbytes	Integer	Destination to source transaction bytes
sttl	Integer	Source to destination time to live value
dttl	Integer	Destination to source time to live value
sloss	Integer	Source packets retransmitted or dropped
dloss	Integer	Destination packets retransmitted or dropped
service	nominal	http, ftp, smtp, ssh, dns, ftp-data ,irc and (-) if not much used service
Sload	Float	Source bits per second
Dload	Float	Destination bits per second
Spfcts	integer	Source to destination packet count
Dpkts	integer	Destination to source packet count
swin	integer	Source TCP window advertisement value
dwin	integer	Destination TCP window advertisement value
stcpb	integer	Source TCP base sequence number
dtcpb	integer	Destination TCP base sequence number
smeansz	integer	Mean of the how packet size transmitted by the src
dmeansz	integer	Mean of the how packet size transmitted by the dst
trans_depth	integer	Represents the pipelined depth into the connection of http request response transaction
resbdvlen	integer	Actual uncompressed content size of the data transferred from the server's http service

ADFA-LD also incorporates system call traces of different types of attacks. The ADFA Windows Dataset (ADFA-WD) provides a contemporary Windows dataset for evaluation of HIDS. Table 9 shows the number of systems calls for each category of AFDA-LD and AFDA-WD Table 10 describes details of each attack class in the ADFA-LD dataset. Table 11 lists the ADFA-WD Vectors and Effects.

Table 9 Number of system calls traces in different categories of AFDA-LD and AFDA-WD

ADFA- LD			ADFA-WD
Dataset	Traces	System Calls	Traces	System Calls
Training data	833	308,077	355	13,504,419
Validation data	4372	2,122,085	1827	117,918,735
Attack data	746	317,388	5542	74,202,804
Total	5951	2,747,550	7724	205,625,958

Table 10 ADFA-LD attack class

Attack	Payload	Vector	Count
Hydra-FTP	Password brute force	FTP by Hydra	162
Hydra-SSH	Password brute force	SSH Hydra	176
Adduser	Add new super user	Client-side poisoned executable	91
Java-Meterpreter	Java based Meterpreter	TIkiWiki vulnerability exploit	124
Meterpreter	Linux Meterpreter Payload	Client side poisoned executable	75
Webshell	C100 Webshell	PHP remote file inclusion vulnerability	118

Table 11 ADFA-WD Vectors and Effects

Vectors

TCP ports - Web-based vectors;
Browser attacks - Malware attachments.

Effects

Effects - Bind Shell - Reverse shell - Exploitation
Remote operation - Staging - System manipulation
Privilege escalation - Data exfiltration -Back-door insertion

CICIDS 2017

CICIDS2017 dataset comprises both benign behaviour and also details of new malware attacks: such as Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS (Sharafaldin et al., 2018). This dataset is labelled based on the timestamp, source and destination IPs, source and destination ports, protocols, and attacks. A complete network topology was configured to collect this dataset which contains Modem, Firewall, Switches, Routers, and nodes with different operating systems (Microsoft Windows (like Windows 10, Windows 8, Windows 7, and Windows XP), Apple's macOS iOS, and open-source operating system Linux). This dataset contains 80 network flow features from the captured network traffic.

Comparison of public IDS datasets

Since machine learning techniques are applied in AIDS, the datasets that are used for the machine learning techniques are very important to assess these techniques for realistic evaluation. Table 12 summarises popular public data sets, as well as some analysis techniques and results for each dataset from prior research. Table 13 summarizes the characteristics of the datasets.

Table 12 Comparison of results achieved by various methods on publically available IDS datasets

Dataset	Result	Observations	Reference
DARPA 98	Snort's detection, 69% of total generated alerts are considered to be false alarms.	SIDS is applied without AIDS	Hu, et al. (2009)
	ANN analysis system calls, 96% detection rate.	A classifier based on artificial neural network (ANN) has been executed for preparing and testing of framework.	McHugh (2000)
	SVM on subset of DARPA 98, 99.6% detection rate.	SVM isolates information into various classes by a hyperplane or hyperplanes since it can deal with multidimensional information. SVM usually demonstrate good performance for a binary class problem.	Chen, et al. (2005)
KDDCUP 99	Multivariate statistical analysis of audit data, 90% detection rate	Multivariate is used to reduce false alarm rates.	Ye, et al. (2002), Hotta, et al. (2008)
	The best results have been achieved by the C4.5 algorithm which attains the 95% true positive rate.	The decision trees created by C4.5 can be utilized for classification	Ferrari and Cribari-Neto (2004); Shafi and Abbass (2013); Laskov, et al. (2005)
	SMO classifier 97% detection rate.	This SVM based classifier with SMO implementation produces good detection accuracy. However, the accuracy reported is less than that in (Chen et al., 2005), because the KDDCUP 99 dataset is more complex and comprehensive than DARPA 98 dataset.	Shafi and Abbass (2013)
	The best model is an HNB model, where 95% confidence level is used to compare the models.	Hidden Naïve Bayes (HNB) techniques could be applied to IDS area that suffer from dimensionality, highly associated attributes and high network speed. HNB technique is better than the one based on the traditional NB method in terms of detection accuracy for IDS.	Koc, et al. (2012)
NSL-KDD	K-Nearest Neighbour (k-NN) algorithm, the detection rate of 94%.	The k-NN algorithm uses all labelled training instances as a model of the target function. During the classification phase, k-NN uses a similarity-based search strategy to determine a locally optimal hypothesis function.	Adebowale, et al. (2013)
	Naïve Bayes, the detection rate is 89%.	Bayesian classifiers provide moderate accuracy because the focus is on classifying the classes for the instances, not the exact probabilities.	Adebowale, et al. (2013)
	C4.5 gave the best detection rate of 99%.	C4.5 selects the feature of the data that most efficiently divides its set of samples into subsets, contributing to improved accuracy	Thaseen and Kumar (2013)
	SMO classifier, the detection rate is 97%.	The work also uses SVM based classifier and achieves detection rate similar to (Chen et al., 2005).	Adebowale, et al. (2013)
	Expectation Maximization (EM) clustering, the accuracy is 78%	EM forms a "soft" task of each row to various clusters in percentage to the probability of each cluster. The accuracy in this method is low as EM does not give a parameter covariance matrix for standard errors	Ahmed, et al. (2016)
ADFA-WD	Creech et al. have used Hidden Markov Model (HMM), Extreme Learning Machine (ELM) and SVM. They reported 74.3% accuracy for HMM, 98.57% accuracy for ELM and 99.64% accuracy for SVM.	The ADFA-WD is a much new data set and contains new attacks. This is why reported accuracy was not as good as for every machine learning technique when compared to the accuracy using legacy KDD98 data. SVM has been reported to produce the highest accuracy.	Creech and Hu (2014b)
ADFA-LD	100% accuracy for using ELM using original semantic feature	New semantic features are applied. Therefore, ELM, are capable to use the new semantic feature easily and quickly by including amounts of semantic phrases.	Creech and Hu (2014b)
CICIDS2017	94.5% accuracy obtained by using MLP solely, by using MLP and Payload Classifier together 95.2% accuracy rate is detected.	Feature selection is done by using Fisher Score algorithm.	Usteba, et al. (2018)
Bot-IoT	The highest accuracy from the SVM model. 98% detection rate	This SVM based method has produced good detection accuracy (Mitchell & Chen, 2015; Chen et al., 2005; Ferrari & Cribari-Neto, 2004)	Koroniotis, et al. (2018)

Table 13 Compassion of datasets (✔ = True, ✖ = False)

Dataset	Realistic Traffic	Label data	IoT traces	Zero-day attacks	Full packet captured	Year
DARPA 98	✔	✔	✖	✖	✔	1998
KDDCUP 99	✔	✔	✖	✖	✔	1999
CAIDA	✔	✖	✖	✖	✖	2007
NSL-KDD	✔	✔	✖	✖	✔	2009
ISCX 2012	✔	✔	✖	✖	✔	2012
ADFA-WD	✔	✔	✖	✔	✔	2014
ADFA-LD	✔	✔	✖	✔	✔	2014
CICIDS2017	✔	✔	✖	✔	✔	2017
Bot-IoT	✔	✔	✔	✔	✔	2018

Feature selection for IDS

Feature selection is helpful to decrease the computational difficulty, eliminate data redundancy, enhance the detection rate of the machine learning techniques, simplify data and reduce false alarms. In this line of research, some methods have been applied to develop a lightweight IDSs.

Feature selection techniques can be categorized into wrapper and filter methods. Wrapper methods estimate subgroups of variables to identify the feasible interactions between variables. There are two main drawbacks of these techniques: accumulative overfitting when the amount of data is insufficient and the important calculation time when the amount of variables is big.

Filter methods are normally applied as a pre-processing stage. The selection of features is separate of any machine learning techniques. As an alternative, features are nominated on the basis of their scores in several statistical tests for their correlation with the consequence variable.

As an example of the impact of feature selection on the performance of an IDS, consider the results in Table 14 which show the detection accuracy and time to build the IDS mode of the C4.5 classifier using the full dataset with 41 features of NSl-KDD dataset and with different features.

Table 14 Detailed accuracy for C4.5 Decision tree classifier with different feature sets

Filter techniques	# of features	Accuracy	Time
Full set	41	99.55	2.76 Sec
Info Gain	13	99.64	0.84 Sec
Gain ratio	13	99.64	1.31 Sec
Chi-squared	13	99.65	0.92 Sec
Relief	13	99	0.93 Sec

Types of computer attacks

Cyber-attacks can be categorized based on the activities and targets of the attacker. Each attack type can be classified into one of the following four classes (Sung & Mukkamala, 2003):

Denial-of-Service (DoS) attacks have the objective of blocking or restricting services delivered by the network, computer to the users.
Probing attacks have the objective of acquisition of information about the network or the computer system.
User-to-Root (U2R) attacks have the objective of a non-privileged user acquiring root or admin-user access on a specific computer or a system on which the intruder had user-level access.
Remote-to-Local (R2L) attacks involve sending packets to the victim machine. The cybercriminal learns the user's activities and obtains privileges which an end-user could have on the computer system.

Within these broad categories, there are many different forms of computer attacks. A summary of these attacks with a brief explanation, characteristics, and examples are presented in Table 15.

Table 15 Classes of computer attacks

Types of Attack	Explanation	Example
Buffer Overflow	Attacks the buffer's boundaries and overwrites memory area.	Long URL strings are a common input. Cowan, et al. (1998)
Worm	Reproduces itself on the local host or through the network.	SQL Slammer, Mydoom, CodeRed Nimda.
Trojan	Programs appear attractive and genuine, but have malicious code embedded inside them.	Zeus, SpyEye Alazab, et al. (2013)
Denial of service (DoS)	A security event to disrupt the network services. It is started by forcing reset on the target computers. The users can no longer connect to the system because of unavailability of service.	Buffer overflow, Ping of death (PoD), TCP SYN, smurf, teardrop Zargar, et al. (2013)
Common Gateway Interface (CGI) Scripts	The attacker takes advantage of CGI scripts to create an attack by sending illegitimate inputs to the web server.	Phishing email; Aljawarneh (2016)
Traffic Flooding	Attacks the limited size of NIDS to handle huge traffic loads and to investigate for possible intrusions. If a cybercriminal can cause congestion in the networks, then NIDS will be busy in analyzing the traffic.	Denial of Service (Dos) or Distributed Denial of Service (DDoS) Zargar, et al. (2013)
Physical Attack	Aims to attack the physical mechanisms of the computer system.	Cold boot, evil maid (Pasqualetti et al., 2013).
Password Attack	Aims to break the password within a small time, and is noticed by a sequence of failures login.	A dictionary attack, Rainbow attack (Das et al., 2014).
Information Gathering	Gathers information or finds weaknesses in computers or networks by sniffing or searching.	System scan, port scan, (Bou-Harb et al., 2014).
User to Root (U2R) attack	The cybercriminal accesses as a normal user in the beginning and then upgrades to a super-user which may lead to exploitation of several vulnerabilities of the system.	Intercept packets, rainbow attack, social engineering Rootkit, load module, (Perl Raiyn, 2014).
Remote to Local (R2L) attack	The cybercriminal sends packets to a remote system by connecting to the network without having an account on the system.	Warezclient, ftp write, multihop,phf, spy, warezmaster, imap (Raiyn, 2014).
Probe	Identifying the valid IP addresses by scanning the network to gather host data packets.	Sweep, portsweep (So-In et al., 2014)

IDS evasion techniques

This section discusses the techniques that a cybercriminal may use to avoid detection by IDS such as Fragmentation, Flooding, Obfuscation, and Encryption. These techniques pose a challenge for the current IDS as they circumvent existing detection methods.

Fragmentation

A packet is divided into smaller packets. The fragmented packets are then be reassembled by the recipient node at the IP layer before forwarding it to the Application layer. To examine fragmented traffic correctly, the network detector needs to assemble these fragments similarly as it was at fragmenting point. The restructuring of packets needs the detector to hold the data in memory and match the traffic against a signature database. Methods used by attackers to escape detection by hiding attacks as legitimate traffic are fragmentation overlap, overwrite, and timeouts (Ptacek & Newsham, 1998; Kolias et al., 2016). Fragmentation attack replaces information in the constituent fragmented packets with new information to generate a malicious packet.

Fig 8 Fragment - Overwrite

Figure 8 shows the fragment overwrite. Packet Fragment 3 is generated by the attacker. The network intrusion detector must retain the state for all of the packets of the traffic which it is detecting.

The duration of time that the detector can maintain a state of traffic might be smaller than the period that the destination host can maintain a state of traffic (Xiong et al., 2017). The malware authors try to take advantage of any shortcoming in the detection method by delivering attack fragments over a long time.

Flooding

The attacker begins the attack to overwhelm the detector and this causes a failure of control mechanism. When the detector fails, all traffic would be allowed (Kolias et al., 2016). A popular method to create a flooding situation is spoofing the legitimate User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP). The traffic flooding is used to disguise the abnormal activities of the cybercriminal. Therefore, IDS would have extreme difficulty to find malicious packets in a huge amount of traffic.

Obfuscation

Obfuscation techniques can be used to evade detection, which are the techniques of concealing an attack by making the message difficult to understand (Kim et al., 2017). The terminology of obfuscation means changing the program code in a way that keeps it functionally identical with the aim to reduce detectability to any kind of static analysis or reverse engineering process and making it obscure and less readable. This obfuscation of malware enables it to evade current IDS.

Obfuscation attempts to utilize any limitations in the signature database and its capability to duplicate the way the computer host examines computer's data (Alazab & Khresiat, 2016). An effective IDS should be supporting the hexadecimal encoding format or having these hexadecimal strings in its set of attack signatures (Cova et al., 2010). Unicode/UTF-8 standard permits one character to be symbolized in several various formats. Cybercriminals may also use double-encoded data, exponentially escalating the number of signatures required to detect the attack.

SIDS relies on signature matching to identify malware where the signatures are created by human experts by translating a malware from machine code into a symbolic language such as Unicode. However, the use of code obfuscation is very valuable for cybercriminals to avoid IDSs.

Encryption

Generally, encryption offers a number of security services, such as data confidentiality, integrity, and privacy. Malware authors employ these security attributes to escape detection and conceal attacks that may target a computer system. For example, attacks on encrypted protocols such as HyperText Transfer Protocol Secure (HTTPS) cannot be read by an IDS (Metke & Ekl, 2010). The IDS cannot match the encrypted traffic to the existing Database signatures if it doesn't interpret the encrypted traffic. Therefore, examining encrypted traffic makes it difficult for detectors to detect attacks (Butun et al., 2014). For example, packet content-based features have been applied extensively to identify malware from normal traffic, which cannot readily be applied if the packet is encrypted.

These challenges motivate investigators to use some statistical network flow features, which do not rely on packet content (Camacho et al., 2016). As a result of this, malware can potentially be identified from normal traffic.

Challenges of IDS

Although there has been a lot of research on IDSs, many essential matters remain. IDSs have to be more accurate, with the capability to detect a varied ranging of intrusions with fewer false alarms and other challenges.

Challenges of IDS for ICSs

Industrial Control Systems (ICSs) are commonly comprised of two components: Supervisory Control and Data Acquisition (SCADA) hardware which receives information from sensors and then controls the mechanical machines; and the software that enables human administrators to control the machines.

Cyber attacks on ICSs are a great challenge for the IDS due to the unique architectures of ICSs as the attackers are currently focusing on ICSs. A standout amongst the recent attacks against ICSs is the Stuxnet attack, which is known as the first cyber-warfare weapon. Dissimilar to a typical attack, the primary target of Stuxnet was probably the Iranian atomic program (Nourian & Madnick, 2018). Attacks that could target ICSs could be state-sponsored or they might be launched by the competitors, internal attackers with a malicious target, or even hacktivists.

The potential consequences of compromised ICS can be devastating to public health and safety, national security, and the economy. Compromised ICS systems have led to extensive cascading power outages, dangerous toxic chemical releases, and explosions. It is therefore important to use secure ICSs for reliable, safe, and flexible performance.

It is critical to have IDS for ICSs that takes into account unique architecture, real-time operation, and dynamic environment to protect the facilities from the attacks. Some critical attacks on ICSs are given below:

In 2008, Conficker malware infected ICS systems, such as an aeroplane's internal systems. Conficker disables many security features and automatic backup settings, erases stored data, and opens associations to get commands from a remote PC (Pretorius & van Niekerk, 2016).
In 2009, a 14-year-old schoolboy hacked the city's tram system and used a homemade remote device to redirect a number of trams, injuring 12 passengers (Rege-Patwardhan, 2009).
In 2017, WannaCry ransomware spread globally and seriously affected the National Health System, UK, and prevented emergency clinic specialists from using health systems (Mohurle & Patil, 2017).

Since Microsoft no longer creates security patches for legacy systems, they can simply be attacked by new types of ransomware and zero-day malware.

Similarly, it may not be possible to fix or update the operating systems of ICSs for legacy applications.

A robust IDS can help industries and protect them from the threat of cyber attacks. Unfortunately, current intrusion detection techniques proposed in the literature focus at the software level. A vital detection approach is needed to detect the zero-day and complex attacks at the software level as well as at the hardware level without any previous knowledge. This can be done by integrating both hardware and software intrusion detection systems and extracting useful features of both HIDS and NIDS.

Challenge of IDS on intrusion evasion detection

Detecting attacks masked by evasion techniques is a challenge for both SIDS and AIDS. The ability of evasion techniques would be determined by the ability of IDS to bring back the original signature of the attacks or create new signatures to cover the modification of the attacks. Robustness of IDS to various evasion techniques still needs further investigation. For example, SIDS in regular expressions can detect the deviations from simple mutation such as manipulating space characters, but they are still useless against a number of encryption techniques.

Discussion and conclusion

Cybercriminals are targeting computer users by using sophisticated techniques as well as social engineering strategies. Some cybercriminals are becoming increasingly sophisticated and motivated. Cybercriminals have shown their capability to obscure their identities, hide their communication, distance their identities from illegal profits, and use infrastructure that is resistant to compromise. Therefore, it becomes increasingly important for computer systems to be protected using advanced intrusion detection systems that are capable of detecting modern malware. In order to design and build such IDS systems, it is necessary to have a complete overview of the strengths and limitations of contemporary IDS research.

In this paper, we have presented, in detail, a survey of intrusion detection system methodologies, types, and technologies with their advantages and limitations. Several machine learning techniques that have been proposed to detect zero-day attacks are reviewed. However, such approaches may have the problem of generating and updating the information about new attacks and yield high false alarms or poor accuracy. We summarized the results of recent research and explored the contemporary models on the performance improvement of AIDS as a solution to overcome IDS issues.

In addition, the most popular public datasets used for IDS research have been explored and their data collection techniques, evaluation results, and limitations have been discussed. As normal activities are frequently changing and may not remain effective over time, there exists a need for newer and more comprehensive datasets that contain a wide spectrum of malware activities. A new malware dataset is needed, as most of the existing machine learning techniques are trained and evaluated on the knowledge provided by the old dataset such as DARPA/ KDD99, which do not include newer malware activities. Therefore, testing is done using these dataset collected in 1999 only, because they are publicly available and no other alternative and acceptable datasets are available. While widely accepted as benchmarks, these datasets no longer represent contemporary zero-day attacks. Though ADFA dataset contains many new attacks, it is not adequate. For that reason, testing of AIDS using these datasets does not offer a real evaluation and could result in inaccurate claims for their effectiveness.

This study also examines four common evasion techniques to determine their ability to evade the recent IDSs. An effective IDS should be able to detect different kinds of attacks accurately including intrusions that incorporate evasion techniques. Developing IDSs capable of overcoming the evasion techniques remains a major challenge for this area of research.

Site:	Saylor Academy
Course:	CS406: Information Security
Book:	Intrusion Detection Systems

Printed by:	Guest user
Date:	Thursday, 22 May 2025, 5:21 AM