PhD Seminar – Tarfa Hamed
Join us Wednesday, June 29, 2016, at 2:00pm in Reynolds 219 for a Seminar by Ph.D. Candidate Tarfa Hamed.
Embedded Feature Selection for Interdependent Features with Application to Intrusion Detection
Intrusion detection is a significant component in achieving a secure information system. One challenge facing Intrusion Detection Systems (IDS) is recognizing new threats quickly. Machine learning can be employed to distinguish between malicious and normal activity. However, in order to apply machine learning, relevant features need to be identified and presented to a learning system (e.g. an SVM). Typically, there are many features but only a few examples (especially of emerging threats). Judicious feature selection can help the classifier learn to distinguish between threats and normal activity.
Recursive Feature Elimination (RFE) is an established and well-known feature selection method that has been used successfully since it was proposed by Guyon et al. in 2002. However, there are reports in the literature that this method has a limitation surrounding interdependent features. Interdependent features are features that individually provide no useful information for the decision to be made, but are useful in combination. This apparent limitation prompted us to investigate it experimentally. To do so, we applied RFE to a simple problem: an embedded version of the XOR problem. We observed that RFE failed to solve the XOR problem, eliminating one of the relevant features instead of keeping it.
Therefore, in this work, we propose a novel feature selection method that overcomes this limitation of RFE. The new method is called Recursive Feature Addition (RFA), and it is a form of embedded feature selection. Like RFE, RFA is built upon Support Vector Machines (SVMs) as a core classifier. It works recursively by adding one feature at a time, in contrast to RFE, which removes one feature at a time. RFA was applied to the XOR problem and showed much better performance than RFE. In addition to the synthetic XOR data set, RFA and RFE have been tested on real-world data sets.
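The notion of interdependent features can be made concrete with a small toy sketch. It is not the RFE or RFA procedure from the seminar and it does not use an SVM. It only assumes a hypothetical embedded XOR data set in which the label is the XOR of features 0 and 1 and feature 2 is irrelevant, and it scores each feature subset with a simple majority-vote lookup table (a helper introduced here purely for illustration).

```python
from collections import Counter, defaultdict
from itertools import combinations, product


def xor_dataset():
    """All 8 binary rows (x0, x1, x2); the label is x0 XOR x1, x2 is irrelevant."""
    rows = list(product([0, 1], repeat=3))
    labels = [x0 ^ x1 for x0, x1, _ in rows]
    return rows, labels


def best_lookup_accuracy(rows, labels, subset):
    """Accuracy of the best classifier that sees only the features in `subset`.

    Rows are grouped by their values on `subset`, and each group predicts
    its majority label. This is an illustrative stand-in, not an SVM.
    """
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups[key][y] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


rows, labels = xor_dataset()
scores = {
    subset: best_lookup_accuracy(rows, labels, subset)
    for size in (1, 2)
    for subset in combinations(range(3), size)
}
for subset, acc in scores.items():
    print(subset, acc)
```

In this toy setting every single feature scores at chance level (0.5), while the pair (0, 1) scores 1.0. A selection method that judges features one at a time therefore has no basis for preferring a relevant feature over the irrelevant one.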
The real-world data sets were chosen to have a large number of features (most of them are microarray data sets) and relatively few examples. The performance of the proposed method was evaluated using four metrics: the accuracy and F-measure of the SVM classifier, and two joint metrics. RFA was statistically superior to RFE in both accuracy and F-measure on all of the synthetic XOR data sets. RFA was also statistically superior in terms of accuracy on most of the real-world data sets and provided a better F-measure on all of the real-world data sets. Our work will then apply the proposed RFA feature selection to an intrusion detection problem. For this purpose we chose the ISCX 2012 data set, the most recent intrusion detection data set. We will evaluate the performance of RFA on this data set using the SVM classifier after applying feature selection.
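Two of the metrics named above, accuracy and F-measure, can be computed from a classifier's predictions as in the sketch below. It assumes the balanced F1 form of the F-measure (the abstract does not say which variant is used), and the labels are made-up illustrative values with 1 standing for an attack. The two joint metrics are not defined in the abstract and are not shown.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred, positive=1):
    """Fraction of examples classified correctly."""
    tp, _, _, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 attacks (1) among 10 examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_normal = [0] * 10                       # never raises an alarm
detector = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]      # one hit, one miss, one false alarm

for name, y_pred in [("always_normal", always_normal), ("detector", detector)]:
    print(name, accuracy(y_true, y_pred), f_measure(y_true, y_pred))
```

Both hypothetical predictors reach the same accuracy here, but only the one that actually detects an attack has a non-zero F-measure, which is one reason to report both metrics when attacks are rare.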
Advisor: Dr. Stefan Kremer
Advisory Committee: Dr. Rozita Dara