Project work done for the IP1302 course of our 5th Semester.
Reference Paper: MCAD: A Machine Learning Based Cyberattacks Detector in Software-Defined Networking (SDN) for Healthcare Systems [5]
- Ashwin Santhosh (CS22B1005, GitHub: ash0545)
- Aswin Valsaraj (CS22B1006, GitHub: aswinn03)
- Kaustub Pavagada (CS22B1042, GitHub: Kaustub26Pvgda)
Our overall goal was to delve into the creation of complex Software Defined Network (SDN) topologies - through the use of the Ryu SDN Framework, Mininet and VirtualBox Virtual Machines (VMs) - and the collection of flow statistics through Ryu's inbuilt ofctl_rest Application Programming Interface (API). Using this setup, we simulated various attack and normal traffic flows within our emulated network to build a dataset, which was then used to train and compare Machine Learning (ML) models for an ML-based Intrusion Detection System (IDS).
In short: To strengthen security in SDNs by creating a dataset and comparing ML models for an ML-based IDS present within the controller.
- Ryu SDN Framework: for the control plane of our simulated network
- Mininet: creation of virtual network within a VM
- Oracle VM VirtualBox: creation and management of VMs
- Ubuntu 18.04.6 LTS: hosting the Ryu controller, with all traffic flow passing through it
- Kali: intruder machine
- Metasploitable 2: victim machine
- Wireshark: to verify functionality of topology
- Scapy: packet manipulation Python library for simulating attacks
- Nmap: network discovery tool, to simulate a probe attack
- Selenium: browser automation for simulating web attacks
- w3m: text-based web browser for use in terminals, to simulate normal traffic flow
- iPerf: for network performance measurement, to simulate normal traffic flow
- Distributed Internet Traffic Generator (D-ITG): to generate Telnet and Domain Name Server (DNS) traffic
The project can be divided into 4 major sections: proposing a network topology, data gathering, data processing, and training / analysis of ML and DL models. The details and files for implementing each section can be found in their respective subdirectories:
- Network Topology
- Creating the Topology in VirtualBox
- Configuring the Ubuntu VM and Mininet
- Connecting a Mininet host to the internet
- Connecting the Kali and Metasploitable VMs - Linux Routing through the Ubuntu VM
- Configuring the Kali and Metasploitable VMs
- Dataset Collection
- Ryu Controller
- Attack traffic generation (11 unique attacks)
- Distributed Denial of Service (DDoS)
- Probe Attack
- Web Attacks
- Remote-to-Local (R2L)
- User-to-Root (U2R)
- Normal traffic generation (5 unique types)
- iPerf
- Internet
- Ping
- Telnet
- DNS
- Collected Dataset
- Data Processing
- Model Training & Analysis
- ML models comparison
- DL model performance evaluation
A total of roughly 291k attack flows (spread over 11 classes) and 122k normal flows (spread over 5 classes) were collected as CSV files through the ofctl_rest API. This was done by querying the API endpoints every second while the respective type of traffic was being generated.
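The collection loop can be sketched as below, assuming ofctl_rest's default address of 127.0.0.1:8080 and a switch with datapath ID 1; the field names byte_count, packet_count and duration_sec are taken from ofctl_rest's /stats/flow reply format, while the output column names follow the feature table in this README.

```python
import json
import time
from urllib.request import urlopen

# Assumed defaults: ofctl_rest listens on 127.0.0.1:8080, monitored switch has dpid 1.
BASE = "http://127.0.0.1:8080"
DPID = "1"

def flow_rows(reply: dict, dpid: str) -> list:
    """Flatten one /stats/flow/<dpid> reply into per-flow feature dicts."""
    return [
        {
            "ip_bytes": flow["byte_count"],
            "ip_packet": flow["packet_count"],
            "ip_duration": flow["duration_sec"],
        }
        for flow in reply.get(dpid, [])
    ]

def poll(seconds: int):
    """Query the flow-stats endpoint once per second (needs a live controller)."""
    for _ in range(seconds):
        reply = json.load(urlopen(f"{BASE}/stats/flow/{DPID}"))
        rows = flow_rows(reply, DPID)  # append these rows to the CSV file
        time.sleep(1)
```

The port and table statistics endpoints (/stats/port/<dpid>, /stats/table/<dpid>) are polled the same way to fill in the remaining features.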
[Tables: per-class attack flow counts and normal flow counts]
The data comprises 27 features collected from 3 of ofctl_rest's endpoints.
The 27 collected features
No. | Feature name |
---|---|
1 | src |
2 | dst |
3 | table_id |
4 | ip_bytes |
5 | ip_packet |
6 | ip_duration |
7 | in_port |
8 | dl_dst |
9 | port_bytes |
10 | port_packet |
11 | port_flow_count |
12 | table_active_count |
13 | table_lookup_count |
14 | table_matched_count |
15 | port_rx_packets |
16 | port_tx_packets |
17 | port_rx_bytes |
18 | port_tx_bytes |
19 | port_rx_dropped |
20 | port_tx_dropped |
21 | port_rx_errors |
22 | port_tx_errors |
23 | port_rx_frame_err |
24 | port_rx_over_err |
25 | port_rx_crc_err |
26 | port_collisions |
27 | port_duration_sec |
Identifying features and zero-variance features were removed.
The removed features
- Identifying features
- src
- dst
- table_id
- in_port
- dl_dst
- Zero variance features
- port_rx_dropped
- port_tx_dropped
- port_rx_errors
- port_tx_errors
- port_rx_frame_err
- port_rx_over_err
- port_rx_crc_err
- port_collisions
Laplacian correction (adding 1 to all values) was applied before the division transformations, to avoid division by zero.
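As a small pandas sketch of the correction (column names follow the feature tables in this section):

```python
import pandas as pd

def laplace_ratio(df: pd.DataFrame, num: str, den: str) -> pd.Series:
    """Divide `num` by `den` after adding 1 to both (Laplacian correction),
    so a zero denominator (e.g. duration_sec == 0) never causes a ZeroDivisionError."""
    return (df[num] + 1) / (df[den] + 1)

df = pd.DataFrame({"ip_bytes": [0, 1500], "ip_duration": [0, 3]})
df["ip_bytes_sec"] = laplace_ratio(df, "ip_bytes", "ip_duration")
# First row evaluates to (0 + 1) / (0 + 1) = 1.0 instead of 0/0.
```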
To address the limitations of individual feature selection methods, three methods were employed and the intersection of their results was used to identify key variables. The three methods applied were:
- FDR[3]: controls the expected proportion of false discoveries (features wrongly retained) across multiple significance tests
- Stepwise Selection[8]: an iterative process that adds the most important features to an initially empty set while removing the worst-performing ones
- Boruta[6]: iteratively removes features that are less statistically significant than randomly permuted "shadow" copies of the features
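A rough sketch of the intersection approach using scikit-learn's SelectFdr and SequentialFeatureSelector as stand-ins for two of the three methods (Boruta would contribute a third mask via the third-party boruta package); the synthetic data and the estimator choice are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the 13 transformed flow features (illustrative only).
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)
names = np.array([f"f{i}" for i in range(13)])

# Method 1: FDR-controlled univariate test.
fdr = SelectFdr(f_classif, alpha=0.05).fit(X, y)
# Method 2: forward stepwise selection (LogisticRegression used for speed here).
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=200),
                                n_features_to_select=5).fit(X, y)
# Method 3 (Boruta) would yield a third boolean mask the same way.

# Keep only the features every method agrees on.
selected = set(names[fdr.get_support()]) & set(names[sfs.get_support()])
```

Taking the intersection trades recall for precision: a feature survives only if every method independently considers it informative.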
The 13 transformed features are given below; each ratio is computed on the +1-corrected values, and the final 5 features used for training were selected by the intersection of the above feature selection methods' results.
No | Feature Name | Equation |
---|---|---|
1 | ip_bytes_sec | ip_bytes / ip_duration |
2 | ip_packets_sec | ip_packet / ip_duration |
3 | ip_bytes_packet | ip_bytes / ip_packet |
4 | port_bytes_sec | port_bytes / port_duration_sec |
5 | port_packet_sec | port_packet / port_duration_sec |
6 | port_byte_packet | port_bytes / port_packet |
7 | port_flow_count_sec | port_flow_count / port_duration_sec |
8 | table_matched_lookup | table_matched_count / table_lookup_count |
9 | table_active_lookup | table_active_count / table_lookup_count |
10 | port_rx_packets_sec | port_rx_packets / port_duration_sec |
11 | port_tx_packets_sec | port_tx_packets / port_duration_sec |
12 | port_rx_bytes_sec | port_rx_bytes / port_duration_sec |
13 | port_tx_bytes_sec | port_tx_bytes / port_duration_sec |
RobustScaler[9] (RS) was then applied to handle outliers in the dataset, ensuring that the model is not overly affected by extreme values.
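A small illustration of why RobustScaler suits this data: it centres each feature on the median and scales by the interquartile range, so a single extreme flow (e.g. a DDoS burst) barely shifts the scaling of the other samples.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an extreme outlier in the last sample.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# (x - median) / IQR; here median = 3, IQR = 4 - 2 = 2.
X_rs = RobustScaler().fit_transform(X)
# The four ordinary samples land in [-1, 0.5]; the outlier stays far out
# but does not distort their spread, unlike mean/std-based StandardScaler.
```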
We selected an ML model by comprehensively comparing 6 unique models across multiple evaluation metrics, prior to performing any dimensionality reduction on the dataset. The models compared were:
- K-Nearest Neighbors (KNN): a simple, instance-based learning algorithm;
- Support Vector Machine (SVM): a powerful classifier that works by finding the optimal hyperplane for separation;
- Logistic Regression (LR): a probabilistic model used for binary classification;
- Decision Tree (DT): a model that splits data into homogenous subsets based on feature values;
- Naive Bayes (NB): a probabilistic classifier based on Bayes' theorem with strong independence assumptions;
- Random Forest (RF): an ensemble of decision trees that improves accuracy by averaging multiple models.
RF was chosen over DT as RF demonstrates improved robustness and generalization by combining multiple decision trees, reducing the risk of overfitting. In our comparison, RF achieved 100% accuracy across all attack classes, making it highly reliable for our IDS.
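The comparison can be sketched with scikit-learn as below; the synthetic data stands in for our 5 selected features, and only cross-validated accuracy is shown for brevity (our evaluation also used precision, recall and F1).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 5 selected flow features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=0),
}
# Scale inside the pipeline so CV folds never leak statistics.
scores = {name: cross_val_score(make_pipeline(RobustScaler(), m), X, y, cv=5).mean()
          for name, m in models.items()}
```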
The PCA, LDA and ICA dimensionality reduction techniques were then compared using the RF model.
The performance with LDA, the best of all three, was slightly lower than using the dataset without dimensionality reduction. Given that this performance difference is negligible, we opted to use LDA for dimensionality reduction, as it allows us to reduce the feature space from 5 to 4 features.
The final RF model pipeline, with manual division transformation, feature selection through intersection, scaling and dimensionality reduction, gave a final performance of 99.99% across all metrics, with marginal variance.
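The final pipeline can be sketched as below; the synthetic data is a stand-in with 5 features and 16 classes mirroring our 11 attack + 5 normal classes, and the manual division transformation is assumed to have been applied upstream.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in for the 5 selected features across 16 traffic classes.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=16, n_clusters_per_class=1,
                           random_state=0)

pipe = Pipeline([
    ("scale", RobustScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=4)),  # 5 -> 4 dimensions
    ("rf", RandomForestClassifier(random_state=0)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
```

Keeping scaling and LDA inside a single Pipeline guarantees the same transforms are replayed at prediction time inside the controller.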
To achieve similar results without an extensive pre-processing pipeline, a feed-forward neural network (FFNN) DL model was evaluated. Running the model directly on the dataset's transformed features achieved an accuracy of ~98.7%, with the other metrics in a similar range. Performance dropped significantly when the dataset was reduced through feature selection and dimensionality reduction, suggesting that this approach is more suitable for larger datasets.
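As a rough sketch of the feed-forward approach (our actual architecture and framework may differ; scikit-learn's MLPClassifier and a 64/32-unit layout stand in here), trained directly on 13 stand-in features without feature selection or dimensionality reduction:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 13 transformed features.
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers; scaling still helps gradient-based training converge.
clf = make_pipeline(RobustScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                  random_state=0))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```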
- Implementation of a functional IDS and Intrusion Prevention System (IPS)
Note
A legacy (partially working) IDS is provided in the Ryu Controller subdirectory. It utilizes an RF model trained with an RS + PCA preprocessing pipeline; predictions are hit-or-miss.
- Test out more DL models
- Connecting Mininet hosts to VM's network interface for access to the internet : https://gist.github.com/shreyakupadhyay/84dc75607ec1078aca3129c8958f3683
Note
If you get 'unknown host' errors when trying to access websites, the DNS nameserver will have to be configured. This can be done through the following command within the xterm window of the host under consideration: `echo 'nameserver 8.8.8.8' | tee /etc/resolv.conf`
- Ryu's ofctl_rest API documentation : https://ryu.readthedocs.io/en/latest/app/ofctl_rest.html#ryu-app-ofctl-rest
- Classical attacks of Scapy : https://scapy.readthedocs.io/en/latest/usage.html#classical-attacks
[1] Alhilo, A. M. J., & Koyuncu, H. (2024). Enhancing SDN Anomaly Detection: A Hybrid Deep Learning Model with SCA-TSO Optimization. International Journal of Advanced Computer Science and Applications (IJACSA), 15(5).
[2] Alzahrani, A. O., & Alenazi, M. J. F. (2023). ML-IDSDN: Machine learning based intrusion detection system for software-defined network. Concurrency and Computation: Practice and Experience, 35(1), 1–12.
[3] Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
[4] Erfan, A. (2022). DDoS attack detection scheme using hybrid ensemble learning and GA algorithm for Internet of Things. PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(18), 521–546.
[5] Halman, L., & Alenazi, M. (2023). MCAD: A Machine Learning Based Cyberattacks Detector in Software-Defined Networking (SDN) for Healthcare Systems. IEEE Access, 1–1.
[6] Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36, 1–13.
[7] Maddu, M., & Rao, Y. N. (2023). Network intrusion detection and mitigation in SDN using deep learning models. International Journal of Information Security, 1–14.
[8] Naser, M. (2021). Mapping functions: A physics-guided, data-driven and algorithm-agnostic machine learning approach to discover causal and descriptive expressions of engineering phenomena. Measurement, 185, 110098.
[9] Reddy, K. V. A., Ambati, S. R., Reddy, Y. S. R., & Reddy, A. N. (2021). AdaBoost for Parkinson’s disease detection using robust scaler and SFS from acoustic features. In Proceedings of the Smart Technologies, Communication and Robotics (STCR), 1–6.