
SDN Emulation and Development of Dataset for ML-Based Intrusion Detection

Project work done for the IP1302 course of our 5th Semester.

Reference Paper: MCAD: A Machine Learning Based Cyberattacks Detector in Software-Defined Networking (SDN) for Healthcare Systems [5]

Team Members

Table of Contents

Aim

Our overall goal was to explore the creation of complex Software Defined Network (SDN) topologies using the Ryu SDN Framework, Mininet and VirtualBox Virtual Machines (VMs), and to collect flow statistics through Ryu's built-in REST Application Programming Interface (API), ofctl_rest. Using this setup, we simulated various attack and normal traffic flows within the emulated network to build a dataset, which was then used to train and compare Machine Learning (ML) models for an ML-based Intrusion Detection System (IDS).

In short: To strengthen security in SDNs by creating a dataset and comparing ML models for an ML-based IDS present within the controller.


Tools / Libraries Used

Topology Creation

Attack Traffic Generation

  • Scapy: Python packet-manipulation library, used to craft and send attack traffic (see the sketch below)
  • Nmap: network discovery tool, used to simulate a probe attack
  • Selenium: browser automation tool, used to simulate web attacks
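
As an illustration of how such tools are typically scripted, below is a minimal Scapy sketch of a TCP SYN flood; the target address, port and packet count are placeholders rather than the project's actual values.

```python
# Hypothetical Scapy sketch of a TCP SYN flood (one of the DDoS-style attacks).
# TARGET_IP, TARGET_PORT and the packet count are placeholders, not the values
# used in the project. Sending raw packets requires root privileges.
from scapy.all import IP, TCP, RandShort, send

TARGET_IP = "10.0.0.2"   # placeholder victim host inside the emulated network
TARGET_PORT = 80         # placeholder service port

def syn_flood(count: int = 1000) -> None:
    """Send `count` TCP SYN packets with randomised source ports."""
    pkt = IP(dst=TARGET_IP) / TCP(sport=RandShort(), dport=TARGET_PORT, flags="S")
    send(pkt, count=count, verbose=False)

if __name__ == "__main__":
    syn_flood()
```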

Normal Traffic Generation

Proposed Model

*(Figure: proposed model)*

Implementation

The project can be divided into 4 major sections: proposing a network topology, data gathering, data processing and training / analysis of ML / DL models. The details and files for implementing each section can be found in their respective subdirectories:

  1. Network Topology
    • Creating the Topology in VirtualBox
    • Configuring the Ubuntu VM and Mininet
    • Connecting a Mininet host to the internet
    • Connecting the Kali and Metasploitable VMs - Linux Routing through the Ubuntu VM
    • Configuring the Kali and Metasploitable VMs
  2. Dataset Collection
    • Ryu Controller
    • Attack traffic generation (11 unique attacks)
      • Distributed Denial of Service (DDoS)
      • Probe Attack
      • Web Attacks
      • Remote-to-Local (R2L)
      • User-to-Root (U2R)
    • Normal traffic generation (5 unique types)
      • iPerf
      • Internet
      • Ping
      • Telnet
      • DNS
    • Collected Dataset
  3. Data Processing
    • Preprocessing
      • Cleansing / Shuffling
      • Division Transformation
    • Feature Selection
      • Benjamini–Hochberg False Discovery Rate (FDR)[3]
      • Stepwise Selection[8]
      • Boruta[6]
    • Scaling
    • Dimensionality Reduction
      • Principal Component Analysis (PCA)
      • Linear Discriminant Analysis (LDA)
      • Independent Component Analysis (ICA)
  4. Model Training & Analysis
    • ML models comparison
    • DL model performance evaluation

Results

Collected Dataset

A total of roughly 291k attack flows (spread over 11 classes) and 122k normal flows (spread over 5 classes) were collected as CSV files through the ofctl_rest API. This was done by polling the API endpoints every second while the respective type of traffic was being generated.

*(Figures: attack flow counts and normal flow counts, per class)*

The dataset comprises 27 features collected from 3 of ofctl_rest's endpoints.
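
As a rough illustration of the collection loop, a minimal polling sketch is given below. It assumes the flow, port and table statistics endpoints of ofctl_rest and the controller's default REST settings; the datapath ID and output file are placeholders.

```python
# Hypothetical polling loop for Ryu's ofctl_rest API. ofctl_rest exposes flow,
# port and table statistics over REST, e.g.:
#   GET /stats/flow/<dpid>, GET /stats/port/<dpid>, GET /stats/table/<dpid>
# The controller address, datapath ID and output file are assumptions.
import json
import time

import requests

CONTROLLER = "http://127.0.0.1:8080"  # default ofctl_rest listen address
DPID = 1                               # placeholder datapath ID of the switch

def poll_once() -> dict:
    """Fetch one snapshot of flow, port and table statistics."""
    return {
        "flow": requests.get(f"{CONTROLLER}/stats/flow/{DPID}").json(),
        "port": requests.get(f"{CONTROLLER}/stats/port/{DPID}").json(),
        "table": requests.get(f"{CONTROLLER}/stats/table/{DPID}").json(),
    }

if __name__ == "__main__":
    with open("stats_log.jsonl", "a") as fh:
        while True:
            fh.write(json.dumps(poll_once()) + "\n")
            time.sleep(1)  # poll once per second, as in the collection setup
```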

The 27 collected features

| No. | Feature name |
| --- | --- |
| 1 | src |
| 2 | dst |
| 3 | table_id |
| 4 | ip_bytes |
| 5 | ip_packet |
| 6 | ip_duration |
| 7 | in_port |
| 8 | dl_dst |
| 9 | port_bytes |
| 10 | port_packet |
| 11 | port_flow_count |
| 12 | table_active_count |
| 13 | table_lookup_count |
| 14 | table_matched_count |
| 15 | port_rx_packets |
| 16 | port_tx_packets |
| 17 | port_rx_bytes |
| 18 | port_tx_bytes |
| 19 | port_rx_dropped |
| 20 | port_tx_dropped |
| 21 | port_rx_errors |
| 22 | port_tx_errors |
| 23 | port_rx_frame_err |
| 24 | port_rx_over_err |
| 25 | port_rx_crc_err |
| 26 | port_collisions |
| 27 | port_duration_sec |

Data Processing

Identifying features and zero-variance features were removed.

The removed features
  • Identifying features
    • src
    • dst
    • table_id
    • in_port
    • dl_dst
  • Zero variance features
    • port_rx_dropped
    • port_tx_dropped
    • port_rx_errors
    • port_tx_errors
    • port_rx_frame_err
    • port_rx_over_err
    • port_rx_crc_err
    • port_collisions

Laplacian correction (adding 1 to all values) was applied before division transformation to handle division by zero.
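
A minimal pandas sketch of this step is shown below, assuming the raw statistics have been loaded into a DataFrame with the column names from the table above; the input file name is a placeholder and only a few of the 13 ratio features are shown.

```python
# Hypothetical sketch of the add-one (Laplace) correction followed by the
# division transformation. Column names follow the collected-feature table;
# the input file name is an assumption.
import pandas as pd

df = pd.read_csv("flows.csv")

# Add 1 to every numeric value so the later ratios never divide by zero.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols] + 1

# A few of the 13 derived ratio features (the rest follow the same pattern).
df["ip_bytes_sec"] = df["ip_bytes"] / df["ip_duration"]
df["ip_packets_sec"] = df["ip_packet"] / df["ip_duration"]
df["port_bytes_sec"] = df["port_bytes"] / df["ip_duration"]
df["table_matched_lookup"] = df["table_matched_count"] / df["table_lookup_count"]
```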

To address the limitations of individual feature selection methods, three methods were employed and the intersection of their results was used to identify the key variables (a sketch of this intersection step follows the list below). The three methods applied were:

  • FDR[3]: controls the expected proportion of falsely selected (false-discovery) features when many significance tests are performed at once
  • Stepwise Selection[8]: an iterative procedure that starts from an empty feature set, adds the most informative features and removes the worst-performing ones
  • Boruta[6]: iteratively removes features that are less statistically significant than their randomly permuted "shadow" copies
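
A hedged sketch of the intersection step is given below. It stands in statsmodels' Benjamini–Hochberg correction, scikit-learn's SequentialFeatureSelector and the BorutaPy package for the three methods, so the exact implementations may differ from those used in the project.

```python
# Hypothetical sketch: run three feature-selection methods and keep the
# intersection of the features they retain. Library choices (statsmodels,
# scikit-learn, BorutaPy) and file names are assumptions.
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector, f_classif
from statsmodels.stats.multitest import multipletests

X = pd.read_csv("transformed_features.csv")   # the 13 ratio features
y = pd.read_csv("labels.csv").squeeze()       # attack / normal class labels

# 1) Benjamini-Hochberg FDR on per-feature ANOVA p-values.
_, pvals = f_classif(X, y)
reject, *_ = multipletests(pvals, alpha=0.05, method="fdr_bh")
fdr_set = set(X.columns[reject])

# 2) Forward stepwise selection wrapped around a random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=5, direction="forward").fit(X, y)
stepwise_set = set(X.columns[sfs.get_support()])

# 3) Boruta, which compares features against shadow (permuted) copies.
boruta = BorutaPy(rf, n_estimators="auto", random_state=0).fit(X.values, y.values)
boruta_set = set(X.columns[boruta.support_])

selected = fdr_set & stepwise_set & boruta_set
print("Final features:", sorted(selected))
```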

The 13 transformed features are listed below; the final 5 are those selected by the intersection of the above feature selection methods:

| No. | Feature name | Equation |
| --- | --- | --- |
| 1 | ip_bytes_sec | $\frac{\text{ip\_bytes}}{\text{ip\_duration}}$ |
| 2 | ip_packets_sec | $\frac{\text{ip\_packet}}{\text{ip\_duration}}$ |
| 3 | ip_bytes_packet | $\frac{\text{ip\_bytes}}{\text{ip\_packet}}$ |
| 4 | port_bytes_sec | $\frac{\text{port\_bytes}}{\text{ip\_duration}}$ |
| 5 | port_packet_sec | $\frac{\text{port\_packet}}{\text{ip\_duration}}$ |
| 6 | port_byte_packet | $\frac{\text{port\_bytes}}{\text{port\_packet}}$ |
| 7 | port_flow_count_sec | $\frac{\text{port\_flow\_count}}{\text{ip\_duration}}$ |
| 8 | table_matched_lookup | $\frac{\text{table\_matched\_count}}{\text{table\_lookup\_count}}$ |
| 9 | table_active_lookup | $\frac{\text{table\_active\_count}}{\text{table\_lookup\_count}}$ |
| 10 | port_rx_packets_sec | $\frac{\text{port\_rx\_packets}}{\text{port\_duration\_sec}}$ |
| 11 | port_tx_packets_sec | $\frac{\text{port\_tx\_packets}}{\text{port\_duration\_sec}}$ |
| 12 | port_rx_bytes_sec | $\frac{\text{port\_rx\_bytes}}{\text{port\_duration\_sec}}$ |
| 13 | port_tx_bytes_sec | $\frac{\text{port\_tx\_bytes}}{\text{port\_duration\_sec}}$ |

RobustScaler[9] (RS) was then applied to handle outliers in the dataset, ensuring that the model is not overly affected by extreme values.

Model Training & Analysis

We selected an ML model by comprehensively comparing 6 unique models across multiple evaluation metrics, before applying dimensionality reduction to the dataset. The models compared were:

  • K-Nearest Neighbors (KNN): a simple, instance-based learning algorithm;
  • Support Vector Machine (SVM): a powerful classifier that works by finding the optimal hyperplane for separation;
  • Logistic Regression (LR): a probabilistic model used for binary classification;
  • Decision Tree (DT): a model that splits data into homogenous subsets based on feature values;
  • Naive Bayes (NB): a probabilistic classifier based on Bayes' theorem with strong independence assumptions;
  • Random Forest (RF): an ensemble of decision trees that improves accuracy by averaging multiple models.
Comparison of ML models


RF was chosen over DT as RF demonstrates improved robustness and generalization by combining multiple decision trees, reducing the risk of overfitting. RF achieves 100% accuracy across all attack classes, making it highly reliable for our IDS.
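
For reference, a minimal scikit-learn sketch of such a comparison is given below; the hyperparameters, cross-validation setup and scoring metric are assumptions rather than the project's exact configuration.

```python
# Hypothetical sketch of the 6-model comparison. File names, hyperparameters
# and the cross-validation / scoring choices are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = pd.read_csv("selected_features_scaled.csv")  # 5 selected, robust-scaled features
y = pd.read_csv("labels.csv").squeeze()

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
```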

The PCA, LDA and ICA dimensionality reduction techniques were then compared using the RF model.
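
A sketch of this comparison, again with assumed component counts and cross-validation settings, could look as follows:

```python
# Hypothetical comparison of PCA, LDA and ICA in front of the RF classifier.
# The component count (4) and CV settings are assumptions.
import pandas as pd
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X = pd.read_csv("selected_features.csv")  # the 5 features kept after intersection
y = pd.read_csv("labels.csv").squeeze()

reducers = {
    "PCA": PCA(n_components=4),
    "LDA": LinearDiscriminantAnalysis(n_components=4),
    "ICA": FastICA(n_components=4, random_state=0),
}

for name, reducer in reducers.items():
    pipe = Pipeline([
        ("scale", RobustScaler()),
        ("reduce", reducer),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f}")
```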

Comparison of Dimensionality Reduction techniques


The performance with LDA, the best of the three techniques, was slightly lower than that obtained without dimensionality reduction. Since this difference is negligible, we opted to use LDA, as it reduces the feature space from 5 to 4 features.

The final RF model pipeline, with manual division transformation, feature selection through intersection, scaling and dimensionality reduction, gave a final performance of 99.99% across all metrics, with marginal variance.
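
A sketch of what this final pipeline could look like in scikit-learn is given below; the train/test split and hyperparameters are assumptions, and the division transformation and feature-selection intersection are assumed to have been applied upstream (see the earlier sketches).

```python
# Hypothetical sketch of the final pipeline: RobustScaler -> LDA -> Random Forest.
# File names, split parameters and hyperparameters are assumptions.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X = pd.read_csv("selected_features.csv")  # the 5 features kept after intersection
y = pd.read_csv("labels.csv").squeeze()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", RobustScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=4)),  # 5 -> 4 features
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```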

To achieve similar results without an extensive pre-processing pipeline, a Feed-Forward Neural Network (FFNN) DL model was evaluated. Running the model directly on the dataset's transformed features achieved an accuracy of ~98.7%, with the other metrics in a similar range. Performance dropped significantly when the dataset was reduced through feature selection and dimensionality reduction, suggesting that this approach is more suitable for larger datasets.
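
For illustration, a minimal Keras FFNN sketch is given below; the layer sizes, activations and training settings are assumptions rather than the architecture actually evaluated.

```python
# Hypothetical FFNN sketch in Keras. Layer sizes, activations, optimizer and
# training settings are assumptions, not the project's actual architecture.
import pandas as pd
import tensorflow as tf

X = pd.read_csv("transformed_features.csv")  # all 13 ratio features
y = pd.read_csv("labels.csv").squeeze()      # integer-encoded class labels
n_classes = y.nunique()

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=256, validation_split=0.2)
```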

Effect of Dimensionality Reduction techniques on FFNN


Future Work

  • Implementation of a functional IDS and Intrusion Prevention System (IPS)

Note

A legacy (partially working) IDS is provided in the Ryu Controller subdirectory. It uses an RF model trained with an RS + PCA preprocessing pipeline; its predictions are hit-or-miss.

  • Test out more DL models

Helpful Links

Note

If you get 'unknown host' errors when trying to access websites, the DNS nameserver will need to be configured. This can be done with the following command in the xterm window of the host in question: `echo 'nameserver 8.8.8.8' | tee /etc/resolv.conf`

References

[1] Alhilo, A. M. J., & Koyuncu, H. (2024). Enhancing SDN Anomaly Detection: A Hybrid Deep Learning Model with SCA-TSO Optimization. International Journal of Advanced Computer Science and Applications (IJACSA), 15(5).

[2] Alzahrani, A. O., & Alenazi, M. J. F. (2023). ML-IDSDN: Machine learning based intrusion detection system for software-defined network. Concurrency and Computation: Practice and Experience, 35(1), 1–12.

[3] Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.

[4] Erfan, A. (2022). DDoS attack detection scheme using hybrid ensemble learning and GA algorithm for Internet of Things. PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(18), 521–546.

[5] Halman, L., & Alenazi, M. (2023). MCAD: A Machine Learning Based Cyberattacks Detector in Software-Defined Networking (SDN) for Healthcare Systems. IEEE Access, 1–1.

[6] Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36, 1–13.

[7] Maddu, M., & Rao, Y. N. (2023). Network intrusion detection and mitigation in SDN using deep learning models. International Journal of Information Security, 1–14.

[8] Naser, M. (2021). Mapping functions: A physics-guided, data-driven and algorithm-agnostic machine learning approach to discover causal and descriptive expressions of engineering phenomena. Measurement, 185, 110098.

[9] Reddy, K. V. A., Ambati, S. R., Reddy, Y. S. R., & Reddy, A. N. (2021). AdaBoost for Parkinson’s disease detection using robust scaler and SFS from acoustic features. In Proceedings of the Smart Technologies, Communication and Robotics (STCR), 1–6.