phishing-website-detection-content-based

This is an end-to-end machine learning project that classifies websites as phishing or legitimate. It focuses on content-based features, such as features derived from HTML tags. The repository covers data collection, feature extraction, and data preparation, as well as building and evaluating the ML models.

inputs

CSV files of phishing and legitimate URLs:
- verified_online.csv --> phishing website URLs from phishtank.org
- tranco_list.csv --> legitimate website URLs from tranco-list.eu
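
As a rough illustration of how the two input files could be read, here is a minimal sketch using only the standard library. It assumes that the PhishTank export has a header row with a `url` column and that the Tranco list is a headerless file of `rank,domain` rows; check both against your own downloads. The function names and the `http://` prefix are hypothetical choices for this example, not taken from the repository.

```python
import csv
import io


def read_phishing_urls(handle, url_column="url"):
    """Return URLs from a CSV with a header row (assumed PhishTank layout)."""
    reader = csv.DictReader(handle)
    return [row[url_column].strip() for row in reader if row.get(url_column)]


def read_legitimate_urls(handle, scheme="http://"):
    """Return URLs from headerless `rank,domain` rows (assumed Tranco layout)."""
    urls = []
    for row in csv.reader(handle):
        # Skip blank or malformed rows that lack a domain column.
        if len(row) < 2 or not row[1].strip():
            continue
        # The list holds bare domains, so a scheme is prepended for requests.
        urls.append(scheme + row[1].strip())
    return urls


if __name__ == "__main__":
    # In-memory samples stand in for verified_online.csv and tranco_list.csv.
    phishing_sample = io.StringIO("phish_id,url\n1,http://bad.example/login\n")
    legitimate_sample = io.StringIO("1,example.com\n2,example.org\n")
    print(read_phishing_urls(phishing_sample))
    print(read_legitimate_urls(legitimate_sample))
```

With real files, pass an opened file instead of the in-memory sample, for example `open("verified_online.csv", newline="", encoding="utf-8")`.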

general flow

Read the URLs from the CSV files.
Send a request to each URL and receive the response using Python's requests library.
Parse the response content with the BeautifulSoup module.
Extract the features and build a vector that holds one numerical value per feature.
Repeat the feature extraction for every website and collect the vectors in a structured dataframe.
Add a label column at the end of each dataframe: 1 for phishing, 0 for legitimate.
Save each dataframe as a CSV file; the structured_data files are then ready.
- See "structured_data_legitimate.csv" and "structured_data_phishing.csv".
Once you have the structured data, combine the two files and use them as training and test data.
You can split the data into training and test sets as in the first part of machine_learning.py, or use K-fold cross-validation as in the second part of the same file. We used K=5.
We then implemented five different ML models:
- Support Vector Machine
- Gaussian Naive Bayes
- Decision Tree
- Random Forest
- AdaBoost
For each model you can obtain the confusion matrix and the performance measures: accuracy, precision, and recall.
Finally, we visualized the performance measures for all models.
- Naive Bayes performed best in our case.
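
To make the evaluation steps above concrete, here is a small dependency-free sketch of a K-fold index split and of how accuracy, precision, and recall follow from the confusion matrix. It is not the code in machine_learning.py; the function names and the sample labels are hypothetical and chosen only for illustration. Phishing (label 1) is treated as the positive class.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the phishing label."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance_measures(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels and predictions, used only to exercise the functions.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    print(performance_measures(*confusion_counts(y_true, y_pred)))
    for train, test in k_fold_indices(len(y_true), k=5):
        print(test)
```

Because the combined data is ordered by label when the two structured_data files are concatenated, the rows should be shuffled before a contiguous split like this one is applied; otherwise some folds would contain only one class.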

important notes

The features are content-based and rely on the methods and fields of the BeautifulSoup module, so you need to install it first.
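
BeautifulSoup is published on PyPI as `beautifulsoup4` and imported as `bs4`. The sketch below shows how a few content-based features might be turned into a numerical vector. The five features here are hypothetical examples; they are not necessarily the ones defined in this repository's feature code.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><head><title>Sign in</title></head><body>
<form><input type="text" name="user"><input type="password" name="pw"></form>
<a href="/help">Help</a><a>no href</a><script>var x = 1;</script>
</body></html>
"""


def extract_features(html):
    """Return a list of numerical, tag-based features (illustrative only)."""
    soup = BeautifulSoup(html, "html.parser")
    has_title = 1 if soup.title and soup.title.get_text(strip=True) else 0
    has_password = 1 if soup.find("input", attrs={"type": "password"}) else 0
    return [
        has_title,                           # page has a non-empty <title>
        len(soup.find_all("input")),         # number of <input> tags
        has_password,                        # page has a password field
        len(soup.find_all("a", href=True)),  # number of links with an href
        len(soup.find_all("script")),        # number of <script> tags
    ]


if __name__ == "__main__":
    print(extract_features(SAMPLE_HTML))
```

Each position in the returned list corresponds to one column of the structured dataframe, so the order of the features has to stay the same for every website.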

dataset

With your own URL list, you can create your own dataset by using the data_collector.py file.
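
The collection loop can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of data_collector.py: the function names are hypothetical, failed or non-2xx responses are simply skipped, and the standard `csv` module is used in place of a pandas dataframe to keep the example small. The feature extractor is passed in as a parameter, so any function that maps HTML text to a list of numbers will work.

```python
import csv
import io

import requests


def fetch_html(url, timeout=5):
    """Return the response body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text


def build_rows(urls, extract, label, fetch=fetch_html):
    """Fetch each URL, extract its features, and append the class label."""
    rows = []
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # unreachable sites are left out of the dataset
        rows.append(list(extract(html)) + [label])
    return rows


def write_rows(handle, header, rows):
    """Write a header line and the feature rows as CSV."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


if __name__ == "__main__":
    # A fake fetcher and a toy extractor keep the demo offline.
    pages = {"http://a.example": "<input><input>", "http://b.example": None}
    rows = build_rows(pages, lambda html: [html.count("<input")], 1, fetch=pages.get)
    out = io.StringIO()
    write_rows(out, ["n_inputs", "label"], rows)
    print(out.getvalue())
```

Run it once with label 1 for the phishing URLs and once with label 0 for the legitimate URLs to produce the two structured files.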

