Project Summary

This repository compares the performance of multiple machine learning (ML) algorithms when the data distribution is highly imbalanced, with one overwhelming response category. The dataset is randomly divided into two parts: a training set and a test set. I then fit each model on the training set, apply it to the test set, and record the misclassification error.
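The split-fit-evaluate workflow described above can be sketched in base R. This is only an illustration under stated assumptions: the data are synthetic, the 80/20 split ratio and the seed are arbitrary choices, and logistic regression (`glm`) stands in for whichever model is being evaluated. None of the numbers it prints are results from this project.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Synthetic imbalanced data (an assumption, not the project's dataset)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
# An intercept of -3 keeps the positive class rare
p  <- plogis(-3 + 1.5 * x1 - 1.0 * x2)
df <- data.frame(x1 = x1, x2 = x2, y = rbinom(n, size = 1, prob = p))

# Random 80/20 split into training and test sets (ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit on the training set only
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predict on the held-out test set and classify at a 0.5 threshold
prob_test <- predict(fit, newdata = test, type = "response")
pred_test <- as.integer(prob_test > 0.5)

# Misclassification error: share of test rows predicted incorrectly
test_error <- mean(pred_test != test$y)

# Baseline: always predict the majority class seen in training
majority_class <- as.integer(mean(train$y) > 0.5)
baseline_error <- mean(test$y != majority_class)

print(c(test_error = test_error, baseline_error = baseline_error))
```

Comparing against the majority-class baseline matters with imbalanced data, because a model that never predicts the rare class can still post a low misclassification error.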

Furthermore, I use ROC curves and AUC to compare performance and conclude that KNN, a non-parametric method, outperforms the others when the distribution is highly imbalanced.
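To make the ROC/AUC comparison concrete, here is a minimal base-R sketch. The helper names `auc` and `roc_points` are hypothetical, and the labels and scores are hand-made toy values, not output from the models in this repository. AUC is computed with the rank (Mann-Whitney) formulation, which equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```r
# AUC from predicted scores and 0/1 labels via the rank-sum formulation
auc <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# ROC curve points: true/false positive rate at each distinct threshold
roc_points <- function(scores, labels) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Toy imbalanced labels (3 positives out of 10) and two made-up score vectors
labels   <- c(0, 0, 1, 0, 1, 1, 0, 0, 0, 0)
scores_a <- c(0.1, 0.2, 0.9, 0.3, 0.8, 0.7, 0.4, 0.2, 0.1, 0.3)
scores_b <- c(0.5, 0.2, 0.6, 0.7, 0.4, 0.8, 0.1, 0.3, 0.2, 0.6)

auc_a <- auc(scores_a, labels)
auc_b <- auc(scores_b, labels)
roc_a <- roc_points(scores_a, labels)

print(c(model_a = auc_a, model_b = auc_b))
```

Unlike the raw misclassification error, AUC does not depend on a single classification threshold, which is one reason it is a common choice for comparing classifiers on rare-event data.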

For the full analysis, please refer to my Medium post, "A Pain in the Neck: Predict A Rare Event using 5 Machine Learning Methods": https://towardsdatascience.com/classifying-rare-events-using-five-machine-learning-techniques-fab464573233.

Installing

This project runs in R. Install the following packages before running the code: readr, knitr, dplyr, plyr, class, reshape2, tree, randomForest, car, and e1071.
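One way to install whatever is missing from that list is sketched below. The variable names and the choice of CRAN mirror are assumptions; any mirror should work.

```r
# Packages listed in this README
required <- c("readr", "knitr", "dplyr", "plyr", "class",
              "reshape2", "tree", "randomForest", "car", "e1071")

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(required, rownames(installed.packages()))

# Install the missing ones (the mirror URL is an assumption)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```

If both plyr and dplyr are attached in the same session, loading plyr before dplyr is the usual way to avoid plyr masking dplyr functions with the same names.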

What is the data?

The data come from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The task is to predict whether a client will subscribe to a term deposit. The data source is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing.
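After downloading the data, a quick check of the class balance shows how rare the positive outcome is. The sketch below assumes the file is semicolon-delimited and that the response column is named `y` with values "yes" and "no"; the helper name `class_balance` is hypothetical, and the demo runs on a tiny mock file so that it executes without the real download.

```r
# Share of each response category in a semicolon-delimited file
class_balance <- function(path, target = "y") {
  bank <- read.csv(path, sep = ";", stringsAsFactors = TRUE)
  stopifnot(target %in% names(bank))
  prop.table(table(bank[[target]]))
}

# Tiny mock file with made-up rows, only to show the call
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age;job;y",
  "30;admin.;no",
  "45;technician;no",
  "52;retired;yes",
  "38;services;no",
  "27;student;no"
), demo_path)

balance <- class_balance(demo_path)
print(balance)
```

To use it on the real data, replace `demo_path` with the path to the downloaded file.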

About the Author

Leihua Ye is a Ph.D. Researcher at UC Santa Barbara. He has received extensive training in Causal Inference, Research Design, Machine Learning, and Big Data.

He received his B.A. and M.A. from the University of Nottingham.

Contact

Email: yeleihua@gmail.com

LinkedIn: www.linkedin.com/in/leihuaye

Tech Blog: https://leihua-ye.medium.com