This project was a collaboration between Hays Kronke, Emily Neaville, Bennett Northcutt, and Stephen Mims
Can we predict customer churn for the bank's credit card customers in order to reduce the rate of attrition?
Our group will use the dataset to build machine learning models that can accurately predict which bank customers are at risk of attrition. This dataset contains customer information ranging from demographics (age, gender, education) to financial data (income bracket, card history, credit limit). Using these features, we aim to build a model that bankers, sales managers, branch managers, and other decision makers can use to help the bank reduce customer churn.
- Data Loading and EDA
- Data preprocessing
- Fitting models and making predictions
- KNN, Logistic Regression, and Random Forest results
- Adjusted weights and oversampled Logistic Regression models
- Feature selection
- Optimized models using feature importance
Libraries used: pandas, sqlite3, seaborn, matplotlib, numpy, scipy, scikit-learn
After reading in and cleaning the data, we conducted some exploratory analysis. The main takeaway for building our machine learning models was the imbalanced classes: there were many more instances of existing customers than attrited customers, as shown in this visualization.
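As a minimal sketch of this check (the file name `BankChurners.csv` and DataFrame name `df` are hypothetical; the data could equally be loaded via sqlite3), the class balance can be inspected with `value_counts` and a seaborn count plot:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; the data could also be loaded from a SQLite database
df = pd.read_csv("BankChurners.csv")

# Proportion of existing vs. attrited customers
print(df["Attrition_Flag"].value_counts(normalize=True))

# Visualize the class imbalance
sns.countplot(x="Attrition_Flag", data=df)
plt.title("Existing vs. Attrited Customers")
plt.show()
```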
- Utilized scipy to remove outliers using z-scores
- Encoded both the target variable (Attrition_Flag) and the categorical features
- Split testing and training data
- Scaled the data (a sketch of these preprocessing steps follows below)
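A minimal sketch of these preprocessing steps, continuing from the hypothetical `df` above. The 3-standard-deviation cutoff, the label value `"Attrited Customer"`, the 75/25 split, and the use of one-hot encoding with `StandardScaler` are assumptions for illustration, not necessarily the project's exact choices:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop rows whose numeric features lie more than 3 standard deviations from the mean
numeric_cols = df.select_dtypes(include=np.number).columns
df = df[(np.abs(stats.zscore(df[numeric_cols])) < 3).all(axis=1)]

# Encode the target (1 = attrited, 0 = existing) and one-hot encode categorical features
df["Attrition_Flag"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
df = pd.get_dummies(df, drop_first=True)

# Split into features/target, then into training and testing sets
X = df.drop(columns="Attrition_Flag")
y = df["Attrition_Flag"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale the features, fitting the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```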
We built a KNN model, a random forest model, and a logistic regression model. The confusion matrices for all three models can be seen below.
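Sketched below, assuming the scaled training and test splits from the preprocessing sketch above and mostly default hyperparameters, the three baseline models can be fit and their confusion matrices printed:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Baseline models with mostly default hyperparameters (an assumption)
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model and print its confusion matrix on the test set
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    print(f"{name} confusion matrix:\n{confusion_matrix(y_test, preds)}\n")
```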
Creating the random forest model allowed us to identify the most important features. After visualizing the feature importances, we created a new DataFrame that dropped some of the less important features and retrained our models to improve performance.
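A minimal sketch of the feature-importance step, reusing the fitted random forest from the sketch above (the 0.01 importance threshold is an assumption used only for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rank features by the random forest's impurity-based importances
rf = models["Random Forest"]
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()

# Visualize which features contribute most to the model
importances.plot(kind="barh", figsize=(8, 10))
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()

# Keep only the more informative features (threshold chosen for illustration)
selected_features = importances[importances > 0.01].index
X_selected = X[selected_features]
```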
After retraining the K-nearest neighbors, logistic regression, and random forest models with the selected features, the random forest model still performed best. Feature selection slightly increased accuracy (by about 1%) and improved recall from 84% to 87%. There was a slight decrease in precision, but that was a trade-off we were willing to accept for the improved performance. The confusion matrix of the optimized model of choice can be seen below.
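As a sketch of how the retrained model's accuracy, precision, and recall can be compared (reusing the hypothetical `X_selected` and `y` from the sketches above; the split and hyperparameters are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Re-split and re-scale using only the selected features
X_train_sel, X_test_sel, y_train_sel, y_test_sel = train_test_split(
    X_selected, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_sel = scaler.fit_transform(X_train_sel)
X_test_sel = scaler.transform(X_test_sel)

# Retrain the random forest on the reduced feature set and report precision/recall
rf_optimized = RandomForestClassifier(random_state=42)
rf_optimized.fit(X_train_sel, y_train_sel)
print(classification_report(y_test_sel, rf_optimized.predict(X_test_sel)))
```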
Feature Selection Techniques in Machine Learning with Python
Baseline Models: Your Guide For Model Building
SciPy z-score documentation
Improve Model Performance using Feature Importance
Random Oversampling and Undersampling for Imbalanced Classification
How to improve logistic regression in imbalanced data with class weights