Vehicle Loan Default Prediction

This Capstone project was done as a part of my data science course.

Business Problem

Financial institutions incur significant losses due to the default of vehicle loans. This has led to the tightening up of vehicle loan underwriting and increased vehicle loan rejection rates. This warrants a study to estimate the determinants of vehicle loan default. The need for a better credit risk scoring model is also raised by these institutions.
Analyzing this will help the financial institutions to minimize the loan default rates, which helps us gain a customer base by providing the loans to correct persons.

Project Overview

The objective of the problem is to accurately predict the probability of loanee/borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Instalments) on the due date.
Performed EDA to understand the relation of target variable with the other features.
Statistical Analysis techniques like ANOVA for numerical and Chi-square for the categorical variables were performed to find the significance of the features with respect to the target.
Base Models were built in Logistic Regression, Random Forest, KNN and LightGBM with Kfold cross-validation.
Created new features like age at the time of disbursement, disbursement month, etc. The dataset represents Vehicle Loan Default of L&T financial institution.

Dataset Description

The dataset represents Vehicle Loan Default of L&T financial institution. There are 2,33,134 records and 42 columns Following Information regarding the loan and loanee are provided in the datasets:

Loanee Information (Demographic data like age, Identity proof etc.)
Loan Information (Disbursal details, loan to value ratio etc.)
Bureau data & history (Bureau score, number of active accounts, the status of other loans, credit history etc.)

EDA

Univariate, Bivariate and Multivariate Analysis were performed to bring out important aspects of data into focus for further analysis. Some of the highlights from EDA are listed below.

Inference

The count of non defaulters is more than defaulters where 0 (78.29%) is a non defaulter and 1 (21.71%) is a defaulter in the Loan Default countplot.
From the distribution plot, it is clear that as the LTV increases the persons probability of defaulting the loan also increases. Therefore higher the LTV higher are the chances of a person defaulting the loan.

Inference

As the disbursed amount increases the asset cost too increases and there are no defaulters as the disbursed amount increases.

Inference

As the risk increases the score decreases. The person who has a higher score might tend to be a non defaulter while the person having a low score might be a defaulter.

Statistical Data Analysis

ANOVA for Numerical Columns

Since our target variable is dichotomous and to find whether the feature is significant or not, we did a oneway-anova test to find out the significance of the feature.

All our variables have rejected the null hypothesis and the hypothesis is in favor of alternate hypothesis which states that the mean of two groups are not equal which will help in splitting our target variable.

ChiSquare for Categorical Columns

The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables.
From the test, we can see Pan flag, Unique_id, mobileflag and Sec.Overdue.Accts have failed the test and so they are insignificant variables to the target variable.

Evaluation Metric

ROC AUC-Score was chosen as the metric for the models.

Models

Here we are trying Linear, distance and tree-based models in the conviction which splits the target variables at its best. Since the metric of interest for the problem statement is AUC, from the below output we can conclude that tree based generally outperforms linear based models hence we would be using tree-based model for our further analysis.

Logistic Regression ROC AUC SCORE: 0.55
KNN ROC AUC SCORE: 0.52
Random Forest ROC AUC SCORE: 0.60
KNN Classifier ROC AUC SCORE: 0.55

Conclusion

In this project Vehicle loan defaulters in the first EMI for L&T have been determined. The best performing models were ensemble-based models.
The data seems to exactly mimic the real-life scenario which is very evident since there many zero values present which corresponds to first-time customers.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
Dataset		Dataset
Images		Images
BASE_MODEL_BUILDING.ipynb		BASE_MODEL_BUILDING.ipynb
Base_Models_Using_CV.ipynb		Base_Models_Using_CV.ipynb
Final_Model.ipynb		Final_Model.ipynb
L&T Basic Data Preprocessing EDA and Data Visualizations.ipynb		L&T Basic Data Preprocessing EDA and Data Visualizations.ipynb
LICENSE		LICENSE
README.md		README.md
Statistical_Modeling.ipynb		Statistical_Modeling.ipynb
project-basic-models-comparison.ipynb		project-basic-models-comparison.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vehicle Loan Default Prediction

Business Problem

Project Overview

Dataset Description

EDA

Inference

Inference

Inference

Statistical Data Analysis

ANOVA for Numerical Columns

ChiSquare for Categorical Columns

Evaluation Metric

Models

Conclusion

References

About

Releases

Packages

Languages

License

nanthasnk/Capstone-LT-Vehicle-Loan-Default-Prediction

Folders and files

Latest commit

History

Repository files navigation

Vehicle Loan Default Prediction

Business Problem

Project Overview

Dataset Description

EDA

Inference

Inference

Inference

Statistical Data Analysis

ANOVA for Numerical Columns

ChiSquare for Categorical Columns

Evaluation Metric

Models

Conclusion

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages