This project predicts whether an employee will leave their current job for a new company. For this, two models will be created using Azure ML:
- Using AutoML to get the best algorithm
- Using Logistic Regression and tuning its parameters using HyperDrive. The best model will then be deployed using Azure Container Instance, which can later be consumed via a REST API.
The project requires access to AzureML Studio.
Steps to be followed:
1. Using the dataset provided in this repository, create a new dataset in Azure ML studio in the default Blob Storage.
2. Create a new compute target.
3. Import the notebooks attached in this repository into the Notebooks section of Azure ML studio.
4. Run the AutoML and HyperDrive notebooks using the details given in the notebooks.
5. Run the endpoint.py file to consume the created endpoint and get back the predicted results.
The dataset is taken from Kaggle: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists.
The task is to predict whether the employee will leave their current job or not, based on the following factors:
- enrollee_id : Unique ID for candidate
- city: City code
- city_development_index: Development index of the city (scaled)
- gender: Gender of candidate
- relevent_experience: Relevant experience of candidate
- enrolled_university: Type of university course enrolled in, if any
- education_level: Education level of candidate
- major_discipline: Education major discipline of candidate
- experience: Candidate's total experience in years
- company_size: Number of employees in the current employer's company
- company_type: Type of current employer
- last_new_job: Difference in years between previous job and current job
- training_hours: Training hours completed
- target: 0 – Not looking for job change, 1 – Looking for a job change
The data can be accessed by downloading it to the local machine and then uploading it to the Datasets section of Azure ML studio, as sketched below.
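A minimal sketch of registering the downloaded CSV as an Azure ML tabular dataset (SDK v1); the local file name, target path, and dataset name here are assumptions, not the exact repository code.

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload the local CSV (file name assumed from the Kaggle download) to default Blob Storage.
datastore.upload_files(
    files=["./aug_train.csv"],
    target_path="hr-data/",
    overwrite=True,
)

# Create a TabularDataset from the uploaded file and register it under an assumed name.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "hr-data/aug_train.csv"))
dataset = dataset.register(workspace=ws, name="hr-job-change", create_new_version=True)
```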
Automated Machine Learning is the process of automating the time-consuming, iterative tasks of ML model development. It allows building models at scale with high efficiency and productivity while sustaining model quality. For a classification problem, many models such as XGBoost, Random Forest, Stack Ensemble, and Voting Ensemble are compared.
AutoML configuration used for this project (a configuration sketch follows this list):
- The task is binary classification, hence 'accuracy' is used as the primary metric.
- Cross-validation with 6 folds is chosen, as it gave better accuracy than 3 or 4 folds.
- Iterations are processed concurrently to speed up training.
- Early stopping is enabled to prevent overfitting.
- Experiment timeout is set to be 30 minutes.
- The featurization parameter is set to "auto" for automatic feature scaling and encoding.
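A minimal sketch of an `AutoMLConfig` reflecting the settings above (SDK v1); the dataset and compute-target names, and the concurrency value, are assumptions rather than the exact notebook code.

```python
from azureml.core import Workspace, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_ds = Dataset.get_by_name(ws, "hr-job-change")   # dataset name assumed (see above)
compute_target = ws.compute_targets["cpu-cluster"]    # compute name is an assumption

automl_config = AutoMLConfig(
    task="classification",            # binary classification task
    primary_metric="accuracy",        # metric used to rank candidate models
    training_data=train_ds,
    label_column_name="target",       # 0 = not looking, 1 = looking for a job change
    n_cross_validations=6,            # 6-fold cross-validation, as noted above
    max_concurrent_iterations=4,      # assumed value; concurrency speeds up training
    enable_early_stopping=True,       # stop poorly performing iterations early
    experiment_timeout_minutes=30,    # overall experiment budget
    featurization="auto",             # automatic feature scaling/encoding
    compute_target=compute_target,
)
```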
After comparing 37 algorithms, the best model obtained is Voting Ensemble with an accuracy of 80.17%.
The screenshot of the details of various algorithms is shown below:
Best run model ID and accuracy, along with other parameters:
The model can be improved by increasing the number of iterations or trying various cross-validation folds. Deep learning/neural network based classification can also be used for better results.
Since the problem involves binary classification, Logistic Regression has been used: it is simple to train and performs well compared to more complex algorithms. Two parameters are selected for tuning with HyperDrive: '--C' (inverse of regularization strength; smaller values cause stronger regularization) and '--max_iter' (maximum number of iterations to converge).
- The choices used for --C are (0.001, 0.01, 0.1, 1, 10, 100, 200), while those for --max_iter are (50, 100, 150, 200, 250, 300); a configuration sketch follows.
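A minimal sketch of the HyperDrive setup described above (SDK v1); the random sampling method, the Bandit early-termination policy, the training-script wrapper, and the logged metric name are assumptions, since they are not stated here.

```python
from azureml.core import Environment, ScriptRunConfig, Workspace
from azureml.train.hyperdrive import (
    BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling, choice,
)

ws = Workspace.from_config()

# Wrap the training script; script, compute, and environment names are assumptions.
script_run_config = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=ws.compute_targets["cpu-cluster"],
    environment=Environment.get(ws, name="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu"),
)

# Search space matching the choices listed above.
param_sampling = RandomParameterSampling({
    "--C": choice(0.001, 0.01, 0.1, 1, 10, 100, 200),
    "--max_iter": choice(50, 100, 150, 200, 250, 300),
})

hyperdrive_config = HyperDriveConfig(
    run_config=script_run_config,
    hyperparameter_sampling=param_sampling,
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),  # assumed policy
    primary_metric_name="Accuracy",                 # metric logged by train.py (assumed)
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                              # assumed run budget
    max_concurrent_runs=4,
)
```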
Here's a screenshot of the best results and optimized parameters obtained using HyperDrive (RunDetails widget):
The best model obtained with AutoML is chosen for deployment. Azure Container Instance (ACI) is used to deploy the model as a web service. The details of the deployment method can be found in automl.ipynb under the Model Deployment section.
The number of CPU cores and the memory for the web service have been set to 1 and 1 GB respectively; a deployment sketch is shown below.
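A minimal sketch of the ACI deployment described above (SDK v1); the registered model name, entry script, environment, and service name are assumptions, not the exact notebook code.

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="automl-best-model")   # registered model name is an assumption

# Entry script and environment are assumptions; in practice they come from the AutoML run.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.get(ws, name="AzureML-AutoML"),
)

# 1 CPU core and 1 GB of memory, as noted above.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(
    workspace=ws,
    name="hr-job-change-service",             # assumed service name
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
)
service.wait_for_deployment(show_output=True)
print(service.state)          # "Healthy" once deployment succeeds
print(service.scoring_uri)    # REST endpoint used for consumption below
```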
We can see the deployment state set to Healthy below, indicating that the model was deployed successfully. The model can now be consumed through its REST API by sending HTTP requests to it.
Now we can consume the endpoint using the scoring URL generated after deployment. A sample request with one input record is sketched below; the actual sample input can be found in the endpoint.py file of the repository.
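A minimal sketch of what endpoint.py does, assuming key-based authentication on the service; the scoring URI, key, and feature values below are placeholders, not real data.

```python
import json
import requests

scoring_uri = "<scoring-uri-from-deployment>"   # placeholder
key = "<primary-key>"                           # placeholder; only needed if auth is enabled

# One sample record using the dataset's feature columns (values are illustrative).
data = {"data": [{
    "enrollee_id": 8949,
    "city": "city_103",
    "city_development_index": 0.92,
    "gender": "Male",
    "relevent_experience": "Has relevent experience",
    "enrolled_university": "no_enrollment",
    "education_level": "Graduate",
    "major_discipline": "STEM",
    "experience": ">20",
    "company_size": "50-99",
    "company_type": "Pvt Ltd",
    "last_new_job": "1",
    "training_hours": 36,
}]}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
response = requests.post(scoring_uri, data=json.dumps(data), headers=headers)
print(response.json())   # e.g. [1] -> looking for a job change
```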
Here's a link to a screencast demonstrating consumption of the deployed model: https://1drv.ms/u/s!Avt8pJRrCCqEhmNaZOPPxJfcpQlh?e=8Y7BYf