Introduction

AutoML is a relatively new and fast-growing subset of machine learning. Its main approach is to limit the involvement of data scientists by letting the tool handle the time-consuming processes of machine learning, such as data preprocessing, algorithm selection, and hyperparameter tuning, thereby saving the time spent setting up ML models and speeding up their deployment. Several AutoML tools are available in the market these days.
In one of my previous blogathon articles, I shared a comprehensive guide to AutoML with an easy AutoGluon example, including a list of several AutoML tools currently available in the market. These tools can undoubtedly save a good amount of time, especially for large and complex datasets. In this article, we will explore one such tool called ‘Auto-Sklearn’.
What is Auto-Sklearn?
Anyone familiar with machine learning knows about scikit-learn, the famous Python package that provides a wide range of classification and regression algorithms for building machine learning models.
Auto-Sklearn is a Python-based, open-source toolkit for AutoML. It employs the well-known scikit-learn machine learning package for data processing and machine learning algorithms, and it uses a Bayesian optimization search technique to quickly find the best model pipeline for a given dataset. In this article, we’ll look at how to use Auto-Sklearn for classification and regression tasks.
Let us install the Auto-Sklearn package first.
pip install auto-sklearn
(If you are using Google Colab, ensure your SciPy version is up to date; otherwise, upgrade it with pip and restart the runtime.)
Now that we have installed the AutoML tool, we will import the basic packages for preprocessing the dataset and visualization.
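The exact imports depend on your workflow; a typical set for this tutorial might look like this:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
from sklearn.model_selection import train_test_split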
Classification Task

We will use the heart disease prediction dataset available on the UCI repository. For convenience, let us use the .csv version of this data from Kaggle. You can also use any classification dataset of your choice or import a toy dataset from the sklearn library.
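Assuming the Kaggle .csv file is saved locally as heart.csv (the filename here is an assumption), we can load it as:

df = pd.read_csv('heart.csv')  # path/filename is an assumption
df.head()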
Dataset details: This dataset contains 303 samples and 14 attributes (the original dataset has 76 features; the .csv version keeps a subset of 14 of them).
There are only two classes (0 = healthy, 1 = heart disease), so this is a binary classification problem. The class counts also show that the dataset is imbalanced, which makes the accuracy score of the model less reliable. Nevertheless, we will first feed the imbalanced dataset directly to the Auto-Sklearn classifier. Later, we will adjust the number of samples in the two classes and test the accuracy again to see how the classifier performs.
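We can verify the class balance with a quick count (assuming the dataframe is named df, as above):

# count samples per class
df['target'].value_counts()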
# creating X and y
X = df.drop(['target'], axis=1)
y = df['target']

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Next, we will import the classification module from autosklearn using the following command.
import autosklearn.classification
Then we will create an instance of the AutoSklearnClassifier for the classification task.
Here, we set the maximum time for the whole task using the ‘time_left_for_this_task’ argument, assigning 5*60 seconds (5 minutes) to it. If nothing is specified for this argument, the process runs for an hour, i.e., 60 minutes. We also cap each individual model evaluation at 30 seconds using the ‘per_run_time_limit’ argument.
There are other arguments, such as n_jobs (number of parallel jobs), ensemble_size, and initial_configurations_via_metalearning, which can be used to fine-tune the classifier. By default, the search builds an ensemble of the top-performing models. To avoid overfitting, we could disable this by setting ensemble_size = 1 and initial_configurations_via_metalearning = 0, but we have excluded these settings here to keep the tutorial simple.
We will also provide a temporary path for the log to be saved, and we can use it to print the run details later.
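A minimal sketch of the classifier setup described above (the temporary folder path is an arbitrary choice):

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=5*60,   # total search budget: 5 minutes
    per_run_time_limit=30,          # cap per model evaluation: 30 seconds
    tmp_folder='/tmp/autosklearn_classification_tmp',  # arbitrary log/temp path
)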
Now, we will fit the classifier.
automl.fit(X_train, y_train)
The sprint_statistics() function summarizes the above search and the performance of the selected best model.
pprint(automl.sprint_statistics())
Alternatively, we can also print a leaderboard for all the models considered by the search, organized by their ranks using the following command.
print(automl.leaderboard())
The top two models selected by the classifier were random forest and passive aggressive, respectively.
Additionally, we can print the information about the considered models using the following command:
pprint(automl.show_models())
Lastly, we can also print the final score of the ensemble and the confusion matrix using the following lines of code.
# Score of the final ensemble
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = automl.predict(X_test)
m1_acc_score = accuracy_score(y_test, y_pred)
m1_acc_score

conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True)
We can use the following commands to separate the healthy and unhealthy samples in the dataset.
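A minimal sketch, assuming the dataframe is still named df:

healthy = df[df['target'] == 0]
unhealthy = df[df['target'] == 1]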
Since there are more unhealthy samples than healthy ones, we will use a resampling technique (oversampling) to increase the number of healthy samples in the dataset. To adjust the skew, we can use the following commands:
from sklearn.utils import resample

up_sampled = resample(healthy, replace=True, n_samples=len(unhealthy), random_state=42)
up_sampled = pd.concat([unhealthy, up_sampled])

# check updated class counts
up_sampled['target'].value_counts()
We can also use techniques like SMOTE, ensemble learning (bagging, boosting), or the NearMiss algorithm to address the imbalance in the dataset. Additionally, metrics such as F1-score, precision, and recall can be used to evaluate the model’s performance.
Now that we have adjusted the skew, we will create X and y sets for classification again. Let us name them X1 and y1 to avoid confusion.
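A minimal sketch of this step:

X1 = up_sampled.drop(['target'], axis=1)
y1 = up_sampled['target']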
We need to repeat all the steps from setting up the classifier to printing a confusion matrix for this new X1 and y1. Complete code for this task is available on my GitHub repository.
Finally, we can compare the two accuracies for skewed data and adjusted data using –
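One possible way to draw the comparison, assuming the accuracy of the second run was stored in m2_acc_score (a hypothetical variable name; the original notebook may differ):

# m2_acc_score is hypothetical: the accuracy from the run on the oversampled data
plt.bar(['Skewed data', 'Adjusted data'], [m1_acc_score, m2_acc_score])
plt.ylabel('Accuracy')
plt.show()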
From the above chart, we can see that the model accuracy has dropped slightly after oversampling, but the model is now better optimized for both classes. Although we used quite a few additional commands to preprocess the data and evaluate the results, running the Auto-Sklearn classifier itself requires only a single line of code. Even with skewed data, the accuracy achieved by the model is quite good.
Regression Task
In this section, we will use the regression models from Auto-Sklearn.
For this task, let us use the simple ‘flights’ dataset from the seaborn datasets library. We will load the dataset with the following command.
# loading the dataset
df = sns.load_dataset('flights')
df.head()
Dataset details: This dataset contains 144 rows and 3 columns, namely year, month, and the number of passengers.
The task here is to predict the number of passengers using the other two features. We now use the AutoSklearnRegressor for this regression task and then print the statistics of the model.
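A minimal sketch of the regression setup, mirroring the classifier configuration above (the time limits and the X/y names are the same assumptions as before; note that seaborn loads the month column with the pandas category dtype, which auto-sklearn can handle):

import autosklearn.regression

# create features and target
X = df[['year', 'month']]
y = df['passengers']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=5*60,  # total search budget: 5 minutes
    per_run_time_limit=30,         # cap per model evaluation: 30 seconds
)
automl.fit(X_train, y_train)

# print the statistics of the run
print(automl.sprint_statistics())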
From the above-printed summary, we see that the regressor evaluated a total of 59 models, and the final regression model achieved an R2 of 0.985, which is quite good.
Since the regressor optimizes the R2 metric by default, let us also print the mean absolute error to evaluate the performance of the model better.
# evaluate the best model
from sklearn.metrics import mean_absolute_error

y_pred = automl.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("MAE: %.3f" % mae)
The mean absolute error is acceptable given the R2 value achieved by the model and the small size of the example dataset used for this task.
We can also plot the predicted values against the actual values using matplotlib as shown below.
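One possible way to produce such a plot:

plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # reference line y = x
plt.xlabel('Actual passengers')
plt.ylabel('Predicted passengers')
plt.show()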
Overall, we can say that the MAE value is small and the model achieved a high validation score of 0.985, indicating that the model performance is good.
Saving the Trained Models
The trained models for classification and regression can be saved using the Python packages pickle and joblib. These saved models can then be used to make predictions directly on new data. We can save the models as follows:
Using Pickle
import pickle

# save the model ('model' is the trained estimator, e.g., the automl object above)
filename = 'final_model.sav'
pickle.dump(model, open(filename, 'wb'))
Here, the ‘wb’ argument means that we are writing the file to disk in binary mode. We can later load this saved model as:
# load the model
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
print(result)
Here, the ‘rb’ argument indicates that we are reading the file in binary mode.
Using joblib
Similarly, we can save the trained models with joblib using the following command.
import joblib

# save the model
filename = 'final_model.sav'
joblib.dump(model, filename)
We can also reload these saved models later for predictions on new data.
# load the model from disk
load_model = joblib.load(filename)
result = load_model.score(X_test, y_test)
print(result)
Conclusion
In this article, we saw the application of Auto-Sklearn for both classification and regression tasks. In neither case did we need to specify a particular algorithm; instead, the tool itself iterated through several built-in algorithms and achieved good results (high accuracy in the classification model and low mean absolute error in the regression model). Thus, Auto-Sklearn can be a valuable tool for building better machine learning models with a few lines of code. The complete tutorial for this article is available on my GitHub repository.