GitHub - Stuart-D-King/housing_predictive: Data science practice project from "Hands-On Machine Learning with SciKit-Learn and TensorFlow"

Modeling Housing Prices in California

This repository collects the Python scripts I wrote while following Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. This excellent book walks the reader through a simulated end-to-end machine learning project that uses California census data to build a model predicting housing prices in California. The provided dataset consists of 20,640 observations and 10 attributes.

Data collection and EDA

After downloading the data and saving it to file, I performed initial exploratory data analysis by plotting histograms of all numerical features to get a quick feel for the data. Next, the data was split into training and test sets using Scikit-Learn's StratifiedShuffleSplit class, stratifying on a newly created median income category variable so that both sets preserve the income distribution of the full dataset. Additional visualizations were then created from the training data only, including scatter plots of each district's latitude and longitude coordinates, with marker size determined by population and marker color by median home value. A scatter matrix and a correlation matrix of the numerical attributes highlighted the most promising predictors of median house value; the strongest correlation was between median house value and median income.
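The stratified split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code from get_split_data.py: the column values, the income bin edges, the 20% test size, and the random seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (the real dataset has 20,640 rows).
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000),
    "median_house_value": rng.uniform(50_000, 500_000, size=1000),
})

# Bucket median income into five categories; these bin edges are an assumption.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# One stratified shuffle: each income category keeps its share in both sets.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(housing, housing["income_cat"]))

# The helper category is only needed for splitting, so drop it afterwards.
train_set = housing.iloc[train_idx].drop(columns="income_cat")
test_set = housing.iloc[test_idx].drop(columns="income_cat")
```

Comparing `housing["income_cat"].value_counts(normalize=True)` with the same proportions computed on the test rows is a quick way to confirm that the split preserved the income distribution.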

Data preparation and transformation

To prepare the data for machine learning, a series of functions was created to impute missing values, encode categorical variables as numbers, and one-hot encode (binarize) the resulting numerical codes. In addition, custom classes were created to add new attribute combinations to the dataset and to select user-defined features from it. To streamline this series of transformations, a Scikit-Learn pipeline object was created to impute, encode, and standardize the training and test datasets in a single step.
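A preprocessing pipeline of this kind might look like the sketch below. It is not the code from helper_classes.py: `RatioAdder` is a hypothetical name for an attribute-combining transformer, the toy rows are made up, and Scikit-Learn's `ColumnTransformer` is swapped in here for the custom feature-selector class mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends two per-household ratios."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Assumed column order: total_rooms, population, households.
        rooms_per_household = X[:, 0] / X[:, 2]
        population_per_household = X[:, 1] / X[:, 2]
        return np.c_[X, rooms_per_household, population_per_household]


num_cols = ["total_rooms", "population", "households"]
cat_cols = ["ocean_proximity"]

# Numerical branch: fill gaps with the median, add ratios, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("ratios", RatioAdder()),
    ("scaler", StandardScaler()),
])

# Route each column group to its own transformer; force a dense result.
preprocess = ColumnTransformer(
    [
        ("num", num_pipeline, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0.0,
)

# Made-up rows with one missing value, for illustration only.
toy = pd.DataFrame({
    "total_rooms": [880.0, np.nan, 1467.0, 1274.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
prepared = preprocess.fit_transform(toy)
```

Fitting the pipeline on the training data and then calling only `transform` on the test data keeps test-set statistics (medians, means, standard deviations) out of the model.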

Model selection and training

Three models were fit using the preprocessed training data: a Linear Regression model, a Decision Tree Regressor, and a Random Forest Regressor. After fitting, each model was evaluated with 10-fold cross-validation, and its average root mean squared error (RMSE) across the folds was calculated. The Random Forest Regressor produced the lowest average RMSE and was therefore selected as the best-performing model.
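The comparison above can be sketched as follows, assuming synthetic regression data in place of the prepared housing features. The model settings and seeds are illustrative assumptions, and the ranking on synthetic data says nothing about the ranking on the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed training data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

mean_rmse = {}
for name, model in models.items():
    # Scikit-Learn scorers are "higher is better", so MSE comes back negated.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    rmse_per_fold = np.sqrt(-neg_mse)
    mean_rmse[name] = rmse_per_fold.mean()
    print(f"{name}: mean RMSE = {mean_rmse[name]:.2f}")
```

Looking at the spread of `rmse_per_fold` as well as its mean helps judge whether the gap between two models is larger than the fold-to-fold noise.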

Model tuning

To select the best model hyperparameters, I used Scikit-Learn's GridSearchCV, which evaluates every combination in a specified grid of hyperparameter values with cross-validation and reports the combination with the best score.
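A grid search of this kind can be sketched as below. The parameter grid, the number of folds, and the synthetic data are assumptions for the example; the grid actually searched in this project is not listed here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Illustrative grid: 2 x 3 = 6 hyperparameter combinations.
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4, 8]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, f"RMSE = {best_rmse:.2f}")
```

By default `GridSearchCV` refits the best combination on the full training data, so `search.best_estimator_` is ready to use for prediction.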

Feature importance

Once the best estimator (the model with the set of hyperparameters that gave the lowest RMSE) was determined, I examined the feature importance score of each attribute. Based on this analysis, median income, inland (an ocean proximity category), and population per household were the three most useful predictors of housing prices.
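Reading importance scores from a fitted random forest can be sketched as follows. The data is synthetic and the feature names are generic placeholders, so the ranking printed here is unrelated to the housing results reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder names; the real attribute names come from the prepared dataset.
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]
X, y = make_regression(
    n_samples=300, n_features=4, n_informative=2, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Pair each score with its name and sort from most to least important.
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.3f}")
```

With a grid search, the same attribute is available on the tuned model as `best_estimator_.feature_importances_`. The scores are normalized to sum to 1, so each one reads as a share of the total importance.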

Evaluation on test set

Finally, the Random Forest Regressor model was evaluated on the test set.
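A minimal sketch of this last step, assuming synthetic data and a simple random hold-out split in place of the stratified housing split. The point it illustrates is that the test data passes through the already-fitted preprocessing and model exactly once, with no refitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler and model are fit on the training data only.
final_model = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(n_estimators=30, random_state=42)),
])
final_model.fit(X_train, y_train)

# predict() reuses the fitted scaler, so no test statistics leak in.
test_predictions = final_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"Test RMSE: {test_rmse:.2f}")
```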

Folders and files

datasets/housing
models
plots
README.md
eda.py
get_split_data.py
helper_classes.py
prep_run_ml.py
