
Using Regression Models to Predict Home Affordability Ratios

Author: Freda Xin


Table of Contents

Please note that the notebooks in the Data Collection and Initial Data Cleaning section are demonstrations from early-stage development; each notebook has a corresponding Python script that handles the repeated tasks.


Problem Statement:

In regions undergoing rapid new urban development, home prices tend to become less affordable during the initial phase of that development. This phenomenon has occurred in many U.S. regions, such as San Francisco, Denver, and NYC, where residents' incomes could not keep up with the fast rise in home prices.

The home affordability ratio is defined as the ratio of the median home price to the median annual income. "Historically a house in the US cost around 3 to 4 times the median annual income. During the housing bubble of 2007 the ratio surpassed 5 - in other words, the median price for a single family home in the United States cost more than 5 times the US median annual household income" (reference). In recent years, many U.S. neighborhoods have become unaffordable for their residents.
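As a quick illustration of the ratio (with made-up numbers, not figures from the project data):

```python
# Illustrative only: hypothetical median values for a single ZIP code.
median_home_price = 650_000      # e.g., from a home-price source
median_annual_income = 85_000    # e.g., from Census household-income data

# Home affordability ratio = median home price / median annual household income.
affordability_ratio = median_home_price / median_annual_income
print(round(affordability_ratio, 2))  # 7.65 -> well above the historical 3-4 range
```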

In this project, I explore whether commercial activities in a given neighborhood can be predictive of home affordability ratios. I use the Google Places API to gather data about neighborhood businesses, such as business types, opening hours, and price levels. To calculate the home affordability ratio, I use Census data from incomebyzipcode.com. The aim of the project is to develop a regression model that can make quick predictions given the latest commercial activities in a neighborhood. This approach has the advantage of being up to date compared to traditional methods based solely on census information. It could also serve as an early indicator for any municipality: if certain patterns of business activity emerge, a neighborhood becoming unaffordable might follow.
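The data collection itself is handled by the Python scripts in the repo; the snippet below is only a hedged sketch of what a Places Text Search request can look like. The query string, returned field names, and `API_KEY` placeholder are assumptions for illustration, not necessarily the exact parameters used in the project.

```python
import requests

TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"
API_KEY = "YOUR_GOOGLE_PLACES_API_KEY"  # placeholder

def businesses_near_zcta(zcta: str) -> list:
    """Return basic business records (types, open_now, price_level) near one ZCTA."""
    params = {"query": f"stores near zipcode {zcta}", "key": API_KEY}
    response = requests.get(TEXT_SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()
    results = response.json().get("results", [])
    return [
        {
            "zcta": zcta,
            "types": place.get("types", []),
            "open_now": place.get("opening_hours", {}).get("open_now"),
            "price_level": place.get("price_level"),
        }
        for place in results
    ]
```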


Executive Summary

I developed supervised regression models to predict home affordability ratios and used the R2 score as the metric to measure model performance.

During the data collection process, I encountered several limitations of the Google Places API, such as the cap on the number of businesses returned per location and the inaccuracy of string search results. It is also worth noting that all Census data rely on ZIP Code Tabulation Areas (ZCTAs), which are different from ZIP codes; when making Google API calls, ZCTAs were used instead of ZIP codes.

During the EDA process, I found that the distribution of the target home affordability ratios is right-skewed. This confirmed common knowledge: some areas in New York City (and New York State) are extremely unaffordable. I also discovered that most features do not have a linear relationship with the target, which might indicate that models relying on linear regression will not perform well. When analyzing the correlations between the features and the target, I found that the feature 'open_now' had the strongest positive correlation with the target. I further analyzed the pros and cons of using this feature to train the models.
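A minimal sketch of those two EDA checks, assuming a cleaned DataFrame with an `affordability_ratio` target column (the file path and column names are illustrative, not the project's actual names):

```python
import pandas as pd

df = pd.read_csv("data/clean_nys.csv")  # hypothetical path

# Right-skew check on the target: a positive skew confirms the long right tail.
print(df["affordability_ratio"].skew())

# Rank numeric features by their correlation with the target.
correlations = (
    df.corr(numeric_only=True)["affordability_ratio"]
      .drop("affordability_ratio")
      .sort_values(ascending=False)
)
print(correlations.head())  # 'open_now' ranked highest in this project
```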

The modeling process is divided into 3 phases:

Phase 1: the Naive Approach
The so-called “Naive Approach” is based on a simple idea: we want to be able to use ALL the original observations to train the models.

Motivated by this idea, I engineered aggregated features for each zipcode and concatenated them back onto the original DataFrame (see the sketch after the list below).

In theory, the benefit of this approach is twofold:

  • We would be able to capture the general information of each zipcode.
  • We would be able to retain the same amount of data as the original dataset. This ensures that we have a sufficient amount of data to train the models.
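A minimal sketch of that aggregation idea, assuming a business-level DataFrame with a `zcta` column and numeric business features (the file path and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("data/clean_nys.csv")  # hypothetical path

# groupby(...).transform(...) keeps the original row count, so the per-zipcode
# means can be concatenated next to the business-level features.
agg_features = (
    df.groupby("zcta")[["price_level", "open_now"]]
      .transform("mean")
      .add_prefix("zcta_mean_")
)
df_naive = pd.concat([df, agg_features], axis=1)
```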

Phase 2: Aggregation
I used the observations aggregated by zipcode, combined with census data from the Income dataset, to train the models. I used the pattern sub-model technique to handle missing data, which made imputation or dropping observations unnecessary. Six types of models were trained: Linear Regression (combined with various regularization techniques), Polynomial Regression, KNN, tree-based models, SVR, and Stochastic Gradient Descent. Each model was fitted with two datasets based on the patterns defined by the sub-model method.
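The pattern sub-model technique is not a single scikit-learn estimator; the sketch below only shows the general idea under simplifying assumptions (illustrative file path and column names, a plain LinearRegression standing in for the six model families, and whatever missingness patterns appear in the data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/aggregated_by_zcta.csv")  # hypothetical path
target = "affordability_ratio"

# Group rows by their missing-data pattern (a tuple of which features are NaN).
patterns = df.drop(columns=target).isna().apply(tuple, axis=1)

models = {}
for pattern, subset in df.groupby(patterns):
    X = subset.drop(columns=target).dropna(axis=1)  # keep only observed columns
    y = subset[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    models[pattern] = model
    print(pattern, model.score(X_test, y_test))  # test R^2 for each sub-model
```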

Phase 3: Generalization
In this phase, none of the features from the Census data were used to train the models; the model was trained on New York State data only. To test the model's transferability, it was then used to make predictions on the LA dataset.
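A hedged sketch of that transferability test, using Lasso (linear regression with L1 regularization) and illustrative file paths, column names, and hyperparameters:

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

nys = pd.read_csv("data/nys_places_features.csv")  # hypothetical path
la = pd.read_csv("data/la_places_features.csv")    # hypothetical path
target = "affordability_ratio"

# Train on New York State Google Places features only (no census columns).
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(nys.drop(columns=target), nys[target])

# Score on LA: a low or negative R^2 indicates the model does not transfer.
print(model.score(la.drop(columns=target), la[target]))
```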

The general workflow of this project is shown in the following flowchart.


Workflow

[Workflow flowchart]

Conclusion

Phase 1: the Naive Approach

I discovered that the naive approach led to a data leakage issue and was therefore invalid. However, this does NOT mean that aggregated features can never be used alongside the original dataset; we just can't aggregate the observations the SAME way that we aggregated the target.

Phase 2: Aggregation

In Phase 2, I used the observations aggregated by zipcode, combined with census data from the Income dataset, to train the models.

The pattern sub-model enabled me to handle missing data without imputation or dropping observations. Among all the regression models, the BaggingRegressor yielded the best result based on the test R2 score. However, even the best model still shows signs of high variance, and the result is not optimal.
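For reference, a minimal sketch of fitting and cross-validating a BaggingRegressor on the aggregated data; the path, column names, and hyperparameters are assumptions, not the tuned values from the notebooks:

```python
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data/aggregated_by_zcta.csv")  # hypothetical path
X = df.drop(columns="affordability_ratio").dropna(axis=1)
y = df["affordability_ratio"]

# BaggingRegressor defaults to bagging decision trees.
bagging = BaggingRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(bagging, X, y, cv=5, scoring="r2")

# A large gap between these scores and the training R^2 signals high variance.
print(scores.mean(), scores.std())
```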

Based on the model evaluation, features from the census data play an important role in the model's performance. This finding posed a challenge to my goal of generalizing the model, i.e., using it to predict home affordability ratios in other U.S. regions without retraining.

Phase 3: Generalization

Without the census data, the performance of the model decreased. Training only on the Google Places API data, the LinearRegression model combined with L1 regularization outperformed the others.

Using the trained LinearRegression model to make predictions on the LA dataset, the model performed poorly. Based on the test scores, as well as the residual plots, we can conclude that the model trained on the New York State dataset is inadequate for predicting home affordability ratios in LA, and that the trained model is not transferable.

This outcome makes intuitive sense: New York State and LA have very different commercial landscapes and demographics (such as median home prices and median annual incomes). A model fitted and trained on the former is not able to capture the patterns of the latter.


Limitations and Next Steps

Next Steps

Step 1: Improve data quality:

  • Collect more data: during the early modeling process, I observed that the models' performance drastically improved after being trained with more data points (expanding from NYC only to all of NYS). Since the number of observations is still small (n = 1521), I will try to collect more data to improve the model's performance.
  • Sample data from different regions of the U.S. and stratify the samples: to help the model learn from a wider range of data, I will randomly sample zipcodes across the United States (see the sketch after this list).
  • If Step 1 is accomplished but the model's predictive ability does not improve significantly, I will move on to Step 2.
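A hedged sketch of the stratified sampling idea mentioned in the list above, assuming a hypothetical ZIP-code/region crosswalk file and illustrative column names:

```python
import pandas as pd

zips = pd.read_csv("data/zcta_region_crosswalk.csv")  # hypothetical path

# Draw the same number of ZIP codes from every region so no single region dominates.
sampled_zips = (
    zips.groupby("region", group_keys=False)
        .apply(lambda region_df: region_df.sample(n=50, random_state=42))
)
```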

Step 2: Reevaluate the assumptions:

  • Research which factors have been shown to relate to the home affordability ratio: the first and foremost assumption of this project is that commercial activities can be predictive of the home affordability ratio. This assumption may well be unsound. During this project, I confirmed that including census data increases model performance; in other words, the commercial activity information collected from the Google Places API alone is not as predictive as when combined with census data. Many other factors may be linked to the home affordability ratio, and further research needs to be conducted.

Limitations

Using the Google Places API imposed several limitations on the data collection process:

  • When using string search, the results returned are unpredictable: e.g., an API call with the query "stores near zipcode 10010" returned unexpected results, such as a school or a government office.
  • The API only returns up to 3 pages of results per location, with up to 20 results per page. This limits the number of samples that can be collected per location.
  • The API returns 20 results per call, and it is unclear what algorithm is behind these results (e.g., proximity to the location centroid or the popularity of the business as a search result), so the businesses returned might be biased by Google's ranking algorithm.
  • As mentioned in notebook 04, the 'open_now' feature depends on the time when the API calls are made.
  • The Google Places API is not free: making many API calls can make the project very expensive. Therefore, the amount of data that can be collected may be limited by the project's budget.

References
