Author: Freda Xin
Please note that the notebooks in the Data Collection and Initial Data Cleaning section are just demonstrations for early-stage development; each notebook has a corresponding Python script that handles the repeated tasks.
- Data Collection and Initial Data Cleaning
- EDA
- Modeling
In regions where rapid new urban development emerges, home prices tend to become less affordable during the initial phase of the development. This phenomenon has occurred in many U.S. regions, such as San Francisco, Denver, and NYC, where residents' income levels could not keep up with the fast increase in home prices.
The home affordability ratio is defined as the ratio of the median home price to the median annual income. "Historically a house in the US cost around 3 to 4 times the median annual income. During the housing bubble of 2007 the ratio surpassed 5 - in other words, the median price for a single family home in the United States cost more than 5 times the US median annual household income" (reference). In recent years, many U.S. neighborhoods have become unaffordable for their residents.
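As a minimal illustration of the definition (the numbers below are made up for the example, not actual Census figures):

```python
# Home affordability ratio = median home price / median annual household income.
# The figures below are illustrative only.
median_home_price = 550_000      # hypothetical median single-family home price (USD)
median_annual_income = 110_000   # hypothetical median annual household income (USD)

affordability_ratio = median_home_price / median_annual_income
print(f"Home affordability ratio: {affordability_ratio:.1f}")  # -> 5.0
```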
In my project, I will explore whether commercial activities in a given neighborhood can be predictive of home affordability ratios. I will use the Google Places API to gather data about neighborhood businesses, such as type, opening hours, price level, etc. To calculate the home affordability ratio, I will use Census data from incomebyzipcode.com. The aim of the project is to develop a regression model that can make quick predictions given the latest commercial activities in a neighborhood. This approach has the advantage of being up to date compared to the traditional method based on census information. It can also serve as an early indicator for any municipality: if certain patterns of business activity emerge, the neighborhood may soon become unaffordable.
I developed supervised regression models to predict affordability and used the R2 score as the metric to measure the models' performance.
During the data collection process, I encountered many limitations of the Google Places API, such as the limit on the number of businesses returned per location and the inaccuracy of string search results. It is also worth noting that all Census data rely on ZIP Code Tabulation Areas (ZCTAs), which are different from ZIP codes. When making Google API calls, ZCTAs were used instead of ZIP codes.
During the EDA process, I found that the distribution of the target (the home affordability ratio) is right-skewed. This confirmed common knowledge: some areas in New York City (and New York State) are extremely unaffordable. I also discovered that most features do not have a linear relationship with the target, which might indicate that models relying on linear regression will not perform well. When analyzing the correlations between the features and the target, I found that 'open_now' was the feature most positively correlated with the target. I further analyzed the pros and cons of using this feature to train the models.
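A minimal pandas sketch of the kind of checks described above; the file name and column names (including `affordability_ratio` as the target) are assumptions about the cleaned dataset, not the project's actual schema:

```python
import pandas as pd

# Assumed cleaned, business-level dataset; names are illustrative.
df = pd.read_csv("cleaned_places_data.csv")

# Right-skew check on the target: a skewness well above 0 indicates a long right tail.
print("Target skewness:", df["affordability_ratio"].skew())

# Pearson correlation of each numeric feature with the target, sorted descending.
corr_with_target = (
    df.corr(numeric_only=True)["affordability_ratio"]
      .drop("affordability_ratio")
      .sort_values(ascending=False)
)
print(corr_with_target.head(10))  # in this project, 'open_now' ranked at the top
```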
The modeling process is divided into 3 phases:
Phase 1: the Naive Approach
The so-called "Naive Approach" is based on a simple idea: we want to be able to use ALL the original observations to train the models. Motivated by this idea, I engineered aggregated features for each zipcode and concatenated them back onto the original dataframe (a code sketch follows the list of benefits below).
In theory, the benefit of this approach is twofold:
- We would be able to capture the general information of each zipcode.
- We would be able to retain the same amount of data as the original dataset. This ensures that we have a sufficient amount of data to train the models.
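A minimal sketch of this feature engineering step, assuming a business-level dataframe with `zipcode`, `place_id`, `price_level`, and `open_now` columns (illustrative names, not the project's actual schema):

```python
import pandas as pd

# Assumed business-level dataframe; column names are illustrative.
df = pd.read_csv("cleaned_places_data.csv")

# Aggregate business-level features per zipcode (ZCTA) ...
zip_aggs = (
    df.groupby("zipcode")
      .agg(
          n_businesses=("place_id", "count"),
          mean_price_level=("price_level", "mean"),
          share_open_now=("open_now", "mean"),
      )
      .reset_index()
)

# ... and merge them back onto the original, business-level dataframe,
# so that every original observation is retained for training.
df_naive = df.merge(zip_aggs, on="zipcode", how="left")
```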
Phase 2: Aggregation
I used the observations aggregated by zipcode, combined with census data from the Income dataset, to train the models. I used the pattern sub-model technique to handle missing data, which made it possible to handle the missing values without imputation or dropping observations. Six types of models were trained: Linear Regression (combined with various regularization techniques), Polynomial Regression, KNN, tree-based models, SVR, and Stochastic Gradient Descent. Each model was fitted with two datasets based on the patterns defined by the sub-model method.
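A minimal sketch of the pattern sub-model idea: one model is fitted per missing-data pattern, using only the columns that are fully observed within that pattern. The function name and the use of LinearRegression as the per-pattern estimator are illustrative choices, not the project's exact implementation:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_pattern_submodels(X: pd.DataFrame, y: pd.Series) -> dict:
    """Fit one regression model per missingness pattern (illustrative sketch)."""
    # Encode each row's missingness pattern as a string such as "0010".
    patterns = X.isna().astype(int).astype(str).agg("".join, axis=1)
    submodels = {}
    for pattern in patterns.unique():
        rows = patterns == pattern
        # Keep only the columns that are observed ("0") in this pattern.
        observed_cols = [c for c, flag in zip(X.columns, pattern) if flag == "0"]
        model = LinearRegression()
        model.fit(X.loc[rows, observed_cols], y[rows])
        submodels[pattern] = (observed_cols, model)
    return submodels
```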
Phase 3: Generalization
In this phase, none of the features from the Census data were used to train the models. The model was trained on New York State data. To test the transferability of the model, the trained model was used to make predictions on the LA dataset.
The general workflow of this project is shown in the following flowchart.
I discovered that the naive approach led to a data leakage issue and was therefore invalid. However, this does NOT mean that aggregated features can never be used alongside the original dataset; we just can't aggregate the observations the SAME way that we aggregated the target.
In Phase 2, I used the observations aggregated by zipcode, combined with census data from the Income dataset, to train the models. The pattern sub-model technique enabled me to handle missing data without imputation or dropping observations.
Among all the regression models, the BaggingRegressor yielded the best result based on the test R2 score. However, even the best model still shows signs of high variance, and the result is not optimal.
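A sketch of the kind of fit and evaluation behind this comparison; the file name, column names, and hyperparameters are illustrative assumptions, not the tuned configuration:

```python
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

# Assumed zipcode-level dataset from Phase 2; names are illustrative.
df_zip = pd.read_csv("zipcode_level_data.csv")
X = df_zip.drop(columns=["affordability_ratio"])
y = df_zip["affordability_ratio"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# BaggingRegressor (the default base estimator is a decision tree).
bagging = BaggingRegressor(n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

# A large gap between train and test R2 is the "high variance" symptom noted above.
print("Train R2:", bagging.score(X_train, y_train))
print("Test  R2:", bagging.score(X_test, y_test))
```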
Based on the model evaluation, features from the census data play an important role in the model's performance. This finding posed challenges to my goal of generalizing the model, i.e., using it to predict home affordability ratios in other U.S. regions without retraining.
Without the census data, the performance of the models decreased. Training only with the Google Places API data, the LinearRegression model combined with L1 regularization outperformed the others.
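In scikit-learn terms, linear regression with an L1 penalty corresponds to Lasso. The sketch below reuses the train/test split from the previous snippet; `places_cols` and the `alpha` value are illustrative assumptions:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Restrict the features to the Google Places columns only (illustrative names).
places_cols = ["n_businesses", "mean_price_level", "share_open_now"]

# Standardize, then fit a Lasso (linear regression with an L1 penalty).
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_model.fit(X_train[places_cols], y_train)
print("Test R2:", lasso_model.score(X_test[places_cols], y_test))
```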
Using the trained LinearRegression model to make predictions on the LA dataset, the model performed badly. Based on the test scores, as well as the residual plots, we can conclude that the model trained on the New York State dataset is inadequate for predicting home affordability ratios in LA, and that the trained model is not transferable.
This outcome makes intuitive sense: New York State and LA have very different commercial landscapes and demographics (such as median home prices and median annual incomes). A model fitted and trained on the former is not able to capture the patterns of the latter.
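A sketch of the transferability check, reusing `lasso_model` and `places_cols` from the previous snippets; the LA file name and columns are assumed to mirror the New York State data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Assumed LA dataset with the same (illustrative) schema as the NYS data.
df_la = pd.read_csv("la_zipcode_level_data.csv")
X_la, y_la = df_la[places_cols], df_la["affordability_ratio"]

y_la_pred = lasso_model.predict(X_la)
print("LA R2:", r2_score(y_la, y_la_pred))

# Residuals vs. predictions: systematic structure (rather than random scatter
# around zero) is one sign that the model does not transfer to the LA data.
residuals = y_la - y_la_pred
plt.scatter(y_la_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted affordability ratio")
plt.ylabel("Residual")
plt.show()
```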
Step 1: Improve data quality:
- Collect more data: during the early modeling process, I observed that the models' performance drastically improved after being trained on more data points (from using only NYC to using all of NYS). Since the number of observations is small (n = 1,521), I will try to collect more data to improve the model's performance.
- Sample data from different regions in the U.S. and stratify the samples: to help the model learn from a wider range of data, I will try to randomly sample zipcodes across the country.
- If Step 1 has been accomplished and there is no significant improvement in the model's ability to make predictions, I will do the following:
Step 2: Reevaluate the assumptions:
- Research what factors have been shown to be related to the home affordability ratio: the first and foremost assumption of this project is that commercial activities can be predictive of the home affordability ratio. This assumption is very likely to be unsound. During this project, I confirmed that including census data increases model performance; in other words, the commercial activity information collected from the Google Places API alone is not as predictive as it is when combined with census data. There might be many factors linked to the home affordability ratio, and further research needs to be conducted.
Using the Google Places API has many limitations in the data collection process:
- When using string search, the results returned are unpredictable: e.g., when making an API call with the query "stores near zipcode 10010", unexpected results were returned, such as a school or a government office.
- The API only allows up to 3 calls per location (each call returns up to 20 results). This limits the number of samples that can be collected per location (see the request sketch after this list).
- The API returns 20 results per call, and it is unclear what algorithm is behind these results (e.g., proximity to the location centroid or the popularity of the business as a search result), so the businesses returned by the calls might be biased by Google's algorithms.
- As mentioned in notebook 04, the 'open_now' feature is dependent on the time when the API calls are made.
- The Google Places API is not free: making many API calls can make the project very expensive. Therefore, the amount of data that can be collected may be very limited, depending on the project's budget.
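For reference, a minimal sketch of the paginated Text Search calls described above, using the legacy Places API web endpoint; error handling and the query string are simplified, and the API key is a placeholder:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def search_places(query: str, api_key: str = API_KEY) -> list:
    """Collect up to 3 pages (roughly 60 results) for one text query."""
    results, params = [], {"query": query, "key": api_key}
    for _ in range(3):  # the API returns at most 3 pages of 20 results each
        resp = requests.get(URL, params=params).json()
        results.extend(resp.get("results", []))
        token = resp.get("next_page_token")
        if not token:
            break
        time.sleep(2)  # the next_page_token takes a moment to become valid
        params = {"pagetoken": token, "key": api_key}
    return results

businesses = search_places("stores near zipcode 10010")
```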
- Home Price-to-Income Ratios by Joint Center for Housing Studies of Harvard University
- Home Price to Income Ratio by longtermtrends.net
- The Impact Of Commercial Development On Surrounding Residential Property Values by Jonathan A. Wiley, Ph.D.
- Predicting Neighborhoods’ Socioeconomic Attributes Using Restaurant Data by Lei Dong, Carlo Ratti, and Siqi Zheng
- Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life by Edward L. Glaeser, Scott Duke Kominers, Michael Luca, Nikhil Naik