
Grab AI for S.E.A.


Note to Evaluators: GitHub does not allow pushing files over 100 MB, so the raw data had to be removed from the repo. Kindly download the raw data from https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/traffic-management.zip to proceed with my submission. Thank you.

This repo is my submission for the Grab AI for S.E.A. - Traffic Management challenge.

  1. Problem Statement
  2. Solution Approach
  3. Hypotheses
  4. Features
  5. Target
  6. Model Algorithm
  7. Train-Test-Split
  8. Model Evaluation
  9. Results
  10. For Evaluators
  11. Credits

Traffic Management

Problem Statement

Economies in Southeast Asia are turning to AI to solve traffic congestion, which hinders mobility and economic growth. The first step in the push towards alleviating traffic congestion is to understand travel demand and travel patterns within the city.

Can we accurately forecast travel demand based on historical Grab bookings to predict areas and times with high travel demand?

Solution Approach

Participants are given 2 months (61 days) of data in 15-minute intervals (buckets) and are required to forecast T+1 to T+5 given all data up to time T. There are 1,329 geographic locations (in geohash6) to forecast, making this a multi-step time series forecasting problem. A common approach is to train a separate model for each location. There are also "missing data points", or gaps, in the time series, which we are required to treat as zero aggregated demand. While it is possible to build a model for each location using ARIMA, training over a thousand models would consume a lot of time and resources. My approach is therefore to train a single model that predicts for multiple locations at the same time; the model then has access to more data and can learn the subtle patterns that repeat across locations. In effect, I am treating the problem as multivariate multi-step time series forecasting framed as a traditional supervised ML problem, so that traditional ML algorithms can solve it.
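The gap-filling step can be sketched as follows: a minimal sketch assuming a pandas DataFrame `df` with the challenge's raw columns (`geohash6`, `day`, `timestamp`, `demand`); the anchor date is an arbitrary assumption, and this is not necessarily the notebook's exact code.

```python
import pandas as pd

# Combine "day" (integer 1..61) and "timestamp" (e.g. "9:15") into a datetime.
base = pd.Timestamp("2019-01-01")  # arbitrary anchor date (assumption)
hm = pd.to_datetime(df["timestamp"], format="%H:%M")
df["dt"] = (base
            + pd.to_timedelta(df["day"] - 1, unit="D")
            + pd.to_timedelta(hm.dt.hour, unit="h")
            + pd.to_timedelta(hm.dt.minute, unit="m"))

# Resample each location to regular 15-minute buckets; empty buckets sum
# to zero, which implements "treat gaps as zero aggregated demand".
demand = (df.set_index("dt")
            .groupby("geohash6")["demand"]
            .resample("15min").sum()
            .reset_index())
```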

Hypotheses

By creating a single model for all locations, I am hypothesizing that certain features influence travel demand across locations: for example, the time of day (demand rises during rush hours) and clusters based on proximity (locations close to each other have similar demand).

The "day" and "timestamp" given can be used to create "day", "day of week", "day of month", "hour", "minute" features. Clustering using k-means on the latitude and longitude decoded from geohash6 can be used to create clusters of location with close proximity. Lastly, as time-series problem, the lagged time series will be created as features for the model to learn about past information by bringing them forward to the present.

Target

It should be pointed out, however, that a traditional ML model predicts only one value per input row, while this challenge requires forecasting 5 values. To do so, we use 5 data points from the past: the model is trained to predict 5 steps into the future, so that given a series of data up to T, the data point at T forecasts T+5. To forecast T+1, T+2, T+3 and T+4, we use the data points at T-4, T-3, T-2 and T-1. In effect, we use T-4, T-3, T-2, T-1 and T to forecast T+1, T+2, T+3, T+4 and T+5, respectively.
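A minimal sketch of this framing, with column names continuing from the snippets above:

```python
# Target: demand five buckets ahead. The rows at times T-4 .. T then
# carry the forecasts for T+1 .. T+5.
demand["target"] = demand.groupby("geohash6")["demand"].shift(-5)

# Rows without a known T+5 value (the last 5 buckets per location) are
# dropped for training, but they are exactly the inputs used at forecast time.
supervised = demand.dropna(subset=["target"])
latest = demand.groupby("geohash6").tail(5)  # inputs for T+1 .. T+5
```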

Model Algorithm

The model will be trained using LightGBM and XGBoost. LightGBM works well on large datasets and is faster to train than XGBoost; it can also work with integer-encoded categorical features without one-hot encoding, and it can generate prediction intervals for the forecast (by means of quantile regression). We will train two models, one with each library, and then combine the results by averaging their predictions (ensemble modeling).
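An illustrative sketch of the two-model ensemble; the hyperparameters are placeholders rather than the tuned values from the notebooks, and `X_train`, `y_train`, `X_valid` come from the time-based split described in the next section.

```python
import lightgbm as lgb
import xgboost as xgb

# LightGBM consumes the integer-encoded categoricals directly
lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
lgb_model.fit(X_train, y_train,
              categorical_feature=["geo_encoded", "geo_cluster", "dow"])

xgb_model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
xgb_model.fit(X_train, y_train)

# Ensemble: average the two models' predictions
pred = (lgb_model.predict(X_valid) + xgb_model.predict(X_valid)) / 2

# For prediction intervals, LightGBM can be refit with a quantile
# objective, e.g. LGBMRegressor(objective="quantile", alpha=0.9).
```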

Train-Test-Split

There are 61 days, or 5,856 buckets (after resampling). I split the first 47 days (4,512 buckets) into the training set and the last 14 days (1,344 buckets) into the validation set, roughly a 77:23 ratio. Ideally, we should have a backtesting strategy using the sliding window and expanding window approaches to test the model, as suggested by Uber, but perhaps in the next iteration.
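A sketch of the split on the `supervised` frame from above; for time series the split must be chronological, never a random shuffle.

```python
import pandas as pd

# First 47 days (4,512 buckets) for training, last 14 days (1,344) for validation
cutoff = supervised["dt"].min() + pd.Timedelta(days=47)
train_df = supervised[supervised["dt"] < cutoff]
valid_df = supervised[supervised["dt"] >= cutoff]

# Feature columns (exact list is an assumption; note "demand" itself is a feature)
FEATURES = [c for c in supervised.columns if c not in ("geohash6", "dt", "target")]
X_train, y_train = train_df[FEATURES], train_df["target"]
X_valid, y_valid = valid_df[FEATURES], valid_df["target"]
```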

Model Evaluation

As stated in the challenge, submissions will be evaluated by RMSE (root mean squared error) averaged over all (geohash6, 15-minute bucket) pairs. I use both MAE and RMSE to evaluate the model on the validation set.
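A sketch of the evaluation, with `pred` coming from the ensemble above and the metric functions from scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_valid, pred)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
print(f"MAE: {mae:.4f}  RMSE: {rmse:.4f}")
```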

Results

LightGBM performed best, with the lowest MAE of 0.0303 and RMSE of 0.0508, compared to XGBoost (MAE 0.0317, RMSE 0.0538). The top 5 most important features of the LightGBM model are hour, geo_encoded, demand, geo_cluster and day of week; for the XGBoost model they are hour, demand, geo_encoded, demand_t-1 and geo_cluster. These results suggest that creating additional features from clustering the locations and from the lagged time series was the right approach.

Suggestions for improvement:

  • create more features, e.g. more lagged series
  • use deep learning algorithms, e.g. Long Short-Term Memory (LSTM)
  • adopt a backtesting strategy using the sliding window or expanding window approach to test the model

For Evaluators

  • I have tried my best to write comments for every cell explaining what it does; my apologies for the frequent "# take a look" comments.
  • Cells which take longer to run (on my laptop) have a timer (%%time).
  • I am not sure about the hold-out test dataset details, so my apologies for not providing a step-by-step guide for making predictions on the test set, but in general (see the sketch after this list):
      • pre-process the test data into time series
      • create features from datetime ("day", "dow", "dom", "hour", "minute")
      • create features for lagged series (20 periods)
      • use the same geo_cluster labels, since the set of geohashes is the same in the training and test datasets
      • make sure the shape of the test set is (x, 28)
      • load the pre-trained model ("lgb.pkl")
      • make predictions
      • evaluate results
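A rough end-to-end sketch of those steps. Only the file name "lgb.pkl" and the (x, 28) shape come from this repo; `preprocess_and_engineer` is a hypothetical helper standing in for the resampling and feature steps sketched earlier.

```python
import pickle

# Hypothetical helper: apply the same resampling, datetime features,
# 20 lagged series and geo_cluster labels used at training time.
test = preprocess_and_engineer(raw_test)

X_test = test[FEATURES]
assert X_test.shape[1] == 28  # the pre-trained model expects 28 feature columns

with open("lgb.pkl", "rb") as f:  # pre-trained LightGBM model from the repo
    model = pickle.load(f)

test["pred"] = model.predict(X_test)
```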

Credits

Since the challenge was announced, several starter kits have been made available online to help participants overcome the cold-start problem. I have particularly benefited from the how-to guide by Husein Zolkepli and the Kaggle kernel of Mahadir Ahmad, especially on decoding geohash6, and I would like to record my appreciation here. Thank you.



Visualization:

[Images: clustering, lgbm_f (LightGBM feature importance), xgb_f (XGBoost feature importance), comparison]
