
Grab AI for S.E.A.


Note to Evaluators: GitHub does not allow pushing files over 100 MB, so the raw data had to be removed from the repo. Kindly download the raw data from https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/traffic-management.zip to proceed with my submission. Thank you.

This repo is my submission for the Grab AI for S.E.A. - Traffic Management challenge.

  1. Problem Statement
  2. Solution Approach
  3. Hypotheses
  4. Features
  5. Target
  6. Model Algorithm
  7. Train-Test-Split
  8. Model Evaluation
  9. Results
  10. For Evaluators
  11. Credits

Traffic Management

Problem Statement

Economies in Southeast Asia are turning to AI to solve traffic congestion, which hinders mobility and economic growth. The first step in the push towards alleviating traffic congestion is to understand travel demand and travel patterns within the city.

Can we accurately forecast travel demand based on historical Grab bookings to predict areas and times with high travel demand?

Solution Approach

Participants are given 2 months (61 days) of data in 15-minute intervals (buckets) and are required to forecast T+1 to T+5 given all data up to time T. There are 1,329 geographic locations (in geohash6) to forecast, making this a multi-step time series forecasting problem. A common approach is to train a separate model for each location. There are also "missing data points", or gaps, in the time series, which we are required to treat as zero aggregated demand. While it is possible to build a model for each location using ARIMA, training over a thousand models would consume a lot of time and resources. My approach is therefore to train a single model that predicts for multiple locations at the same time; the model then has access to more data and can learn the subtle patterns that repeat across locations. In effect, I am treating the problem as multivariate multi-step time series forecasting framed as a traditional supervised ML problem, so that traditional ML algorithms can solve it.
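The gap-filling step can be sketched as follows: a minimal sketch assuming a pandas DataFrame `df` with the challenge's raw columns (`geohash6`, `day`, `timestamp`, `demand`); the anchor date is an arbitrary assumption, and this is not necessarily the notebook's exact code.

```python
import pandas as pd

# Combine "day" (integer 1..61) and "timestamp" (e.g. "9:15") into a datetime.
base = pd.Timestamp("2019-01-01")  # arbitrary anchor date (assumption)
hm = pd.to_datetime(df["timestamp"], format="%H:%M")
df["dt"] = (base
            + pd.to_timedelta(df["day"] - 1, unit="D")
            + pd.to_timedelta(hm.dt.hour, unit="h")
            + pd.to_timedelta(hm.dt.minute, unit="m"))

# Resample each location to regular 15-minute buckets; empty buckets sum
# to zero, which implements "treat gaps as zero aggregated demand".
demand = (df.set_index("dt")
            .groupby("geohash6")["demand"]
            .resample("15min").sum()
            .reset_index())
```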

Hypotheses

By creating a single model for all locations, I am hypothesizing that certain features influence travel demand across locations: for example, the time of day (demand rises during rush hours) and clusters based on proximity (locations close to each other have similar demand).

The "day" and "timestamp" given can be used to create "day", "day of week", "day of month", "hour", "minute" features. Clustering using k-means on the latitude and longitude decoded from geohash6 can be used to create clusters of location with close proximity. Lastly, as time-series problem, the lagged time series will be created as features for the model to learn about past information by bringing them forward to the present.

Target

It should be pointed out, however, that a traditional ML model predicts only one value per input row, while this challenge requires forecasting 5 values. To do so, we use 5 data points from the past: the model is trained to predict 5 steps into the future, so that given a series of data up to T, the data point at T forecasts T+5. To forecast T+1, T+2, T+3 and T+4, we use the data points at T-4, T-3, T-2 and T-1. In effect, we use T-4, T-3, T-2, T-1 and T to forecast T+1, T+2, T+3, T+4 and T+5, respectively.
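A minimal sketch of this framing, with column names continuing from the snippets above:

```python
# Target: demand five buckets ahead. The rows at times T-4 .. T then
# carry the forecasts for T+1 .. T+5.
demand["target"] = demand.groupby("geohash6")["demand"].shift(-5)

# Rows without a known T+5 value (the last 5 buckets per location) are
# dropped for training, but they are exactly the inputs used at forecast time.
supervised = demand.dropna(subset=["target"])
latest = demand.groupby("geohash6").tail(5)  # inputs for T+1 .. T+5
```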

Model Algorithm

The model will be trained using LightGBM and XGBoost. LightGBM works well on large datasets and is faster to train than XGBoost; it can also work with integer-encoded categorical features without one-hot encoding, and it can generate prediction intervals for the forecast (by means of quantile regression). We will train two models, one with each library, and then combine the results by averaging their predictions (ensemble modeling).
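An illustrative sketch of the two-model ensemble; the hyperparameters are placeholders rather than the tuned values from the notebooks, and `X_train`, `y_train`, `X_valid` come from the time-based split described in the next section.

```python
import lightgbm as lgb
import xgboost as xgb

# LightGBM consumes the integer-encoded categoricals directly
lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
lgb_model.fit(X_train, y_train,
              categorical_feature=["geo_encoded", "geo_cluster", "dow"])

xgb_model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
xgb_model.fit(X_train, y_train)

# Ensemble: average the two models' predictions
pred = (lgb_model.predict(X_valid) + xgb_model.predict(X_valid)) / 2

# For prediction intervals, LightGBM can be refit with a quantile
# objective, e.g. LGBMRegressor(objective="quantile", alpha=0.9).
```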

Train-Test-Split

There are 61 days, or 5,856 buckets (after resampling). I split the first 47 days (4,512 buckets) into the training set and the last 14 days (1,344 buckets) into the validation set, roughly a 77:23 ratio. Ideally, we should have a backtesting strategy using the sliding window and expanding window approaches to test the model, as suggested by Uber, but perhaps in the next iteration.
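A sketch of the split on the `supervised` frame from above; for time series the split must be chronological, never a random shuffle.

```python
import pandas as pd

# First 47 days (4,512 buckets) for training, last 14 days (1,344) for validation
cutoff = supervised["dt"].min() + pd.Timedelta(days=47)
train_df = supervised[supervised["dt"] < cutoff]
valid_df = supervised[supervised["dt"] >= cutoff]

# Feature columns (exact list is an assumption; note "demand" itself is a feature)
FEATURES = [c for c in supervised.columns if c not in ("geohash6", "dt", "target")]
X_train, y_train = train_df[FEATURES], train_df["target"]
X_valid, y_valid = valid_df[FEATURES], valid_df["target"]
```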

Model Evaluation

As stated in the challenge, submissions will be evaluated by RMSE (root mean squared error) averaged over all (geohash6, 15-minute bucket) pairs. I use both MAE and RMSE to evaluate the model on the validation set.
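A sketch of the evaluation, with `pred` coming from the ensemble above and the metric functions from scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_valid, pred)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
print(f"MAE: {mae:.4f}  RMSE: {rmse:.4f}")
```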

Results

LightGBM performed best, with the lowest MAE of 0.0303 and RMSE of 0.0508, compared to XGBoost (MAE 0.0317, RMSE 0.0538). The top 5 most important features of the LightGBM model are hour, geo_encoded, demand, geo_cluster and day of week; for the XGBoost model they are hour, demand, geo_encoded, demand_t-1 and geo_cluster. These results suggest that creating additional features from clustering the locations and from the lagged time series was the right approach.

Suggestions for improvement:

  • create more features, e.g. more lagged series
  • use deep learning algorithms, e.g. Long Short-Term Memory (LSTM)
  • adopt a backtesting strategy using the sliding window or expanding window approach to test the model

For Evaluators

  • I have tried my best to write comments for every cell explaining what it does; my apologies for the frequent "# take a look" comments.
  • Cells which take longer to run (on my laptop) have a timer (%%time).
  • I am not sure about the hold-out test dataset details, so my apologies for not providing a step-by-step guide for making predictions on the test set, but in general (see the sketch after this list):
      • pre-process the test data into time series
      • create features from datetime ("day", "dow", "dom", "hour", "minute")
      • create features for lagged series (20 periods)
      • use the same geo_cluster labels, since the set of geohashes is the same in the training and test datasets
      • make sure the shape of the test set is (x, 28)
      • load the pre-trained model ("lgb.pkl")
      • make predictions
      • evaluate results
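A rough end-to-end sketch of those steps. Only the file name "lgb.pkl" and the (x, 28) shape come from this repo; `preprocess_and_engineer` is a hypothetical helper standing in for the resampling and feature steps sketched earlier.

```python
import pickle

# Hypothetical helper: apply the same resampling, datetime features,
# 20 lagged series and geo_cluster labels used at training time.
test = preprocess_and_engineer(raw_test)

X_test = test[FEATURES]
assert X_test.shape[1] == 28  # the pre-trained model expects 28 feature columns

with open("lgb.pkl", "rb") as f:  # pre-trained LightGBM model from the repo
    model = pickle.load(f)

test["pred"] = model.predict(X_test)
```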

Credits

Since the challenge was announced, several starter kits have been made available online to help participants overcome the cold-start problem. I have particularly benefited from the how-to guide by Husein Zolkepli and the Kaggle kernel of Mahadir Ahmad, especially on decoding geohash6, and I would like to record my appreciation here. Thank you.



Visualization:

[Images: clustering, lgbm_f (LightGBM feature importance), xgb_f (XGBoost feature importance), comparison]
