Here, I aim to analyze the Road Safety and Traffic Demographics dataset (UK), which contains accidents reported by the police between 2004 and 2017.
- Identify factors responsible for most of the reported accidents.
- Build a machine learning model that is capable of accurately predicting the severity of an accident.
- Provide recommendations to the Department for Transport (UK Government) to improve road safety policies and prevent recurrences of severe accidents where possible.
- scikit-learn, NumPy, pandas, imbalanced-learn (imblearn), seaborn, Matplotlib
The World Health Organization (WHO) reported that more than 1.25 million people die each year, and 50 million are injured, as a result of road accidents worldwide. Road accidents are the 10th leading cause of death globally. On current trends, road traffic accidents are projected to become the 7th leading cause of death by 2030, making them a major public health concern. Between 2005 and 2016, roughly 2 million road accidents were reported in the United Kingdom (UK) alone, of which 16,000 were fatal.
As a big data project, I wanted to explore the traffic demographics data in greater detail using machine learning!
The UK government amassed traffic data from 2004 to 2017, recording over 2 million accidents in the process and making this one of the most comprehensive traffic data sets out there. It's a huge picture of a country undergoing change.
Note that all accident data in this dataset comes from police reports, so minor incidents are not included.
- For steps undertaken to pre-process and clean the data, please view the "Data Cleansing & Descriptive Analysis_UK Traffic Demographics.ipynb" file
- Tools used include Python, Tableau, and Microsoft Power BI
As seen above, the data is highly imbalanced.
- For detailed steps undertaken to deal with the imbalanced data, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file.
This article provides some great tips on choosing the correct performance metrics when evaluating a model trained on an imbalanced dataset.
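To illustrate why plain accuracy can mislead on imbalanced severity labels, the sketch below compares accuracy with per-class recall and macro-averaged recall for a baseline that always predicts the majority class. The class names and counts are made-up illustrative values, not figures from this dataset, and `accuracy` and `per_class_recall` are hypothetical helpers written for this example rather than code from the notebooks.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / actual members of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative, made-up label counts (NOT taken from the dataset).
y_true = ["slight"] * 85 + ["serious"] * 13 + ["fatal"] * 2
# A trivial baseline that always predicts the majority class.
y_pred = ["slight"] * len(y_true)

recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(f"accuracy     = {accuracy(y_true, y_pred):.2f}")
print(f"recalls      = {recalls}")
print(f"macro recall = {macro_recall:.2f}")
```

The baseline scores high on accuracy while never detecting a single minority-class case, which is why per-class or macro-averaged metrics tend to be more informative here.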
This article describes several strategies for handling a severely imbalanced dataset. Methods include:
- Resampling strategies (undersampling: Tomek Links, Cluster Centroids; oversampling: SMOTE)
- Using decision-tree-based models
- Using cost-sensitive training (penalized algorithms)
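As a minimal sketch of the cost-sensitive idea in the last bullet, the snippet below computes inverse-frequency class weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The label counts are made up for illustration, and `balanced_class_weights` is a hypothetical helper, not code from the modelling notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), so rare classes
    receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Illustrative, made-up label counts (NOT taken from the dataset).
y = ["slight"] * 90 + ["serious"] * 8 + ["fatal"] * 2

weights = balanced_class_weights(y)
for label, weight in weights.items():
    print(f"{label:>7}: {weight:.3f}")
```

A dictionary like this can be passed as the `class_weight` argument of scikit-learn estimators that accept it, such as `DecisionTreeClassifier`, so that errors on rare classes are penalized more heavily during training.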
It can be seen above that the trend appears to be increasing over the years. In addition, the spike between 2008 and 2009 was caused by an enhancement to the UK reporting system introduced in 2009, which required the police to report all accidents, including minor ones, so that the counts matched those from hospitals, insurance claims, etc.
Most accidents took place in major cities: Birmingham, London, Leeds, Newcastle
Most accidents take place on a Friday
Most accidents take place as a result of overtaking
- For steps undertaken to carry out some predictive modeling and hyper-parameter tuning, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file.
- Decrease emergency response times during afternoon rush hours (15:00-19:00), especially on Fridays.
- Allocate resources to investigate high-density traffic points and identify new infrastructure needs to divert traffic from dual carriageways.
- Explore conditions of vehicles and casualties such as vehicle type, age of vehicles registered, pedestrian movements, etc. for policy makers.
- Adopt comprehensive distracted driving laws that increase penalties for drivers who commit traffic violations like aggressive overtaking.
The license for this dataset is the Open Government Licence, which is used by all data on data.gov.uk. The raw datasets are available from the UK Department for Transport website.