ml-travel-insurance

A fun project that models propensity to claim as a classification problem. It also serves as a template for a more robust personal ML development framework, and as an excuse to try out some new tools:

Docker workspaces in VS Code
neptune.ai
Buildkite

Data is originally from Kaggle. The features are simple and the claim response is straightforward, yet many enthusiasts use sales premium and commission dollars as features to predict claim lodgement.

Although this is a contrived example, doing so leads to somewhat unrealistic, circular logic: commissions are based on premiums, which are in turn based on the risk of a particular profile. Using premiums to predict claims, which are then used again to predict premiums, isn't a very reliable strategy in the real world.

ML architecture

The local development architecture abstracts the whole pipeline into 3 main components. It currently uses GitHub Actions for CI (including testing and linting) and neptune.ai for experiment tracking. Buildkite is still to be implemented for CD.

(back to top)

Data quality

Data preprocessing was required for some of the erroneous entries. Note that professional judgement was required for some of the decisions below.

removed non-positive premiums
removed ages > 100
removed durations > 547
removed gender as a feature due to a significant proportion of missing values
~8% of data points were removed during cleaning
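
The cleaning rules above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual preprocessing code. The column names (`premium`, `age`, `duration`, `gender`) are hypothetical and may not match the headers in the original .csv file.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical column names; the real CSV headers may differ.
PREMIUM, AGE, DURATION, GENDER = "premium", "age", "duration", "gender"

MAX_AGE = 100        # ages above this are treated as erroneous
MAX_DURATION = 547   # durations above this are treated as erroneous


def clean_rows(rows: List[Row]) -> List[Row]:
    """Drop rows that fail the filters above and remove the gender column."""
    cleaned = []
    for row in rows:
        if row[PREMIUM] <= 0:             # non-positive premium
            continue
        if row[AGE] > MAX_AGE:            # implausible age
            continue
        if row[DURATION] > MAX_DURATION:  # implausible duration
            continue
        # Keep every column except gender.
        cleaned.append({k: v for k, v in row.items() if k != GENDER})
    return cleaned


# Made-up rows purely for illustration.
sample = [
    {"premium": 50.0, "age": 36, "duration": 10, "gender": "F"},
    {"premium": -5.0, "age": 40, "duration": 7, "gender": ""},
    {"premium": 30.0, "age": 118, "duration": 3, "gender": "M"},
    {"premium": 80.0, "age": 29, "duration": 4000, "gender": ""},
]
cleaned = clean_rows(sample)
print(len(cleaned), cleaned)
```

Keeping the thresholds as named constants makes the judgement calls easy to find and revisit.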

(back to top)

Data analysis

58525 data points after cleaning, 914 claims and ~1.5% claim frequency
severely imbalanced dataset requires over/under sampling techniques. SMOTE used here
Age seems to have a slight downward effect on claim frequency
Strange peak of exposure at 36 years
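
For readers unfamiliar with SMOTE, mentioned in the list above, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea for numeric features only. The function name `smote_sketch` is hypothetical, and this is not the implementation used in the project, which may well rely on a library such as imbalanced-learn.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote_sketch(minority: List[Point], n_new: int, k: int = 5,
                 seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate a random fraction of the way towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Oversampling of this kind is normally applied to the training split only, so that the evaluation data still reflects the real claim frequency.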

duration has a few unrealistic values upwards of 10 years; values above 547 have been removed, given the drop-off in exposure past that point
duration 365 has a high claim frequency compared to non-annual policies
destination has high cardinality; the top 20 countries capture ~90% of all data points and claims
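
One common way to handle a high-cardinality feature like destination is to keep the most frequent categories and fold the rest into a single bucket. The sketch below illustrates that approach and how the coverage of the top categories can be measured. Whether the project actually groups destinations this way is an assumption, and the names `group_rare`, `coverage` and the `OTHER` label are hypothetical.

```python
from collections import Counter
from typing import List


def coverage(values: List[str], top_n: int) -> float:
    """Share of all rows that fall in the top_n most frequent categories."""
    counts = Counter(values)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(values)


def group_rare(values: List[str], top_n: int = 20,
               other: str = "OTHER") -> List[str]:
    """Keep the top_n most frequent categories and relabel the rest."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [value if value in keep else other for value in values]


# Made-up destination codes purely for illustration.
destinations = ["SG"] * 5 + ["MY"] * 3 + ["TH"] + ["VN"]
print(coverage(destinations, top_n=2))
print(group_rare(destinations, top_n=2))
```

The top categories should be chosen from the training data only and then reused unchanged when encoding new data.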

(back to top)

Model results

As expected, the model predicted 'No' for every case when no over/under sampling was implemented. This resulted in a higher raw accuracy but 0 precision, with a weighted accuracy of ~50%.
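
The arithmetic behind this behaviour can be checked with the counts reported in the data analysis section (58525 data points, 914 claims). The sketch below scores a classifier that always predicts 'No' on the full cleaned dataset. It assumes that "weighted accuracy" refers to balanced accuracy (the mean of per-class recall), and it does not reproduce the project's actual evaluation split.

```python
TOTAL = 58525   # data points after cleaning
CLAIMS = 914    # rows with a claim ('Yes')

# Confusion matrix for a model that predicts 'No' for every row.
tp = 0                 # claims predicted as claims
fp = 0                 # non-claims predicted as claims
fn = CLAIMS            # claims predicted as non-claims
tn = TOTAL - CLAIMS    # non-claims predicted as non-claims

accuracy = (tp + tn) / TOTAL

recall_yes = tp / (tp + fn)
recall_no = tn / (tn + fp)
# Assumption: "weighted accuracy" means balanced accuracy.
balanced_accuracy = (recall_yes + recall_no) / 2

# Precision is undefined with no positive predictions; report it as 0.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"accuracy={accuracy:.3f}")                    # accuracy=0.984
print(f"balanced_accuracy={balanced_accuracy:.3f}")  # balanced_accuracy=0.500
print(f"precision={precision:.3f}")                  # precision=0.000
```

This is why raw accuracy is a poor headline metric for a dataset with a ~1.5% positive rate.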

(back to top)

Local development

git clone https://github.com/jtsw1990/ml-travel-insurance.git
pip install -r requirements.txt after setting up your environment or docker workspace

(back to top)

Folder structure

assets contains images and other documentation related visuals
data contains the original .csv data file
interactive contains jupyter notebooks used for data analyses and charting
models contains outputs from training runs with prefix YYYYMMDDHMS corresponding to the neptune experiment ID
src contains the main pipeline scripts
- src/modules contains utility functions
- src/tests contains unit tests

(back to top)

To do

explore and implement data quality tracking in neptune.ai
integrate buildkite pipeline for CD

(back to top)
