Lung Cancer Prediction

Tina Lin • 12/2018

Data Source

The dataset I use comes from the National Lung Screening Trial (NLST) and has 138 columns and 1,659 rows, one row per patient. A “class” column indicates whether or not the patient has lung cancer. The other columns are patient features, such as “age”, “height”, and “education”. The dataset was provided by a professor at the State University of Arkansas; I am a remote volunteer for his lung cancer research project.

Motivation

Lung cancer causes more deaths than any other cancer. The odds are 1 in 13 for men and 1 in 16 for women. Therefore, I want to build a model that finds the best features for lung cancer prediction. It can aid doctors in the decision-making process and improve disease identification.

Methods

Feature Selection

Random Forest, using the embedded packages in H2O
Weight of Evidence and Information Value (https://github.com/h2oai/h2o-meetups/blob/master/2017_11_29_Feature_Engineering/Feature%20Engineering.pdf) (better performance)
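
To make the second method concrete, here is a minimal pure-Python sketch of how Weight of Evidence (WoE) and Information Value (IV) could be computed for one categorical feature against a binary label. It is not the project's actual code: the function name `woe_iv`, the `smoothing` parameter, and the toy smoking data are all assumptions for illustration. It uses the convention WoE = ln(share of events / share of non-events); some references use the inverse ratio, which flips the sign of WoE but leaves IV unchanged.

```python
import math
from collections import defaultdict


def woe_iv(values, labels, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical feature.

    labels are binary: 1 = event (e.g. lung cancer), 0 = non-event.
    smoothing is a hypothetical additive constant that avoids log(0)
    when a category contains only events or only non-events.
    """
    events = defaultdict(int)
    non_events = defaultdict(int)
    for value, label in zip(values, labels):
        if label == 1:
            events[value] += 1
        else:
            non_events[value] += 1

    categories = set(events) | set(non_events)
    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    woe = {}
    iv = 0.0
    for category in categories:
        pct_event = (events[category] + smoothing) / total_events
        pct_non_event = (non_events[category] + smoothing) / total_non_events
        woe[category] = math.log(pct_event / pct_non_event)
        # Each category contributes (difference in shares) * WoE to the IV.
        iv += (pct_event - pct_non_event) * woe[category]
    return woe, iv


# Toy data (made up): smoking status vs. the "class" label.
smoking = ["smoker"] * 4 + ["non_smoker"] * 4
has_cancer = [1, 1, 1, 0, 0, 0, 0, 1]

woe, iv = woe_iv(smoking, has_cancer)
for category, value in sorted(woe.items()):
    print(f"WoE[{category}] = {value:.3f}")
print(f"IV = {iv:.3f}")
```

Features can then be ranked by IV, keeping those with the highest values. Numeric columns such as “age” would need to be binned into categories first.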

Models

DecisionTree Classifier
GradientBoosting Classifier
LogisticRegression Classifier
XGBoost Classifier
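
The four models above could be compared along the following lines. This is a hedged sketch, not the project's actual pipeline: it assumes scikit-learn (and, optionally, the `xgboost` package) as the implementation, uses synthetic data from `make_classification` in place of the NLST dataset, and picks 5-fold cross-validated ROC AUC as the metric. All hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and the "class" column.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# XGBoost is a separate package; include it only if it is installed.
try:
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass

# Mean ROC AUC over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```

Cross-validation is used here because, with only 1,659 patients, a single train/test split could give a noisy estimate of each model's performance.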
