Title: Predicting Self-Rated Health in Seniors: Analysis of the 2015 BRFSS Survey With Machine Learning Techniques
Author: Madeleine May Kearns, BSc, MA
Data source: https://www.cdc.gov/brfss/annual_data/annual_2015.html
Background: Research has consistently shown that socioeconomic status, physical activity, smoking, diet, and alcohol consumption have a significant effect on health in older individuals. However, most of the current literature has evaluated the effect of individual attributes in isolation. Only a few studies have assessed the cumulative effects of demographic characteristics and lifestyle behaviours on self-reported health in seniors.
Objective: This project uses the 2015 BRFSS survey to answer the following research questions:
- what demographic characteristics are associated with good health in individuals aged 60 years or older,
- what lifestyle behaviours are associated with good health in individuals aged 60 years or older, and
- what is the cumulative predictive power of demographic and lifestyle characteristics on good health in individuals aged 60 years or older?
Methodology overview: Cross-sectional analyses will be conducted to compare demographic characteristics and lifestyle behaviours between seniors who reported good and poor general health in the survey. For predictive modelling, classification approaches will be used to assess the cumulative predictive power of demographic and lifestyle characteristics on health in individuals aged 60 years or older. The hyperparameters of the best-performing model will then be tuned, and feature importance will be assessed to identify the variables that most strongly predict health in older individuals.
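As an illustration of how candidate classifiers might be compared, here is a minimal sketch in plain Python (the project's own code is in RMarkdown). The README does not state which evaluation metrics are used, so accuracy, sensitivity, and specificity are assumptions, as are the "good"/"poor" label values, the function names, and the model names in the usage note.

```python
def confusion_counts(actual, predicted, positive="good"):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(actual, predicted, positive="good"):
    """Return accuracy, sensitivity, and specificity (assumed metrics)."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def best_model(actual, predictions_by_model, metric="accuracy"):
    """Pick the model whose predictions score highest on the chosen metric."""
    scores = {
        name: classification_metrics(actual, preds)[metric]
        for name, preds in predictions_by_model.items()
    }
    return max(scores, key=scores.get), scores
```

For example, `best_model(y_test, {"model_a": preds_a, "model_b": preds_b})` returns the name of the higher-scoring model together with all scores; the selected model would then be the one taken forward for hyperparameter tuning.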
Specific methodological steps:
- Variable and respondent selection (age 60 years or older; 43 variables)
- Removal of respondents with more than 15% missing data
- Outlier handling (winsorizing)
- Exploratory analyses (univariate, bivariate with general health)
- Normalization of the dataset, imputation, and test-train split
- Predictive modelling (classification)
- Feature importance and hyperparameter optimization of the best-performing algorithm
- Multivariate analyses of the ten most important features
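Several of the preprocessing steps above can be sketched in a few lines. The following is a minimal, standard-library Python illustration and not the project's RMarkdown code. Only the 15% missing-data threshold comes from this README; the 5th/95th winsorizing percentiles, the 30% test share, the use of `None` for missing values, and all function names are assumptions. Exploratory analyses and imputation are omitted.

```python
import random

MISSING = None  # assumption: missing answers are encoded as None


def drop_sparse_respondents(rows, max_missing=0.15):
    """Keep respondents whose share of missing values is at most max_missing."""
    kept = []
    for row in rows:
        share = sum(v is MISSING for v in row) / len(row)
        if share <= max_missing:
            kept.append(row)
    return kept


def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values outside the (assumed) percentile bounds to those bounds.

    Percentiles are taken by a simple nearest-rank approximation.
    """
    observed = sorted(v for v in values if v is not MISSING)
    lo = observed[round(lower * (len(observed) - 1))]
    hi = observed[round(upper * (len(observed) - 1))]
    return [v if v is MISSING else min(max(v, lo), hi) for v in values]


def min_max_normalize(values):
    """Rescale observed values to the range [0, 1]; missing values pass through."""
    observed = [v for v in values if v is not MISSING]
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [v if v is MISSING else (v - lo) / span for v in values]


def train_test_split(rows, test_share=0.3, seed=42):
    """Shuffle reproducibly, then split into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]
```

Winsorizing caps extreme values instead of dropping the respondents who reported them, so the sample size is unchanged by that step.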
To use this project: Download the XPT file from the CDC link above. The code is provided in RMarkdown format and can be downloaded and run in your local environment. Begin with the data_cleaning file, then move on to the initial_results file.
Current Maintainer: Madeleine May Kearns