This notebook introduces some foundational machine learning and data science concepts by exploring the problem of heart disease classification. K-Nearest Neighbors, Logistic Regression and Random Forest models were trained and their results compared. Of the three, the Logistic Regression model achieved the best accuracy on the heart disease dataset.

The notebook covers the following topics:
- Exploratory data analysis (EDA) - the process of going through a dataset and finding out more about it.
- Model training - building one or more models that learn to predict a target variable from the other variables.
- Model evaluation - evaluating a model's predictions using problem-specific evaluation metrics.
- Model comparison - comparing several different models to find the best one.
- Model fine-tuning - once we've found a good model, how can we improve it?
- Feature importance - since we're predicting the presence of heart disease, are there some things which are more important for prediction?
- Cross-validation - if we do build a good model, can we be sure it will work on unseen data?
- Reporting what we've found - if we had to present our work, what would we show someone?
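
As a rough illustration of how the training, evaluation, comparison, cross-validation and feature importance steps fit together, here is a minimal sketch using scikit-learn. It assumes a synthetic dataset from `make_classification` as a stand-in for the heart disease data, so the feature count, sample size, hyperparameters and resulting scores are placeholders and say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic binary-classification data standing in for the
# heart disease dataset (300 rows, 13 features are illustrative values).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows so each model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three model families compared in this notebook, with mostly default settings.
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

test_scores = {}
cv_scores = {}
for name, model in models.items():
    # Model training and evaluation on a single train/test split.
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
    # Cross-validation: mean accuracy over 5 folds of the full dataset.
    cv_scores[name] = cross_val_score(
        model, X, y, cv=5, scoring="accuracy"
    ).mean()

# Model comparison: print both scores side by side.
for name in models:
    print(
        f"{name}: test accuracy={test_scores[name]:.3f}, "
        f"5-fold CV accuracy={cv_scores[name]:.3f}"
    )

# Feature importance: one value per input column from the fitted Random Forest.
importances = models["Random Forest"].feature_importances_
```

Comparing the single-split score with the cross-validated score gives a first hint of how stable each model's accuracy is; a large gap between the two suggests the single split was lucky or unlucky.
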
To work through these topics, the following libraries were used for data analysis:
- pandas
- Matplotlib
- NumPy
- seaborn

Scikit-Learn was used for machine learning and modelling tasks.