Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers
Project Walk Through: YouTube
This project explores basic text processing using the toxic comment text classification dataset. The primary objective was to convert the comment_text column into a sparse vector representation that a classification algorithm in the Spark ML library could consume. (details in Dir)
The dataset, originally released by Jigsaw and Google, is formatted as a CSV file, with each row representing a unique comment. Its columns include id, comment_text, and several binary labels (0 or 1) indicating whether the comment falls into the corresponding toxicity category.
The chosen model for this task was Logistic Regression, which was implemented using the PySpark 'LogisticRegression' class.
This project involves a Python script that uses Logistic Regression to identify the most significant risk factors associated with heart disease and to predict overall risk levels, using the Framingham Heart dataset. (details in Dir)
The Framingham Heart Dataset originates from the Framingham Heart Study, which began with an initial cohort of 5209 subjects. The data includes demographic and clinical information about each subject, together with an outcome indicating the presence or absence of coronary heart disease.
The chosen model for this task was logistic regression, implemented using the PySpark 'LogisticRegression' class.
This project showcases an implementation of Logistic Regression, utilizing Apache Spark ML/MLlib on UCI's Census Income Data. The goal was to predict income brackets, either >50K or ≤50K, based on 14 attributes, a mix of categorical and numerical features, some with missing values, across 48,842 instances. (details in Dir)
Project 4: Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers
This project is a comprehensive demonstration of practical machine learning applications in large-scale data environments using Apache Spark ML/MLlib. It involved porting existing Python code to Apache Spark, applying Logistic Regression in Spark ML/MLlib, and concluding with an exploration of additional Spark ML algorithms, specifically the Random Forest and Decision Tree classifiers. (details in Dir)