Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers
Project Walk Through: YouTube
This project explores basic text processing using the toxic comment text classification dataset. The primary objective was to convert the comment_text column into a sparse vector representation that a classification algorithm in the Spark ML library could consume. (details in Dir)
The dataset, originally released by Jigsaw and Google, is formatted as a CSV file, with each row representing a unique comment. Its columns include id, comment_text, and several binary labels (0 or 1) indicating whether the comment falls into the corresponding toxicity category.
The chosen model for this task was Logistic Regression, which was implemented using the PySpark 'LogisticRegression' class.
This project involves a Python script that uses Logistic Regression to identify the most significant risk factors associated with heart disease and to predict overall risk levels, using the Framingham Heart dataset. (details in Dir)
The Framingham Heart Dataset originates from the Framingham Heart Study, which began with an initial cohort of 5209 subjects. The data includes demographic and clinical information about each subject, together with an outcome indicating the presence or absence of coronary heart disease.
The chosen model for this task was logistic regression, implemented using the PySpark 'LogisticRegression' class.
This project showcases an implementation of Logistic Regression, utilizing Apache Spark ML/MLlib on UCI's Census Income Data. The goal was to predict income brackets, either >50K or ≤50K, based on 14 attributes, a mix of categorical and numerical features, some with missing values, across 48,842 instances. (details in Dir)
Project 4: Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers
This project is a comprehensive demonstration of practical machine learning applications in large-scale data environments using Apache Spark ML/MLlib. It involved porting existing Python code to Apache Spark, applying Logistic Regression in Spark ML/MLlib, and concluding with an exploration of additional Spark ML algorithms, specifically the Random Forest and Decision Tree classifiers. (details in Dir)