CS 89.21 Data Mining and Knowledge Discovery
Final Project Winter 2021
Yakoob Khan, Luca C. L. Lit, Louis Murerwa, Aadil Islam
This repository contains code used to scrap tweets related to anti-asian rhetoric in March 2021. We used the Toxic Comments Classification Challenge data provided by Jigsaw/Conversational AI team on Kaggle to train classification models to filter toxic comments. Finally, we explore a simple rules-based word substitution technique to neutralize the offensive language in comments that our best classification model (BERT) predicts as toxic.
├── Tweet-Scraping # Code to scrape posts using Tweepy API
├── checkpoints # Contain intermediate predictions
├── bert_model_cased_not_lowered
├── data # Training data
├── deep_learning_based_models # BERT Classification
├── style-transfer # Text Style Transfer
├── README.md
├── machinelearningmodles # Sci-kit Learn Classical ML Models
└── requirements.txt # dependencies for reproducibility
- Sci-kit Learn
- Pytorch
- Hugging Face Transformer's Library
We thank Professor Soroush Vosoughi and the Teaching Assistant team for the wonderful course on Data Mining and Knowledge Discovery (Winter 2021).