Skip to content

This repository contains the NLP project I completed during my Data Science Internship at Data Glacier Inc.

Notifications You must be signed in to change notification settings

VioletGo319/Hate_speech_detection

Repository files navigation

Sentiment Analysis on Twitter

Task: Classification

Highlights

  • Natural Language Processing (NLP)
  • BERT
  • Machine Learning
  • Deep Learning
  • Exploratory Data Analysis (EDA)

Data Source

The dataset is sourced from Kaggle and includes tweets from Twitter, labeled as positive (normal speech) or negative (hate speech).

  • Size: 50k rows x 2 columns (2/3 for training, 1/3 for testing)
  • Description: Each row represents a tweet and its corresponding classification as positive or negative.

Approaches of Analysis

Task

Using BERT encoder to vectorize the tweets, followed by classification using logistic regression, random forest, neural network, and BERT models to distinguish between normal speech and hate speech, aiming to improve the social media environment.

Data Preprocessing

  1. Convert to lowercase
  2. Remove numbers
  3. Remove punctuation
  4. Remove whitespaces
  5. Remove non-ASCII characters
  6. Remove HTML characters
  7. Tokenization
  8. Remove stopwords
  9. Stemming
  10. Rejoin tokens

Visualization

  1. Top 25 Words

Sample Image

  1. Tweet Length Distribution

Sample Image

  1. Word Clouds (General)

Sample Image

  1. Word Clouds (Hate Speech)

Sample Image

  1. Words of Hate Topics

Sample Image

Models Used

Using BERT encoder to vectorize the tweets, followed by classification using logistic regression, random forest, neural network, and BERT models to distinguish between normal speech and hate speech, aiming to improve the social media environment.

Evaluation

Plot ROC-AUC curve to evaluate the model performance.
Sample Image

About

This repository contains the NLP project I completed during my Data Science Internship at Data Glacier Inc.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published