This repository contains a machine learning model for detecting hate speech on Twitter. The dataset consists of tweets that are labeled as hate speech, offensive language, or neither.
The purpose of this model is to help identify and prevent hateful or discriminatory language on social media platforms. It can be used as a tool for moderators or analysts to track and flag harmful content. Specifically, it separates tweets that contain hate speech from those that do not.
The project uses the following tools and libraries:
- Python 3
- Jupyter Notebook
- Pandas
- Scikit-learn
- NLTK
- Seaborn
- TF-IDF
To run the analysis:
- Clone the repository to your local machine.
- Install Python 3, Jupyter Notebook, Pandas, Scikit-learn, NLTK, Seaborn, and Matplotlib.
- Open the notebook 'hate_speech_analysis.ipynb' in Jupyter Notebook.
- Run each cell in the notebook to load the data, preprocess the text, train and test the model, and analyze the results.
The data comes in two parts:
- The training dataset contains 31,962 rows and 3 columns: ['id', 'label', 'tweet'].
- The test dataset contains 17,197 rows and 2 columns: ['id', 'tweet'].
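
As a rough sketch of how the training data could be loaded and sanity-checked with Pandas, the snippet below reads a tiny in-memory sample in place of the real file. The sample rows, the 0/1 label encoding, and the use of a CSV file are assumptions for illustration; the actual file name is not stated here, so substitute your own path in `pd.read_csv`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the training file (hypothetical rows).
# Replace `sample_csv` with the path to the real training CSV.
sample_csv = io.StringIO(
    "id,label,tweet\n"
    "1,0,what a lovely day\n"
    "2,1,an example of a hateful tweet\n"
    "3,0,@user thanks for the follow\n"
)

train = pd.read_csv(sample_csv)

# Confirm the expected columns are present before going further.
expected_columns = ["id", "label", "tweet"]
assert list(train.columns) == expected_columns

# Row/column counts and the class balance are worth checking early,
# because hate speech datasets are often heavily imbalanced.
n_rows, n_cols = train.shape
label_counts = train["label"].value_counts().to_dict()
print(n_rows, n_cols, label_counts)
```

On the real training data, `train.shape` should report 31,962 rows and 3 columns.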
The text in the tweets is preprocessed using NLTK. The preprocessing steps include removing @mentions, numbers, Greek characters, "hmm" and its variants, slang words, stop words, and punctuation, and extracting the words attached to # (hashtags).
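
The cleaning steps above can be sketched roughly as follows. This is an illustrative approximation, not the notebook's actual code: the function names `clean_tweet` and `extract_hashtags` are hypothetical, and the small stop-word and slang sets are placeholders (NLTK's English stop-word list could be substituted for `STOP_WORDS`).

```python
import re
import string

# Small illustrative word lists (assumptions). NLTK's stopwords corpus
# could be used in place of STOP_WORDS.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "you", "so"}
SLANG_WORDS = {"lol", "omg", "u", "ur"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def extract_hashtags(tweet):
    """Return the words attached to '#' symbols, lowercased."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]


def clean_tweet(tweet):
    """Apply the removal steps in order and return the cleaned text."""
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)              # drop @mentions
    text = re.sub(r"\d+", " ", text)               # drop numbers
    text = re.sub(r"[\u0370-\u03ff]", " ", text)   # drop Greek characters
    text = re.sub(r"\bhm+\b", " ", text)           # drop "hm", "hmm", "hmmm", ...
    text = text.translate(PUNCT_TABLE)             # drop punctuation (including '#')
    tokens = [
        token for token in text.split()
        if token not in STOP_WORDS and token not in SLANG_WORDS
    ]
    return " ".join(tokens)


example = "@user Hmmm, I love 2 #Sunny days lol!!! αβγ"
cleaned = clean_tweet(example)
hashtags = extract_hashtags(example)
print(cleaned)
print(hashtags)
```

In this sketch the hashtag word stays in the cleaned text because only the `#` symbol is stripped; whether the notebook keeps or removes it is not specified here.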
Six models were trained on the given training dataset:
- LogisticRegression
- Naive Bayes
- Random Forest
- SGD classifier
- SVM classifier
- GradientBoosting
Logistic Regression gave the best accuracy among these models.
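
A comparison like the one above could be set up along these lines with Scikit-learn. This is a minimal sketch under several assumptions: the toy corpus stands in for the cleaned tweets, `MultinomialNB` and `LinearSVC` are picked as the Naive Bayes and SVM variants (the exact variants are not stated here), and all hyperparameters are arbitrary. Scores on such a tiny sample are not meaningful; only the structure matters.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (1 = hateful, 0 = not); replace with the cleaned tweets.
texts = [
    "love this sunny day", "great game last night", "thanks for the follow",
    "happy birthday friend", "enjoying coffee with family", "beautiful morning walk",
    "hate those people so much", "they are disgusting", "go back where you came from",
    "those people are awful", "hate them all forever", "disgusting people ruin everything",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Hold out part of the labeled data so every model is scored on unseen text.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SGD classifier": SGDClassifier(random_state=0),
    "SVM classifier": LinearSVC(),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

scores = {}
for name, model in models.items():
    # Each pipeline fits its own TF-IDF vocabulary on the training split only.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = f1_score(y_val, pipeline.predict(X_val), zero_division=0)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: F1 = {score:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the validation tweets out of the TF-IDF vocabulary, which avoids leaking information into the comparison.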
The final model used in this analysis is Logistic Regression. It is trained on the preprocessed text and the labels in the dataset.
The performance of the model is evaluated using precision, recall, and F1-score. The results are displayed in a confusion matrix and a classification report.
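
For illustration, these metrics can be computed with Scikit-learn as sketched below. The `y_true` and `y_pred` lists are made-up values, not results from this project, and the class names passed to `target_names` are assumptions.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical labels and predictions (1 = hate speech, 0 = not).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

# Precision, recall, and F1 for the positive (hate speech) class only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)

print(cm)
print(classification_report(y_true, y_pred, target_names=["not hate", "hate"]))
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here 3 of the 4 predicted positives are correct and 3 of the 4 true positives are found, so precision, recall, and F1 are all 0.75. The matrix `cm` can also be passed to a plotting function such as Seaborn's `heatmap` for a visual summary.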
The Logistic Regression model achieves an F1-score of 0.77 for identifying hate speech. The file final_results.csv contains the final results for the given test dataset.
This machine learning model can be used to automatically detect hate speech and offensive language on Twitter. It can be a valuable tool for moderators or analysts to track and flag harmful content such as sexist and racist tweets. However, the model is not perfect and may have limitations when dealing with different languages or dialects.