This project focuses on performing sentiment analysis on Twitter data using machine learning and deep learning models. The sentiment classes in the dataset are:
- Positive (1)
- Negative (2)
- Neutral (0)
The dataset consists of 3,534 rows with multiple columns, including:
- Text: the tweet content
- Sentiment: target labels (0: Neutral, 1: Positive, 2: Negative)

Other columns, such as user details and geographic information, are part of the dataset but are not used for model building.
Preprocessing included removing hashtags, URLs, mentions, and special characters from the tweets, and lowercasing all text for consistency.
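The cleaning steps above can be sketched with regular expressions (a minimal version; the exact patterns used in the notebook may differ, and this one drops the whole hashtag rather than just the `#`):

```python
import re

def clean_tweet(text):
    """Strip URLs, mentions, hashtags, and special characters, then lowercase."""
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)              # mentions
    text = re.sub(r"#\w+", "", text)              # hashtags (tag and word)
    text = re.sub(r"[^A-Za-z\s]", "", text)       # special characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Loved the crew! @airline #travel https://t.co/abc"))
# → loved the crew
```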
TF-IDF vectorization was applied to convert the text data into numerical form for model training.
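With scikit-learn, the TF-IDF step looks like this (the tweets, `max_features`, and `ngram_range` below are illustrative, not the project's exact settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned tweets
tweets = ["great flight today", "worst service ever", "the flight was on time"]

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(tweets)
print(X.shape)  # one row per tweet, one column per unigram/bigram
```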
Multiple machine learning and deep learning models were trained to predict sentiment, including:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- AdaBoost
- Transformers (BERT)
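For the classical models, TF-IDF features feed directly into a scikit-learn classifier. A minimal sketch with Logistic Regression on toy data (the other models slot in the same way; the texts and labels below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the cleaned tweets and their labels
texts = ["love this airline", "terrible delay again", "flight was on time",
         "great crew today", "worst boarding ever", "seats were standard"]
labels = [1, 2, 0, 1, 2, 0]  # 0: Neutral, 1: Positive, 2: Negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["the crew was great"]))
```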
To improve model performance, various data augmentation techniques such as back translation and random word insertion were applied to the training set.
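Back translation relies on an external translation model, but random word insertion can be sketched directly (a minimal version: it re-inserts words drawn uniformly from the tweet itself at uniformly chosen positions, which may differ from the notebook's implementation):

```python
import random

def random_insertion(text, n=1, seed=None):
    """Augment a tweet by re-inserting n randomly chosen words at random positions."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n):
        # pick an existing word and a slot (0..len inclusive) to duplicate it into
        words.insert(rng.randrange(len(words) + 1), rng.choice(words))
    return " ".join(words)

print(random_insertion("the flight was delayed again", n=2, seed=42))
```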
Model performance was evaluated using accuracy, confusion matrix, precision, recall, F1-score, and AUC.
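These metrics are all available in scikit-learn; a small example with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels (0: Neutral, 1: Positive, 2: Negative)
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["Neutral", "Positive", "Negative"]))
# AUC in the multiclass case needs predicted probabilities, e.g.
# roc_auc_score(y_true, y_proba, multi_class="ovr")
```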
Despite rigorous preprocessing, the models initially achieved accuracies in the range of 62-64%; applying back translation and hyperparameter tuning yielded further improvements.
Tools and libraries used:
- Python
- Scikit-learn for machine learning models
- Transformers (Hugging Face) for BERT-based models
- TF-IDF vectorization for text data
- Data augmentation to enhance training data
- GridSearchCV for hyperparameter tuning
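The GridSearchCV step can be sketched as follows (toy data and an illustrative parameter grid; the project's actual grid and model may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for the cleaned tweets and their labels
texts = ["love this airline", "terrible delay again", "flight was on time",
         "great crew today", "worst boarding ever", "seats were standard"]
labels = [1, 2, 0, 1, 2, 0]  # 0: Neutral, 1: Positive, 2: Negative

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search a small regularization grid with 2-fold cross-validation
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
```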
Install the required packages:

```
pip install -r requirements.txt
```
Run the notebook to preprocess data, train models, and evaluate performance.