Sentiment classification with 2.5 million tweets.

yehchunhung/tweet-sentiment

 
 

Tweet Sentiment Classification

This is the second project of the EPFL Machine Learning course, Fall 2019. We are given a dataset of 2.5 million tweets, half labeled with positive sentiment and half with negative sentiment. Our task is to predict the sentiment of 10,000 unlabeled tweets in the testing set.

Dependencies

The project is implemented in Python 3. You will need the following dependencies installed:

Files Description

  • tfidf_word2vec/tf_idf.ipynb: Training and testing procedures for simple ML models using a TF-IDF matrix.
  • tfidf_word2vec/word2vec.ipynb: Training and testing procedures for simple ML models using a Word2Vec matrix.
  • tfidf_word2vec/helpers_simple_ml.py: Helper functions used in tf_idf.ipynb and word2vec.ipynb.
  • bagging.ipynb: Simple voting (can be used after training and testing in bert_based.ipynb).
  • bert_based.ipynb: Training and testing procedures for BERT-based models.
  • fasttext/fasttext.ipynb: Training and testing procedures for the fastText-based model.
  • fasttext/helpers.py: Helper functions for the fastText model.
  • run.py: Code to reproduce our result.

Result

Steps to reproduce our result

There are two methods to reproduce our result:

  1. Use the trained models to generate predictions, then vote over them to get our best prediction. This takes about 2.5 hours on a CPU and can be faster on a GPU.
  2. Vote over the saved predictions to get our best prediction. This takes only a few seconds.
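The voting step shared by both methods can be illustrated with a minimal sketch. This is not the code from bagging.ipynb or run.py; the function name `majority_vote`, the label encoding (-1 for negative, 1 for positive), and the tie-breaking rule are all assumptions made for illustration.

```python
def majority_vote(predictions):
    """Combine per-model label lists into one list by simple majority.

    `predictions` is a list with one entry per model; each entry is a list
    of labels, assumed here to be -1 (negative) or 1 (positive).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of tweets")

    voted = []
    for labels in zip(*predictions):
        # With labels in {-1, 1}, the sign of the sum is the majority label.
        # Ties (possible only with an even number of models) go to the
        # positive class; this tie-breaking rule is an assumption.
        voted.append(1 if sum(labels) >= 0 else -1)
    return voted


# Three hypothetical models voting on three tweets.
model_outputs = [
    [1, -1, 1],
    [1, 1, -1],
    [-1, -1, 1],
]
final_labels = majority_vote(model_outputs)
```

Using an odd number of models avoids ties altogether, so the tie-breaking rule never comes into play.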

Here are the steps:

  1. If you choose the second method, skip steps 2 and 3.
  2. Download the trained models through this Google Drive link and put them in a folder called models.
  3. Change the test_data_dir argument so that it points to the testing data.
  4. Execute one of the following commands (the first for method 1, the second for method 2):
    $ python3 run.py --test_model --test_data_dir 'data/test_data.txt'
    or
    $ python3 run.py --test_predictions
  5. The predictions will be saved as best_prediction.csv.
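For readers unfamiliar with this style of command line, the flags used above could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the actual contents of run.py: only the flag names come from the commands above, while the function name `build_parser`, the help texts, the default path, and the choice to make the two modes mutually exclusive are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Reproduce the best prediction (illustrative sketch)."
    )
    # Assumption: exactly one of the two modes must be selected.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument(
        "--test_model",
        action="store_true",
        help="run the trained models on the test data, then vote",
    )
    mode.add_argument(
        "--test_predictions",
        action="store_true",
        help="vote over previously saved predictions",
    )
    # Assumption: the default mirrors the path used in the example command.
    parser.add_argument(
        "--test_data_dir",
        default="data/test_data.txt",
        help="location of the testing data",
    )
    return parser


# Parse the same arguments as the first example command.
args = build_parser().parse_args(
    ["--test_model", "--test_data_dir", "data/test_data.txt"]
)
```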

Developers

@Kuan Tung @Chun-Hung Yeh @De-Ling Liu

References

For BERT-based models:

  1. pytorch-sentiment-analysis Tutorial 6
  2. Class of transforming pandas DataFrame to torchtext Dataset
  3. Transformers documentation

License

Licensed under the MIT License.
