This is the second project of the EPFL Machine Learning course, Fall 2019. We are given a dataset of 2.5 million tweets, half labeled with positive sentiment and half with negative sentiment. Our task is to predict the sentiment of 10,000 unlabeled tweets in the testing set.
The code is written in Python 3. You will need the following dependencies installed:
- $ pip install nltk
- $ pip install gensim
- $ pip install fasttext
- $ pip install torchtext
- $ pip install transformers
- $ pip install tqdm
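Before running the notebooks, it can help to confirm that every dependency is importable. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED` list are illustrative, and it assumes each package's import name matches its pip name, which holds for the six packages listed above.

```python
import importlib.util
from typing import Iterable, List

# Import names of the dependencies listed above (assumed identical to the pip names).
REQUIRED = ["nltk", "gensim", "fasttext", "torchtext", "transformers", "tqdm"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```

Any name printed as missing can be installed with the corresponding pip command from the list above.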
- tfidf_word2vec/tf_idf.ipynb: Training and testing procedure for simple ML models using the TF-IDF matrix.
- tfidf_word2vec/word2vec.ipynb: Training and testing procedure for simple ML models using the Word2Vec matrix.
- tfidf_word2vec/helpers_simple_ml.py: Helper functions used in tf_idf.ipynb and word2vec.ipynb.
- bagging.ipynb: Simple voting (can be used after training and testing in bert_based.ipynb).
- bert_based.ipynb: Training and testing procedures for BERT-based models.
- fasttext/fasttext.ipynb: Training and testing procedures for the fastText-based model.
- fasttext/helpers.py: Helper functions for the fastText model.
- run.py: Code to reproduce our result.
- AIcrowd competition link: https://www.aicrowd.com/challenges/epfl-ml-text-classification-01b777b0-a83a-412a-b6f8-f3dc53cb1bce
- Group name: TWN1
- Leaderboard scores:
  - Categorical accuracy: 0.909
  - F1 score: 0.909
There are two methods to reproduce our result:
- Use the trained models to generate predictions, then vote among them to obtain our best prediction. This takes about 2.5 hours on a CPU and can be faster on a GPU.
- Vote among the saved predictions to obtain our best prediction. This takes only a few seconds.
Here are the steps:
1. If you choose the second method, skip steps 2 and 3.
2. Download the trained models through this Google Drive link and put them in a folder called models.
3. Change the test_data_dir argument (the path to the testing data).
4. Execute one of the following commands, the first for the first method and the second for the second method:
   $ python3 run.py --test_model --test_data_dir 'data/test_data.txt'
   $ python3 run.py --test_predictions
5. The prediction will be saved as best_prediction.csv.
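Both methods end with a vote over the per-model predictions. The snippet below is a minimal sketch of plain majority voting, not the code in run.py or bagging.ipynb: the function name `majority_vote`, the in-memory lists, and the -1/1 label encoding are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine one label list per model into a single list by majority vote."""
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of tweets")

    voted = []
    # zip(*predictions) yields, for each tweet, the labels given by all models.
    for labels in zip(*predictions):
        # Pick the most frequent label; with an odd number of models and
        # two possible labels, a tie cannot occur.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    # Hypothetical predictions from three models for four tweets
    # (1 = positive, -1 = negative; the encoding is an assumption).
    model_a = [1, -1, 1, 1]
    model_b = [1, 1, -1, 1]
    model_c = [-1, -1, 1, 1]
    print(majority_vote([model_a, model_b, model_c]))
```

Using an odd number of models keeps the vote unambiguous for two-class sentiment labels.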
@Kuan Tung @Chun-Hung Yeh @De-Ling Liu
For BERT based models:
- pytorch-sentiment-analysis Tutorial 6
- Class of transforming pandas DataFrame to torchtext Dataset
- Transformers documentation
Licensed under MIT License