Tweet Sentiment Classification

This is the second project of the EPFL Machine Learning course, Fall 2019. In this project, we are given a dataset containing 2.5 million tweets. Half of the tweets are labeled with positive sentiment and the rest with negative sentiment. Our task is to predict the sentiment of 10,000 unlabeled tweets in the testing set.

Dependencies

The project is implemented in Python 3. You will need the following dependencies installed:

NLTK
```
$ pip install nltk
```
Gensim
```
$ pip install gensim
```
FastText
```
$ pip install fasttext
```
Torchtext
```
$ pip install torchtext
```
Transformers
```
$ pip install transformers
```
tqdm
```
$ pip install tqdm
```

Files Description

tfidf_word2vec/tf_idf.ipynb: Training and testing procedure for simple ML models using the TF-IDF matrix.
tfidf_word2vec/word2vec.ipynb: Training and testing procedure for simple ML models using the Word2Vec matrix.
tfidf_word2vec/helpers_simple_ml.py: Helper functions used in tf_idf.ipynb and word2vec.ipynb.
bagging.ipynb: Simple voting (can be used after training and testing in bert_based.ipynb).
bert_based.ipynb: Training and testing procedures for BERT-based models.
fasttext/fasttext.ipynb: Training and testing procedures for the fastText-based model.
fasttext/helpers.py: Helper functions for the fastText model.
run.py: Script to reproduce our result.
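
For readers unfamiliar with the TF-IDF matrix mentioned above, here is a minimal pure-Python sketch of the classical definition (term frequency times log inverse document frequency). It is an illustration only: the function name `tfidf_matrix`, the whitespace tokenizer, and the unsmoothed IDF are assumptions, and the notebooks may well rely on a library vectorizer with different weighting.

```python
import math
from collections import Counter


def tfidf_matrix(tweets):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight
    of vocabulary[j] in tweets[i]. Hypothetical helper for illustration."""
    # Assumption: lowercase + whitespace split stands in for real tokenization
    docs = [tweet.lower().split() for tweet in tweets]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of tweets that contain each token
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for token in vocabulary:
            tf = counts[token] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[token])  # unsmoothed IDF
            row.append(tf * idf)
        rows.append(row)
    return vocabulary, rows


# Toy input (made-up tweets)
vocabulary, rows = tfidf_matrix(["good day", "bad day"])
print(vocabulary)
print(rows)
```

A token that appears in every tweet (here "day") gets weight zero, which is why TF-IDF tends to emphasize words that distinguish one tweet from another.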

Result

AIcrowd competition link: https://www.aicrowd.com/challenges/epfl-ml-text-classification-01b777b0-a83a-412a-b6f8-f3dc53cb1bce
Group name: TWN1
Leaderboard:
- Categorical accuracy: 0.909
- F1 score: 0.909
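
As a reminder of what the two leaderboard metrics measure, the sketch below computes them on a toy example. The label encoding (1 for positive, -1 for negative), the function names, and the choice of the positive class for F1 are assumptions for illustration; the exact scoring used by the competition platform may differ.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up; assumed encoding: 1 = positive, -1 = negative)
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred))
```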

Steps to reproduce our result

There are two methods to reproduce our result:

1. Use the trained models to generate predictions, then vote among them to get our best prediction. This takes about 2.5 hours on a CPU and can be faster on a GPU.
2. Vote among the saved predictions to get our best prediction. This takes only a few seconds.

Here are the steps:

1. If you choose the second method, skip steps 2 and 3.
2. Download the trained models through the Google Drive link and put them in a folder called models.
3. Change the test_data_dir argument (the location of the testing data).
4. Execute the following command:

$ python3 run.py --test_model --test_data_dir 'data/test_data.txt'
or
$ python3 run.py --test_predictions

The predictions will be saved as best_prediction.csv.
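
The voting step used in both methods can be sketched as a simple majority vote over per-model predictions. This is a minimal illustration, not the code in run.py or bagging.ipynb: the function name `majority_vote`, the label encoding (1 and -1), and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of equally long label lists, one per model.
    Assumption: ties go to the label of the earliest model in the list.
    """
    voted = []
    for labels in zip(*predictions):  # labels for one tweet across all models
        counts = Counter(labels)
        top = max(counts.values())
        # Keep model order so that ties resolve to the earliest model
        winners = [label for label in labels if counts[label] == top]
        voted.append(winners[0])
    return voted


# Toy predictions from three hypothetical models (1 = positive, -1 = negative)
model_a = [1, -1, 1, -1]
model_b = [1, 1, -1, -1]
model_c = [-1, 1, 1, -1]
print(majority_vote([model_a, model_b, model_c]))
```

With an odd number of models and two classes, ties cannot occur, so the tie-breaking rule only matters for an even number of voters.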

Developers

@Kuan Tung @Chun-Hung Yeh @De-Ling Liu

References

For BERT based models:

License

Licensed under the MIT License.
