Create a text classification model that predicts the native language of an author who writes in English. Using Google's BERT, vector representations of the authors' texts are obtained and fed into neural networks and other prediction models.
This task was created for the CS585 - Natural Language Processing Fall 2019 Course at Illinois Institute of Technology, Chicago.
- Clone the BERT repository and add it as a git submodule; this is referred to as `BERT_BASE_DIR`.
- Use the BERT-Base Uncased model files as the data files; the directory containing them is referred to as `BERT_DATA_DIR`. Repo Link.
- Download the dataset used for training, validation, and testing from the University Repo; this is the `data` directory.
- Run `Format Data For Input.sh`, which programmatically reformats the data files into `bert_input_data`, then run `run_bert_fv.sh`, which writes the feature vector representation of each example into the `bert_output_data` directory.
- Apply the prediction models in the `Prediction Models.ipynb` file.
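Once `run_bert_fv.sh` has produced the `.jsonlines` files, each line holds the per-token feature vectors for one input sentence. A minimal sketch of collapsing one such record into a single fixed-size sentence vector by mean-pooling the top-layer token vectors (the record schema below follows the layout of BERT's `extract_features.py` output; verify it against your generated files, as versions may differ):

```python
import json

def pool_sentence_vector(jsonline):
    """Average the top-layer token vectors from one BERT
    feature-extraction record into a single sentence vector.
    Assumes the record has a "features" list, where each entry
    carries "layers" -> [{"index", "values"}] as in the BERT
    repo's extract_features.py output."""
    record = json.loads(jsonline)
    vectors = [feat["layers"][0]["values"] for feat in record["features"]]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Tiny hand-made record with 2 tokens and 3-dimensional vectors:
line = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [1.0, 2.0, 3.0]}]},
        {"token": "hello", "layers": [{"index": -1, "values": [3.0, 4.0, 5.0]}]},
    ]
})
print(pool_sentence_vector(line))  # → [2.0, 3.0, 4.0]
```

The pooled vectors can then be stacked into a feature matrix for the models in the notebook.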
```
BERT_BASE_DIR            (files from Google's BERT submodule)
BERT_DATA_DIR            (files from the BERT-Base Uncased model)
data                     (dataset from the University Repository)
|--lang_id_train.csv
|--lang_id_eval.csv
|--lang_id_test.csv
bert_input_data          (formatted files for vector representation)
|--train.txt
|--eval.txt
|--test.txt
bert_output_data         (obtained feature vector representations)
|--train.jsonlines
|--eval.jsonlines
|--test.jsonlines
Format Data For Input.sh
run_bert_fv.sh
Prediction Models.ipynb
```
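As a toy illustration of the prediction step, the sketch below trains a nearest-centroid classifier on pooled feature vectors. This is a stand-in, not the models actually used in `Prediction Models.ipynb`; the labels, data, and helper names here are hypothetical:

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_centroid_fit(X, y):
    """Compute one centroid per native-language label."""
    by_label = {}
    for vec, label in zip(X, y):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vs) for label, vs in by_label.items()}

def nearest_centroid_predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy 2-D feature vectors for two hypothetical language labels:
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.8, 5.0]]
y = ["DE", "DE", "FR", "FR"]
model = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(model, [0.1, 0.0]))  # → DE
```

In practice the notebook would consume the 768-dimensional BERT vectors from `bert_output_data` rather than these toy points.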