The purpose of this repository is to compare sentence embedding models for Finnish and understand if the methods, which are known to perform well on English language, are useful on Finnish, too.
Sentence embeddings are natural language processing algorithms that map textual sentences into numerical vectors. Vectors are supposed to capture the meaning of the sentence. The embeddings can be used to compare sentences: if two sentences express a similar idea using different words, the corresponding embedding vectors should still be close to each other. Sentence embeddings are have been found to improve performance on many NLP tasks, such as sentiment analysis and machine translation.
The training of the embedding models usually requires very large text corpora and significant computing power. Researchers have, however, published pre-trained models which can be adapted to various downstream tasks with reasonable low effort. Pre-trained sentence embeddings are typically used as input features to a neural network (or other machine learning model). Only the task-specific model is trained while the sentence embedding model if kept fixed.
Researchers have so far focused mostly on English and other most spoken languages. However, there have been a few pre-trained models published for Finnish (or rather multilingual models that include Finnish). This analysis will compare the published Finnish models.
Models included in the comparison:
- TF-IDF
- Average-pooled word2vec trained on the Finnish Internet Parsebank
- Average-pooled multilingual FastText
- FinBERT
- Smoothed Inverse Frequency weighting (SIF) of word embeddings
- Bag of embedding projections (BOREP)
- LASER - Language-Agnostic SEntence Representations
Read a report on the study results.
./scripts/get_data.sh
pipenv run scripts/run.sh
The results are written to results/scores.{csv, png}.
pipenv run scripts/tune_hyperparameters.sh
The optimal hyperparameters are written to results/hyperparameters.json.
To take the tuned parameters into use, copy the file to the models subdirectory:
cp results/hyperparameters.json models/hyperparameters.json
The Markdown source files for the report are located at docs-source and the generated HTML files at docs. The report is hosted on Github pages.
Generating the report requires pandoc-scholar.
cd docs-source
make
git push ## Updates the public web pages