Comparing Finnish sentence embedding methods

The purpose of this repository is to compare sentence embedding models for Finnish and to find out whether methods that are known to perform well on English are also useful for Finnish.

Sentence embeddings are natural language processing algorithms that map textual sentences into numerical vectors. The vectors are supposed to capture the meaning of the sentence. The embeddings can be used to compare sentences: if two sentences express a similar idea using different words, the corresponding embedding vectors should still be close to each other. Sentence embeddings have been found to improve performance on many NLP tasks, such as sentiment analysis and machine translation.
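As a minimal sketch of what "close to each other" can mean, the snippet below compares toy vectors with cosine similarity. The three-dimensional vectors and the example sentences are invented for illustration and do not come from any of the models in this comparison. Cosine similarity is one common choice of closeness measure, assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for this example only.
emb_a = [0.9, 0.1, 0.3]    # "The film was great."
emb_b = [0.8, 0.2, 0.4]    # "I really enjoyed the movie."
emb_c = [-0.2, 0.9, -0.5]  # "The train leaves at noon."

sim_ab = cosine_similarity(emb_a, emb_b)  # similar meaning
sim_ac = cosine_similarity(emb_a, emb_c)  # unrelated meaning
print(sim_ab > sim_ac)
```

With a real embedding model, the first two sentences would be expected to score higher than the unrelated pair, as they do with these hand-picked vectors.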

Training the embedding models usually requires very large text corpora and significant computing power. Researchers have, however, published pre-trained models which can be adapted to various downstream tasks with reasonably low effort. Pre-trained sentence embeddings are typically used as input features to a neural network (or other machine learning model). Only the task-specific model is trained while the sentence embedding model is kept fixed.
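The division of labour described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this repository: `WORD_VECTORS` is a hypothetical stand-in for a pre-trained model, and a nearest-centroid classifier is swapped in for the neural network to keep the example short. Only the centroids are learned from the labelled data; the embedder is never modified.

```python
# Toy stand-in for a pre-trained embedding model (hypothetical values).
# It stays fixed: nothing below ever changes these vectors.
WORD_VECTORS = {
    "hieno": [1.0, 0.0],     # "fine"
    "loistava": [0.9, 0.1],  # "brilliant"
    "huono": [0.0, 1.0],     # "bad"
    "surkea": [0.1, 0.9],    # "lousy"
    "elokuva": [0.5, 0.5],   # "movie"
}


def embed(sentence):
    """Frozen embedder: average of the known word vectors."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_centroids(sentences, labels):
    """Task-specific model: one mean embedding (centroid) per label."""
    grouped = {}
    for sentence, label in zip(sentences, labels):
        grouped.setdefault(label, []).append(embed(sentence))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, sentence):
    """Return the label whose centroid is nearest to the sentence embedding."""
    x = embed(sentence)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


centroids = train_centroids(
    ["hieno elokuva", "huono elokuva"],
    ["positive", "negative"],
)
prediction = predict(centroids, "loistava elokuva")
print(prediction)
```

Swapping `embed` for a different pre-trained model would leave the training and prediction code unchanged, which is what makes a side-by-side comparison of embedding methods practical.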

Researchers have so far focused mostly on English and other widely spoken languages. However, a few pre-trained models have been published for Finnish (or rather, multilingual models that include Finnish). This analysis compares the published Finnish models.

Models included in the comparison:

TF-IDF
Average-pooled word2vec trained on the Finnish Internet Parsebank
Average-pooled multilingual FastText
FinBERT
Smoothed Inverse Frequency weighting (SIF) of word embeddings
Bag of embedding projections (BOREP)
LASER - Language-Agnostic SEntence Representations
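To make the difference between two of the listed pooling approaches concrete, here is a small sketch of plain average pooling next to SIF weighting. It assumes toy two-dimensional word vectors and made-up unigram probabilities; the function names and the value `a=1e-3` are choices made for this example rather than settings taken from this repository. The common-component removal step of the full SIF method is left out.

```python
def average_pool(tokens, word_vectors):
    """Plain average of the vectors of the known tokens."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def sif_pool(tokens, word_vectors, word_prob, a=1e-3):
    """Average where each token is weighted by a / (a + p(token)).

    Frequent words get small weights. The removal of the first
    principal component, part of the full method, is omitted here.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for t in tokens:
        if t in word_vectors:
            weight = a / (a + word_prob.get(t, 0.0))
            total = [s + weight * x for s, x in zip(total, word_vectors[t])]
            count += 1
    if count == 0:
        return total
    return [s / count for s in total]


# Hypothetical vectors and unigram probabilities, invented for illustration.
word_vectors = {"ja": [1.0, 0.0], "kissa": [0.0, 1.0]}  # "and", "cat"
word_prob = {"ja": 0.05, "kissa": 0.0001}

tokens = ["kissa", "ja"]
plain = average_pool(tokens, word_vectors)
weighted = sif_pool(tokens, word_vectors, word_prob)
print(plain, weighted)
```

In the plain average both words contribute equally, while the SIF-weighted vector is dominated by the rare content word "kissa" and the frequent function word "ja" is almost suppressed.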

Results

Read a report on the study results.

Download datasets and pre-trained models

./scripts/get_data.sh

Run

pipenv run scripts/run.sh

The results are written to results/scores.{csv, png}.

Hyperparameter optimization

pipenv run scripts/tune_hyperparameters.sh

The optimal hyperparameters are written to results/hyperparameters.json.

To take the tuned parameters into use, copy the file to the models subdirectory:

cp results/hyperparameters.json models/hyperparameters.json

Refreshing the report

The Markdown source files for the report are located in docs-source and the generated HTML files in docs. The report is hosted on GitHub Pages.

Generating the report requires pandoc-scholar.

cd docs-source
make
git push  ## Updates the public web pages

aajanki/fi-sentence-embeddings-eval
