This code shows how to use multilingual sentence embeddings to mine parallel data from (huge) collections of monolingual data.
The underlying idea is pretty simple:
- embed the sentences in the two languages into the joint sentence space
- calculate all pairwise distances between the sentences. For N sentences in one language and M in the other, this has complexity O(N*M) and can be done very efficiently with the FAISS library [2]
- all sentence pairs whose distance is below a threshold are considered parallel
- this approach can be further improved using a margin criterion [3]
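
The steps above can be sketched in a few lines of NumPy. This is only an illustration, not the code run by bucc.sh: the function names (`margin_scores`, `mine`) and the neighbourhood size `k` are hypothetical, brute-force matrix products stand in for FAISS, and the score is a ratio-style margin (cosine similarity divided by the average similarity to the k nearest neighbours in both directions), which is one common formulation of the margin criterion. Because it works with similarities rather than distances, pairs are kept when the score is above the threshold.

```python
import numpy as np

def normalize(x):
    # Scale every row to unit length so that dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def margin_scores(src, trg, k=4):
    """Ratio-style margin score for every (source, target) pair.

    src: (N, d) array of source sentence embeddings
    trg: (M, d) array of target sentence embeddings
    """
    sim = normalize(src) @ normalize(trg).T            # (N, M), the O(N*M) step
    k_src = min(k, sim.shape[1])
    k_trg = min(k, sim.shape[0])
    # Mean similarity of each sentence to its k nearest neighbours on the other side.
    src_knn = np.sort(sim, axis=1)[:, -k_src:].mean(axis=1)   # (N,)
    trg_knn = np.sort(sim, axis=0)[-k_trg:, :].mean(axis=0)   # (M,)
    denom = (src_knn[:, None] + trg_knn[None, :]) / 2
    return sim / denom

def mine(src, trg, threshold, k=4):
    # For each source sentence keep its best-scoring target if the score passes the threshold.
    scores = margin_scores(src, trg, k)
    best = scores.argmax(axis=1)
    return [(i, int(j), float(scores[i, j]))
            for i, j in enumerate(best)
            if scores[i, j] >= threshold]
```

On real data the full similarity matrix would not fit in memory, which is where a nearest-neighbour index such as FAISS comes in.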
Here, we apply this idea to the data provided by the shared task of the BUCC Workshop on Building and Using Comparable Corpora.
The same approach can be scaled up to huge collections of monolingual texts (several billion sentences) using more advanced features of the FAISS toolkit.
- Please first download the BUCC shared task data here and install it in the directory "downloaded"
- then run the script
./bucc.sh
The thresholds are optimized for F-score on the training corpus. These results differ slightly from those published in [4] due to the switch from PyTorch 0.4 to 1.0.
Languages | Threshold | Precision | Recall | F-score |
---|---|---|---|---|
fr-en | 1.088131 | 91.52 | 93.32 | 92.41 |
de-en | 1.092056 | 95.65 | 95.19 | 95.42 |
ru-en | 1.093404 | 90.60 | 94.04 | 92.29 |
zh-en | 1.085999 | 91.99 | 91.31 | 91.65 |
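
As a reading aid for the table above, the F-score column is the harmonic mean of precision and recall. The sketch below recomputes it and also shows one plausible way to pick a threshold on training data by sweeping over ranked candidate pairs. The helper names (`f_score`, `best_threshold`) and the input format are assumptions made for illustration; the procedure actually used by bucc.sh may differ.

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall (F1); both in the same unit (fractions or percent).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored_pairs, n_gold):
    """Pick the score threshold that maximizes F1 on labelled candidates.

    scored_pairs: iterable of (score, is_correct) for each candidate pair
    n_gold: number of gold parallel pairs in the training corpus
    Assumes candidate scores are distinct; ties are not treated specially.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    best_f, best_thr, n_correct = 0.0, None, 0
    for n_extracted, (score, is_correct) in enumerate(ranked, start=1):
        n_correct += int(is_correct)
        f = f_score(n_correct / n_extracted, n_correct / n_gold)
        if f > best_f:
            # Keeping every pair with a score >= this value gives the best F1 so far.
            best_f, best_thr = f, score
    return best_thr, best_f

# Recompute the fr-en row of the table from its precision and recall.
print(round(f_score(91.52, 93.32), 2))
```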
Results on the official test set are scored by the organizers of the BUCC workshop.
Below, we compare our approach to the official results of the 2018 edition of the BUCC workshop [1]. More details on our approach are provided in [2,3,4].
System | fr-en | de-en | ru-en | zh-en |
---|---|---|---|---|
Azpeitia et al '17 | 79.5 | 83.7 | - | - |
Azpeitia et al '18 | 81.5 | 85.5 | 81.3 | 77.5 |
Bouamor and Sajjad '18 | 76.0 | - | - | - |
Chongman et al '18 | - | - | - | 56 |
LASER [3] | 75.8 | 76.9 | - | - |
LASER [4] | 93.1 | 96.2 | 92.3 | 92.7 |
All numbers are F1-scores on the test set.
To showcase the highly multilingual aspect of LASER's sentence embeddings, we also mine bitexts for language pairs which do not include English, e.g. French-German, Russian-French or Chinese-Russian. This is also performed by the script bucc.sh.
The table below gives the number of extracted parallel sentences for each language pair.
src/trg | French | German | Russian | Chinese |
---|---|---|---|---|
French | n/a | 2795 | 3327 | 387 |
German | 2795 | n/a | 3661 | 466 |
Russian | 3327 | 3661 | n/a | 664 |
Chinese | 387 | 466 | 664 | n/a |
[1] Pierre Zweigenbaum, Serge Sharoff and Reinhard Rapp, Overview of the Third BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora, LREC, 2018.
[2] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.
[3] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, 3 Nov 2018.
[4] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, 26 Dec 2018.