LASER: application to bitext mining

This code shows how to use multilingual sentence embeddings to mine parallel data from (huge) collections of monolingual data.

The underlying idea is pretty simple:

- Embed the sentences in the two languages into the joint sentence space.
- Calculate all pairwise distances between the sentences. This has complexity O(N*M) and can be done very efficiently with the FAISS library [2].
- Consider all sentence pairs with a distance below a threshold to be parallel.
- This approach can be further improved using a margin criterion [3].
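
The steps above can be sketched in a few lines of NumPy. This is only an illustration, not the code run by bucc.sh: it assumes the sentence embeddings are already available as two arrays, it uses a dense matrix product where the real pipeline would use FAISS, and the names `mine_bitext`, `k` and `threshold` are hypothetical. It works with cosine similarity rather than distance, so higher scores mean closer pairs and the cut keeps pairs above the threshold. The ratio margin shown is one reading of the criterion described in [3].

```python
import numpy as np


def normalize(x):
    """L2-normalize rows so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_bitext(src_emb, trg_emb, k=4, threshold=1.0):
    """Return (src_index, trg_index, score) for pairs scoring above threshold.

    The score is a ratio margin: the cosine similarity of a pair divided by
    the average similarity of both sentences to their k nearest neighbours.
    """
    src = normalize(np.asarray(src_emb, dtype=np.float64))
    trg = normalize(np.asarray(trg_emb, dtype=np.float64))
    sim = src @ trg.T  # all N*M pairwise cosine similarities

    k = min(k, sim.shape[0], sim.shape[1])
    # Mean similarity to the k nearest neighbours, in each direction.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # shape (N,)
    trg_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # shape (M,)

    margin = sim / ((src_knn[:, None] + trg_knn[None, :]) / 2)

    # Keep the best-scoring target for each source sentence, then threshold.
    best = margin.argmax(axis=1)
    scores = margin[np.arange(len(src)), best]
    return [
        (i, int(j), float(s))
        for i, (j, s) in enumerate(zip(best, scores))
        if s > threshold
    ]
```

Calling `mine_bitext(src_emb, trg_emb)` on two embedding matrices with the same number of columns returns the candidate pairs. The dense similarity matrix needs O(N*M) memory, which is why a nearest-neighbour library is needed for anything beyond toy sizes.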

Here, we apply this idea to the data provided by the shared task of the BUCC Workshop on Building and Using Comparable Corpora.

The same approach can be scaled up to huge collections of monolingual texts (several billion sentences) using more advanced features of the FAISS toolkit.

Installation

Please first download the BUCC shared task data and install it in the directory "downloaded".
Then run the script:

./bucc.sh

Results

The thresholds are optimized for F-score on the training corpus. These results differ slightly from those published in [4] due to the switch from PyTorch 0.4 to 1.0.

Languages	Threshold	Precision	Recall	F-score
fr-en	1.088131	91.52	93.32	92.41
de-en	1.092056	95.65	95.19	95.42
ru-en	1.093404	90.60	94.04	92.29
zh-en	1.085999	91.99	91.31	91.65
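
As a rough illustration of how such a threshold can be chosen, the sketch below sweeps over scored candidate pairs and keeps the cut that maximizes the F-score against gold alignments. It is an assumption-laden example, not the procedure implemented in bucc.sh: the function name `best_threshold` and the input layout (a list of `(score, src_id, trg_id)` tuples and a set of gold `(src_id, trg_id)` pairs) are hypothetical.

```python
def best_threshold(candidates, gold):
    """Find the score threshold that maximizes F1 on labelled data.

    candidates: iterable of (score, src_id, trg_id), higher score = better.
    gold: set of (src_id, trg_id) pairs known to be parallel.
    Returns (threshold, precision, recall, f1); every candidate with
    score >= threshold is counted as extracted.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    best = (float("inf"), 0.0, 0.0, 0.0)
    correct = 0
    for n_kept, (score, src_id, trg_id) in enumerate(ranked, start=1):
        if (src_id, trg_id) in gold:
            correct += 1
        # Candidates with equal scores are kept or dropped together,
        # so only evaluate after the last one of a tie.
        if n_kept < len(ranked) and ranked[n_kept][0] == score:
            continue
        precision = correct / n_kept
        recall = correct / len(gold) if gold else 0.0
        if correct:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        if f1 > best[3]:
            best = (score, precision, recall, f1)
    return best
```

The returned precision, recall and F1 are fractions between 0 and 1, whereas the table above reports percentages.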

Results on the official test set are scored by the organizers of the BUCC workshop.

Below, we compare our approach to the official results of the 2018 edition of the BUCC workshop [1]. More details on our approach are provided in [2,3,4].

System	fr-en	de-en	ru-en	zh-en
Azpeitia et al '17	79.5	83.7	-	-
Azpeitia et al '18	81.5	85.5	81.3	77.5
Bouamor and Sajjad '18	76.0	-	-	-
Chongman et al '18	-	-	-	56
LASER [3]	75.8	76.9	-	-
LASER [4]	93.1	96.2	92.3	92.7

All numbers are F1-scores on the test set.

Bonus

To showcase the highly multilingual aspect of LASER's sentence embeddings, we also mine bitexts for language pairs that do not include English, e.g. French-German, Russian-French or Chinese-Russian. This is also performed by the script bucc.sh.

The table below gives the number of extracted parallel sentences for each language pair.

src/trg	French	German	Russian	Chinese
French	n/a	2795	3327	387
German	2795	n/a	3661	466
Russian	3327	3661	n/a	664
Chinese	387	466	664	n/a

References

[1] Pierre Zweigenbaum, Serge Sharoff and Reinhard Rapp, Overview of the Third BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora, LREC, 2018.

[2] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[3] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, 3 Nov 2018.

[4] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, 26 Dec 2018.
