In this repository, we explore different methods for similarity search across multiple languages. We aim to compare:
- LASER (Baseline)
- XLM MLM 100 languages
- XLM-R
- sentence-transformers with multilingual models
We compare the models below on the publicly available newstest2012 set from WMT 2012.
For each sentence in the source language, we find the closest sentence in the target language in the joint embedding space. If that sentence has the same index in the file, it is counted as correct; otherwise it is counted as an error. The N-way parallel corpus must therefore not contain duplicates.
- First, get the data as described in the section "Get data".
- Second, install the necessary tools, such as tokenizers and the BPE encoder.
- Third, preprocess the data (this step is not necessary for all models, but interesting for the sake of experimentation).
- Then, compute sentence embeddings with the method of your choice. This part of the code is not optimized yet, so it can be very slow for the XLM models; it is quite fast with sentence-transformers. It could be sped up with larger batches (a batch version is available but needs optimization). Feel free to raise an issue with suggestions.
- Finally, run the similarity search for each pair of languages (see the sketch after this list).
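As a rough sketch of this last step, the error rate for one language pair can be computed as follows (a minimal illustration, assuming the embeddings are already computed and stored as NumPy arrays; `src_emb` and `tgt_emb` are placeholder names, not the repository's actual code):

```python
import numpy as np

def error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose closest target sentence
    (by cosine similarity) does not sit at the same line index."""
    # L2-normalize so that a dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)     # closest target sentence per source
    correct = nearest == np.arange(len(src))   # same index => correct match
    return 1.0 - correct.mean()
```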
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 0.70% | 0.90% | 0.67% | 0.77% | 0.76% |
de | 0.83% | 0.00% | 1.17% | 0.90% | 1.03% | 0.98% |
en | 0.93% | 1.27% | 0.00% | 0.83% | 1.07% | 1.02% |
es | 0.53% | 0.77% | 0.97% | 0.00% | 0.57% | 0.71% |
fr | 0.50% | 0.90% | 1.13% | 0.60% | 0.00% | 0.78% |
avg | 0.70% | 0.91% | 1.04% | 0.75% | 0.86% | 1.06% |
Czech is currently not covered by sentence-transformers, and more experiments are coming for a thorough comparison. The following table therefore gives a better view of the performance sentence-transformers can achieve with 'distiluse-base-multilingual-cased' as the base model:
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | NA | NA | NA | NA | NA | NA |
de | NA | 0.00% | 1.40% | 1.60% | 1.53% | TODO |
en | NA | 1.47% | 0.00% | 1.33% | 0.97% | TODO |
es | NA | 1.80% | 1.10% | 0.00% | 1.03% | TODO |
fr | NA | 1.73% | 0.97% | 1.07% | 0.00% | TODO |
avg | NA | TODO | TODO | TODO | TODO | TODO |
Sentence-transformers outperforms LASER on the pairs (French, English) and (English, French); otherwise, LASER shows lower error rates.
Russian is also included in our analysis. LASER has not published results for Russian in their repository; we will reproduce their experiments soon and include Russian for a rigorous comparison.
langs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|
de | 0.00% | 1.40% | 1.60% | 1.53% | 2.60% | 1.78% |
en | 1.47% | 0.00% | 1.33% | 0.97% | 0.50% | 1.07% |
es | 1.80% | 1.10% | 0.00% | 1.03% | 2.13% | 1.52% |
fr | 1.73% | 0.97% | 1.07% | 0.00% | 2.00% | 1.44% |
ru | 2.76% | 0.50% | 2.23% | 2.03% | 0.00% | 1.88% |
avg | 1.94% | 0.99% | 1.56% | 1.39% | 1.81% | 1.54% |
We can also encode Czech in a zero-shot fashion. The following table shows the results:
langs | cs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|---|
cs | 0.00% | 39.46% | 37.63% | 38.20% | 38.99% | 41.69% | 39.19% |
de | 36.10% | 0.00% | 1.40% | 1.60% | 1.53% | 2.60% | 8.64% |
en | 34.17% | 1.47% | 0.00% | 1.33% | 0.97% | 0.50% | 7.69% |
es | 35.36% | 1.80% | 1.10% | 0.00% | 1.03% | 2.13% | 8.29% |
fr | 35.96% | 1.73% | 0.97% | 1.07% | 0.00% | 2.00% | 8.34% |
ru | 37.96% | 2.76% | 0.50% | 2.23% | 2.03% | 0.00% | 9.10% |
avg | 35.91% | 9.44% | 8.32% | 8.88% | 8.91% | 9.78% | 13.54% |
With mean pooling, MAX_LEN = 100:

Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 84.85% | 87.71% | 86.85% | 88.71% | 87.03% |
de | 75.06% | 0.00% | 64.20% | 70.36% | 69.13% | 69.69% |
en | 74.16% | 54.28% | 0.00% | 24.84% | 29.14% | 45.60% |
es | 75.99% | 81.62% | 40.86% | 0.00% | 50.95% | 62.35% |
fr | 90.08% | 78.16% | 37.53% | 56.88% | 0.00% | 65.66% |
avg | 78.82% | 74.73% | 57.58% | 59.73% | 59.48% | 66.07% |
With mean pooling, MAX_LEN = 50:

Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 84.18% | 85.98% | 82.65% | 89.08% | 85.47% |
de | 74.79% | 0.00% | 61.51% | 65.23% | 67.27% | 67.20% |
en | 75.72% | 55.04% | 0.00% | 26.71% | 31.60% | 47.27% |
es | 78.39% | 81.75% | 37.40% | 0.00% | 55.54% | 63.27% |
fr | 75.36% | 63.20% | 35.16% | 34.33% | 0.00% | 52.01% |
avg | 76.07% | 71.05% | 55.01% | 52.23% | 60.87% | 63.05% |
With cls pooling, MAX_LEN = 100:

Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 89.84% | 98.07% | 89.88% | 91.94% | 92.43% |
de | 72.86% | 0.00% | 88.68% | 71.33% | 73.63% | 76.62% |
en | 81.55% | 64.47% | 0.00% | 42.99% | 51.85% | 60.21% |
es | 77.46% | 78.82% | 85.55% | 0.00% | 60.11% | 75.48% |
fr | 75.42% | 76.76% | 83.82% | 50.18% | 0.00% | 71.55% |
avg | 76.82% | 77.47% | 89.03% | 63.59% | 69.38% | 75.26% |
With cls pooling, MAX_LEN = 50:

Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.51% | 98.30% | 91.04% | 93.81% | 93.66% |
de | 75.02% | 0.00% | 89.68% | 73.16% | 76.92% | 78.70% |
en | 82.78% | 67.87% | 0.00% | 49.32% | 56.51% | 64.12% |
es | 78.69% | 81.68% | 86.38% | 0.00% | 64.84% | 77.90% |
fr | 76.72% | 80.22% | 85.18% | 54.18% | 0.00% | 74.08% |
avg | 78.31% | 80.32% | 89.89% | 66.92% | 73.02% | 77.69% |
With MAX_LEN = 100 (the higher, the better, according to our experiments):
langs | cs | de | en | es | fr | ru | avg |
---|---|---|---|---|---|---|---|
cs | 0.00% | 74.46% | 74.63% | 71.86% | 47.45% | 42.72% | 62.22% |
de | 48.42% | 0.00% | 32.83% | 35.23% | 35.30% | 40.93% | 38.54% |
en | 53.28% | 37.70% | 0.00% | 22.91% | 25.57% | 24.18% | 32.73% |
es | 40.26% | 48.68% | 19.35% | 0.00% | 25.87% | 36.56% | 34.15% |
fr | 65.73% | 50.35% | 20.85% | 37.30% | 0.00% | 31.73% | 41.19% |
ru | 41.56% | 42.09% | 21.51% | 34.23% | 34.73% | 0.00% | 34.83% |
avg | 49.85% | 50.66% | 33.83% | 40.31% | 33.79% | 35.22% | 40.61% |
Without Russian (for easier comparison with the baseline):
langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 74.46% | 74.63% | 71.86% | 47.45% | 67.10% |
de | 48.42% | 0.00% | 32.83% | 35.23% | 35.30% | 37.95% |
en | 53.28% | 37.70% | 0.00% | 22.91% | 25.57% | 34.87% |
es | 40.26% | 48.68% | 19.35% | 0.00% | 25.87% | 33.54% |
fr | 65.73% | 50.35% | 20.85% | 37.30% | 0.00% | 43.56% |
avg | 51.92% | 52.80% | 36.91% | 41.82% | 33.55% | 43.40% |
@TODO: check why XLM-R shows such poor performance.
The mean pooling strategy gives the best performance; refer to this issue for a comparison with the CLS pooling strategy.

Input: MAX_LEN = 40, mean pooling strategy:
```
sh similarity_XLM-R_batch.sh 40 mean True
```
Output:
Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.31% | 97.64% | 98.37% | 94.34% | 95.41% |
de | 93.84% | 0.00% | 88.05% | 92.87% | 95.40% | 92.54% |
en | 91.21% | 77.79% | 0.00% | 72.56% | 94.77% | 84.08% |
es | 95.47% | 93.14% | 61.84% | 0.00% | 90.28% | 85.18% |
fr | 91.97% | 81.25% | 77.12% | 71.96% | 0.00% | 80.58% |
avg | 93.12% | 85.87% | 81.16% | 83.94% | 93.70% | 87.56% |
Input: MAX_LEN = 100, mean pooling strategy:
```
sh similarity_XLM-R_batch.sh 100 mean True
```
Output:
Confusion matrix:

langs | cs | de | en | es | fr | avg |
---|---|---|---|---|---|---|
cs | 0.00% | 91.24% | 97.77% | 97.80% | 92.11% | 94.73% |
de | 94.31% | 0.00% | 86.55% | 92.44% | 94.67% | 91.99% |
en | 91.08% | 74.93% | 0.00% | 71.56% | 77.69% | 78.81% |
es | 95.37% | 88.38% | 58.54% | 0.00% | 90.38% | 83.17% |
fr | 92.71% | 91.58% | 87.41% | 89.14% | 0.00% | 90.21% |
avg | 93.36% | 86.53% | 82.57% | 87.74% | 88.71% | 87.78% |
In the following, we show how to replicate these results.
For Linux users: use the script wmt.sh:
```
sh wmt.sh
```
We assume that tools such as torch and tqdm are already installed; this is the case on Google Colab (cf. examples, coming soon).
Use the script install-tools.sh:
```
sh install-tools.sh
```
When we preprocessed the data prior to encoding, we observed lower performance, so we do not recommend it. If you still want to preprocess the data, use the prepare-data-wmt.sh script. Some small changes are then needed so that the input file is found in the proper directory; we may add an --input_file_name argument in the future.
Use the script prepare-data-wmt.sh:
```
sh prepare-data-wmt.sh
```
At the moment, sentence-transformers with "distiluse-base-multilingual-cased" only covers Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
Czech is therefore not covered. We still print the errors for Czech for the sake of experimenting with pure zero-shot learning; for comparison with the baseline, however, we focus on the other language pairs.
```
sh similarity_sentenceBERT.sh
```
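For reference, encoding with sentence-transformers comes down to a few lines (a minimal sketch; the input file path is a placeholder for one of the newstest2012 files):

```python
from sentence_transformers import SentenceTransformer

# The multilingual base model discussed above
model = SentenceTransformer("distiluse-base-multilingual-cased")

# One sentence per line, same line order in every language file (placeholder path)
with open("newstest2012.fr", encoding="utf-8") as f:
    sentences = [line.strip() for line in f]

# encode() batches internally, which is why this method is already fast
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
```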
The source code in this section still needs optimization: at the moment, we iterate over each line and build the encoding on the fly, which takes too much time.
MAX_LEN is an integer parameter describing the number of tokens to consider when encoding.
Sentences with fewer than MAX_LEN tokens are zero-padded; sentences with more than MAX_LEN tokens are truncated after index MAX_LEN (a small sketch of this logic follows the command below).
The default value is 40. Based on prior statistics on the preprocessed French data:
- mean sentence length: 26.6 tokens
- standard deviation of sentence length: 15.4 tokens
- maximum sentence length: 145 tokens
We recommend increasing MAX_LEN for experiments: a larger MAX_LEN gives better performance but slower computation.
```
sh similarity_XLM.sh MAX_LEN
```
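As an illustration of the padding/truncation logic above, here is a minimal sketch (the function name `clip_or_pad` and the `pad_id` parameter are placeholders, not the repository's actual code):

```python
def clip_or_pad(token_ids: list, max_len: int = 40, pad_id: int = 0) -> list:
    """Force a token-id sequence to exactly max_len entries."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]  # ignore all tokens after index MAX_LEN
    # zero-padding for sentences shorter than MAX_LEN
    return token_ids + [pad_id] * (max_len - len(token_ids))
```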
As for XLM, the source code in this section still needs optimization: at the moment, we iterate over each line and build the encoding on the fly, which takes too much time.
Parameters:
- --max_len: maximum length of the sentences.
- --pooling_strat: pooling strategy, currently one of {cls, mean}; needs to be optimized. The default is the cls embedding.
- --gpu: use the GPU if set to True and your machine has GPU support.
```
sh similarity_XLM-R.sh MAX_LEN POOLING_STRAT
```
Batch version:
```
sh similarity_XLM-R_batch.sh MAX_LEN POOLING_STRAT GPU
```
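For illustration, here is roughly how the two pooling strategies turn XLM-R's token-level outputs into a single sentence vector (a sketch using the HuggingFace transformers library; the repository's scripts may differ in details such as masking):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

# Pad/truncate every sentence to MAX_LEN tokens, as described above
batch = tokenizer(["Hello world.", "Bonjour le monde."],
                  padding="max_length", truncation=True, max_length=40,
                  return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # shape: (batch, seq_len, dim)

# mean pooling: average the token embeddings, ignoring padded positions
mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq_len, 1)
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# cls pooling: take the embedding of the first (<s>) token
cls_pooled = hidden[:, 0]
```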