chore: update make_multilingual.py (#2243)
* chore: update make_multilingual.py

Update paths for parallel dataset

* Update broken paths throughout the examples

* Remove now-unused correct_bias

---------

Co-authored-by: Tom Aarsen <[email protected]>
rufimelo99 and tomaarsen authored Dec 14, 2023
1 parent 3db309a commit 594143e
Showing 6 changed files with 26 additions and 27 deletions.
2 changes: 1 addition & 1 deletion examples/evaluation/evaluation_translation_matching.py
@@ -17,7 +17,7 @@
python [model_name_or_path] [parallel-file1] [parallel-file2] ...
For example:
-python distiluse-base-multilingual-cased TED2020-en-de.tsv.gz
+python distiluse-base-multilingual-cased talks-en-de.tsv.gz
See the training_multilingual/get_parallel_data_...py scripts for getting parallel sentence data from different sources
"""
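For orientation, the matching logic of this script is roughly the following sketch, assuming the `TranslationEvaluator` API from `sentence_transformers.evaluation`; the real script adds argument parsing, logging, and iteration over multiple parallel files:

```python
import csv
import gzip

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TranslationEvaluator

# Model and file names follow the usage example above
model = SentenceTransformer("distiluse-base-multilingual-cased")

src_sentences, trg_sentences = [], []
with gzip.open("talks-en-de.tsv.gz", "rt", encoding="utf8") as fIn:
    for row in csv.reader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE):
        src_sentences.append(row[0])  # source sentence, e.g. English
        trg_sentences.append(row[1])  # its translation, e.g. German

# Scores how often the correct translation is the nearest neighbour
# of its source sentence in the shared embedding space
evaluator = TranslationEvaluator(src_sentences, trg_sentences, batch_size=32)
print(evaluator(model))
```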
4 changes: 2 additions & 2 deletions examples/training/multilingual/README.md
@@ -113,7 +113,7 @@ In our experiments we initialized the student model with the multilingual XLM-Ro
## Training
For a **fully automatic code example**, see [make_multilingual.py](make_multilingual.py).

-This scripts downloads the [TED2020 corpus](https://github.com/UKPLab/sentence-transformers/blob/master/docs/datasets/TED2020.md?), a corpus with transcripts and translations from TED and TEDx talks. It than extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). TED2020 contains parallel data for more than 100 languages, hence, you can simple change the script and train a multilingual model in your favorite languages.
+This script downloads the parallel sentences corpus, a corpus with transcripts and translations from talks. It then extends a monolingual model to several languages (en, de, es, it, fr, ar, tr). This corpus contains parallel data for more than 100 languages; hence, you can simply change the script and train a multilingual model in your favorite languages.
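As a hedged sketch of what the automated script does under the hood (teacher and student model names as used in the script; the other settings here are simplified placeholders), the extension is a knowledge-distillation setup: the student is trained so that its embeddings of the source sentence and of every translation match the teacher's embedding of the source sentence:

```python
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: monolingual model whose embedding space the student should imitate.
# Student: multilingual model to be extended to new languages.
teacher_model = SentenceTransformer("paraphrase-distilroberta-base-v2")
student_model = SentenceTransformer("xlm-roberta-base")

# Pairs every sentence (source or translation) with the teacher's
# embedding of the corresponding source sentence
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences/talks-en-de-train.tsv.gz")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student_model)  # pull student embeddings towards the teacher's

student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, warmup_steps=10000)
```

See [make_multilingual.py](make_multilingual.py) for the full configuration, including embedding caching, several train files, and dev evaluators.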



@@ -158,7 +158,7 @@ A great website for a vast number of parallel (translated) datasets is [OPUS](ht
The [examples/training/multilingual](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/multilingual/) folder contains some scripts that download parallel training data and bring it into the right format:
- [get_parallel_data_opus.py](get_parallel_data_opus.py): This script downloads data from the [OPUS](http://opus.nlpl.eu/) website.
- [get_parallel_data_tatoeba.py](get_parallel_data_tatoeba.py): This script downloads data from the [Tatoeba](https://tatoeba.org/) website, a website for language learners with example sentences for many languages.
-- [get_parallel_data_ted2020.py](get_parallel_data_ted2020.py): This script downloads data the [TED2020 corpus](https://github.com/UKPLab/sentence-transformers/blob/master/docs/datasets/TED2020.md), which contains transcripts and translations of more than 4,000 TED and TEDx talks in 100+ languages.
+- [get_parallel_data_talks.py](get_parallel_data_talks.py): This script downloads data from the parallel sentences corpus, which contains transcripts and translations of more than 4,000 talks in 100+ languages.

## Evaluation

26 changes: 13 additions & 13 deletions examples/training/multilingual/get_parallel_data_talks.py
@@ -1,10 +1,10 @@
"""
-This script downloads the TED2020 corpus (https://github.com/UKPLab/sentence-transformers/blob/master/docs/datasets/TED2020.md)
-and create parallel sentences tsv files that can be used to extend existent sentence embedding models to new languages.
+This script downloads the parallel sentences corpus and creates parallel sentences tsv files that can be used to extend
+existing sentence embedding models to new languages.
-The TED2020 corpus is a crawl of transcripts from TED and TEDx talks, which are translated to 100+ languages.
+The parallel sentences corpus is a crawl of transcripts from talks, which are translated to 100+ languages.
-The TED2020 corpus cannot be downloaded automatically. It is available for research purposes only (CC-BY-NC).
+The parallel sentences corpus cannot be downloaded automatically. It is available for research purposes only (CC-BY-NC).
The training procedure can be found in the files make_multilingual.py and make_multilingual_sys.py.
@@ -24,17 +24,17 @@


dev_sentences = 1000 #Number of sentences we want to use for development
download_url = "" #Specify TED2020 URL here
ted2020_path = "../datasets/ted2020.tsv.gz" #Path of the TED2020.tsv.gz file.
download_url = "https://sbert.net/datasets/parallel-sentences.tsv.gz" #Specify parallel sentences URL here
parallel_sentences_path = "../datasets/parallel-sentences.tsv.gz" #Path of the parallel-sentences.tsv.gz file.
parallel_sentences_folder = "parallel-sentences/"




-os.makedirs(os.path.dirname(ted2020_path), exist_ok=True)
-if not os.path.exists(ted2020_path):
-    print("ted2020.tsv.gz does not exists. Try to download from server")
-    sentence_transformers.util.http_get(download_url, ted2020_path)
+os.makedirs(os.path.dirname(parallel_sentences_path), exist_ok=True)
+if not os.path.exists(parallel_sentences_path):
+    print("parallel-sentences.tsv.gz does not exist. Try to download from server")
+    sentence_transformers.util.http_get(download_url, parallel_sentences_path)



@@ -44,8 +44,8 @@
files_to_create = []
for source_lang in source_languages:
    for target_lang in target_languages:
-        output_filename_train = os.path.join(parallel_sentences_folder, "TED2020-{}-{}-train.tsv.gz".format(source_lang, target_lang))
-        output_filename_dev = os.path.join(parallel_sentences_folder, "TED2020-{}-{}-dev.tsv.gz".format(source_lang, target_lang))
+        output_filename_train = os.path.join(parallel_sentences_folder, "talks-{}-{}-train.tsv.gz".format(source_lang, target_lang))
+        output_filename_dev = os.path.join(parallel_sentences_folder, "talks-{}-{}-dev.tsv.gz".format(source_lang, target_lang))
        train_files.append(output_filename_train)
        dev_files.append(output_filename_dev)
        if not os.path.exists(output_filename_train) or not os.path.exists(output_filename_dev):
@@ -57,7 +57,7 @@

if len(files_to_create) > 0:
print("Parallel sentences files {} do not exist. Create these files now".format(", ".join(map(lambda x: x['src_lang']+"-"+x['trg_lang'], files_to_create))))
with gzip.open(ted2020_path, 'rt', encoding='utf8') as fIn:
with gzip.open(parallel_sentences_path, 'rt', encoding='utf8') as fIn:
reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
for line in tqdm(reader, desc="Sentences"):
for outfile in files_to_create:
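The body of the innermost loop is truncated in the hunk above. A simplified, self-contained sketch of the same routing idea (hypothetical language pairs, and without the train/dev split that the real script performs):

```python
import csv
import gzip
import os

parallel_sentences_path = "../datasets/parallel-sentences.tsv.gz"
pairs = [("en", "de"), ("en", "es")]  # hypothetical language pairs

# One gzipped output file per language pair
os.makedirs("parallel-sentences", exist_ok=True)
writers = {
    (src, trg): gzip.open(os.path.join("parallel-sentences", "talks-{}-{}-train.tsv.gz".format(src, trg)), "wt", encoding="utf8")
    for src, trg in pairs
}

with gzip.open(parallel_sentences_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line in reader:  # one row per sentence, one column per language code
        for (src, trg), fOut in writers.items():
            src_text = (line.get(src) or "").strip()
            trg_text = (line.get(trg) or "").strip()
            if src_text and trg_text:  # keep only rows where both translations exist
                fOut.write(src_text + "\t" + trg_text + "\n")

for fOut in writers.values():
    fOut.close()
```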
2 changes: 1 addition & 1 deletion examples/training/multilingual/get_parallel_data_tatoeba.py
@@ -10,7 +10,7 @@
import gzip

# Note: Tatoeba uses 3 letter language codes (ISO-639-2),
-# while other datasets like OPUS / TED2020 use 2 letter language codes (ISO-639-1)
+# while other datasets like OPUS use 2 letter language codes (ISO-639-1)
# For training of sentence transformers, which type of language code is used doesn't matter.
# For language codes, see: https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
source_languages = set(['eng'])
15 changes: 7 additions & 8 deletions examples/training/multilingual/make_multilingual.py
@@ -9,9 +9,8 @@
with the first column a sentence in a language understood by the teacher model, e.g. English,
and the further columns contain the corresponding translations for languages you want to extend to.
-This scripts downloads automatically the TED2020 corpus: https://github.com/UKPLab/sentence-transformers/blob/master/docs/datasets/TED2020.md
-This corpus contains transcripts from
-TED and TEDx talks, translated to 100+ languages. For other parallel data, see get_parallel_data_[].py scripts
+This script automatically downloads the parallel sentences corpus. This corpus contains transcripts from
+talks translated to 100+ languages. For other parallel data, see the get_parallel_data_[].py scripts
Further information can be found in our paper:
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
@@ -78,8 +77,8 @@ def download_corpora(filepaths):


# Here we define train and dev corpora
train_corpus = "datasets/ted2020.tsv.gz" # Transcripts of TED talks, crawled 2020
sts_corpus = "datasets/STS2017-extended.zip" # Extended STS2017 dataset for more languages
train_corpus = "datasets/parallel-sentences.tsv.gz"
sts_corpus = "datasets/stsbenchmark.zip"
parallel_sentences_folder = "parallel-sentences/"

# Check if the files exist. If not, they are downloaded
@@ -93,8 +92,8 @@ def download_corpora(filepaths):
files_to_create = []
for source_lang in source_languages:
    for target_lang in target_languages:
-        output_filename_train = os.path.join(parallel_sentences_folder, "TED2020-{}-{}-train.tsv.gz".format(source_lang, target_lang))
-        output_filename_dev = os.path.join(parallel_sentences_folder, "TED2020-{}-{}-dev.tsv.gz".format(source_lang, target_lang))
+        output_filename_train = os.path.join(parallel_sentences_folder, "talks-{}-{}-train.tsv.gz".format(source_lang, target_lang))
+        output_filename_dev = os.path.join(parallel_sentences_folder, "talks-{}-{}-dev.tsv.gz".format(source_lang, target_lang))
        train_files.append(output_filename_train)
        dev_files.append(output_filename_dev)
        if not os.path.exists(output_filename_train) or not os.path.exists(output_filename_dev):
@@ -217,5 +216,5 @@ def download_corpora(filepaths):
    evaluation_steps=num_evaluation_steps,
    output_path=output_path,
    save_best_model=True,
-    optimizer_params= {'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False}
+    optimizer_params= {'lr': 2e-5, 'eps': 1e-6}
)
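Background on the last hunk: `correct_bias` was an option of the deprecated `transformers.AdamW`. The library now defaults to `torch.optim.AdamW`, which performs the bias-corrected update unconditionally and accepts no such argument, so passing it would raise a `TypeError`. A minimal sketch of the optimizer that the remaining parameters configure (toy module standing in for the student model):

```python
import torch
from torch import nn

model = nn.Linear(16, 16)  # toy stand-in for the student model

# lr and eps map directly onto torch.optim.AdamW; there is no correct_bias argument
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-6)
```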
4 changes: 2 additions & 2 deletions examples/training/multilingual/make_multilingual_sys.py
@@ -9,7 +9,7 @@
with the first column a sentence in a language understood by the teacher model, e.g. English,
and the further columns contain the corresponding translations for languages you want to extend to.
-See get_parallel_data_[opus/tatoeba/ted2020].py for automatic download of parallel sentences datasets.
+See get_parallel_data_[opus/tatoeba/talks].py for automatic download of parallel sentences datasets.
Note: See make_multilingual.py for a fully automated script that downloads the necessary data and trains the model. This script just trains the model if you already have parallel data in the right format.
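To make the expected layout concrete, here is a hedged, made-up example that writes a two-row parallel file, with the first column an English (teacher-language) sentence and each further column a translation:

```python
import gzip
import os

# Made-up parallel rows: English source, then German and Spanish translations
rows = [
    ["Hello world", "Hallo Welt", "Hola mundo"],
    ["How are you?", "Wie geht es dir?", "¿Cómo estás?"],
]

os.makedirs("parallel-sentences", exist_ok=True)
with gzip.open("parallel-sentences/example-train.tsv.gz", "wt", encoding="utf8") as fOut:
    for row in rows:
        fOut.write("\t".join(row) + "\n")
```

A file written this way can then be passed as one of the training files in the usage examples below.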
@@ -23,7 +23,7 @@
python make_multilingual_sys.py train1.tsv.gz train2.tsv.gz train3.tsv.gz --dev dev1.tsv.gz dev2.tsv.gz
For example:
-python make_multilingual_sys.py parallel-sentences/TED2020-en-de-train.tsv.gz --dev parallel-sentences/TED2020-en-de-dev.tsv.gz
+python make_multilingual_sys.py parallel-sentences/talks-en-de-train.tsv.gz --dev parallel-sentences/talks-en-de-dev.tsv.gz
To load all training & dev files from a folder (Linux):
python make_multilingual_sys.py parallel-sentences/*-train.tsv.gz --dev parallel-sentences/*-dev.tsv.gz
