tests.txt
To be run inside the Docker container, from /work.
** FORMAT TESTS **
* TSV:
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-3612-presscorner_covid/v1/moses/en-es.txt.zip && unzip en-es.txt.zip && rm README LICENSE ELRC-3612-presscorner_covid.en-es.xml en-es.txt.zip && paste ELRC-3612-presscorner_covid.en-es.en ELRC-3612-presscorner_covid.en-es.es > ELRC.covid.en-es && rm ELRC-3612-presscorner_covid.en-es.en ELRC-3612-presscorner_covid.en-es.es && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.covid.en-es ./yaml_dir/ELRC.covid.en-es.yaml en es tsv parallel
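Optional sanity check before running the TSV test: `paste` joins the two sides with a tab, so a tab inside any source or target line would produce extra columns. A minimal sketch, assuming the file should have exactly two tab-separated fields per line. `check_tsv_columns` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not in the repo): exit 0 only if every line of the
# file given as $1 has exactly two tab-separated fields.
check_tsv_columns() {
    awk -F'\t' 'NF != 2 { bad = 1; exit } END { exit bad }' "$1"
}
```

Possible usage: check_tsv_columns ./uploaded_corpora/ELRC.covid.en-es && echo "two columns on every line"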
* TSV, REVERSED (special case "es-en")
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-3612-presscorner_covid/v1/moses/en-es.txt.zip && unzip en-es.txt.zip && rm README LICENSE ELRC-3612-presscorner_covid.en-es.xml en-es.txt.zip && paste ELRC-3612-presscorner_covid.en-es.es ELRC-3612-presscorner_covid.en-es.en > ELRC.covid.es-en && rm ELRC-3612-presscorner_covid.en-es.es ELRC-3612-presscorner_covid.en-es.en && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.covid.es-en ./yaml_dir/ELRC.covid.es-en.yaml es en tsv parallel
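If ELRC.covid.en-es from the TSV test is still in uploaded_corpora, the reversed file could also be derived from it instead of downloading the corpus again. A minimal sketch, assuming exactly two tab-separated columns per line. `swap_tsv_columns` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not in the repo): print the two-column TSV file given
# as $1 with its columns swapped, so "en<TAB>es" lines become "es<TAB>en".
swap_tsv_columns() {
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $2, $1 }' "$1"
}
```

Possible usage: swap_tsv_columns ./uploaded_corpora/ELRC.covid.en-es > ./uploaded_corpora/ELRC.covid.es-en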
* TMX:
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-2638-monumentos_2007/v1/tmx/en-es.tmx.gz && zcat en-es.tmx.gz > ELRC.monumentos.en-es && rm en-es.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.monumentos.en-es ./yaml_dir/ELRC.monumentos.en-es.yaml en es tmx parallel
* BITEXTS
//Coming soon...
* MONO:
cd /work/uploaded_corpora/ && wget https://object.pouta.csc.fi/OPUS-ELRC-3077-wikipedia_health/v1/mono/en.txt.gz && zcat en.txt.gz > ELRC.health.en && rm en.txt.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.health.en ./yaml_dir/ELRC.health.en.yaml en - tsv mono
** LANGUAGE COMBINATION TESTS (RESULTING IN DIFFERENT BICLEANER STRATEGIES) **
* CLASSIC BICLEANER WITH EN (en-it)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-vaccination/v1/tmx/en-it.tmx.gz && zcat en-it.tmx.gz > ELRC.vaccination.en-it && rm en-it.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.vaccination.en-it ./yaml_dir/ELRC.vaccination.en-it.yaml en it tmx parallel
* CLASSIC BICLEANER WITH EN, REVERSED (it-en)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-vaccination/v1/tmx/en-it.tmx.gz && zcat en-it.tmx.gz > ELRC.vaccination.it-en && rm en-it.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.vaccination.it-en ./yaml_dir/ELRC.vaccination.it-en.yaml it en tmx parallel
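Note that the reversed TMX tests download the same file as the non-reversed ones; only the output name and the order of the language arguments change. To confirm which language codes a TMX file actually declares before choosing the arguments, something like the following may help. It is a sketch that assumes the languages are marked with `xml:lang` attributes (as in TMX 1.4) and that the attribute uses double quotes. `tmx_langs` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not in the repo): list the distinct xml:lang values
# found in the TMX file given as $1, one per line, sorted.
tmx_langs() {
    grep -o 'xml:lang="[^"]*"' "$1" | cut -d'"' -f2 | sort -u
}
```

Possible usage: tmx_langs ./uploaded_corpora/ELRC.vaccination.it-en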
* CLASSIC BICLEANER WITH ES (es-ca)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-5190-Cyber_MT_Test/v1/tmx/ca-es.tmx.gz && zcat ca-es.tmx.gz > ELRC.cyber.es-ca && rm ca-es.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.cyber.es-ca ./yaml_dir/ELRC.cyber.es-ca.yaml es ca tmx parallel
* CLASSIC BICLEANER WITH ES, REVERSED (ca-es)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-5190-Cyber_MT_Test/v1/tmx/ca-es.tmx.gz && zcat ca-es.tmx.gz > ELRC.cyber.ca-es && rm ca-es.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.cyber.ca-es ./yaml_dir/ELRC.cyber.ca-es.yaml ca es tmx parallel
* BICLEANER AI WITH EN (en-eu)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-4993-Basque_Wikinews_MT/v1/tmx/en-eu.tmx.gz && zcat en-eu.tmx.gz > ELRC.wikinews.en-eu && rm en-eu.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.wikinews.en-eu ./yaml_dir/ELRC.wikinews.en-eu.yaml en eu tmx parallel
* BICLEANER AI WITH EN, REVERSED (eu-en)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-ELRC-4993-Basque_Wikinews_MT/v1/tmx/en-eu.tmx.gz && zcat en-eu.tmx.gz > ELRC.wikinews.eu-en && rm en-eu.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.wikinews.eu-en ./yaml_dir/ELRC.wikinews.eu-en.yaml eu en tmx parallel
* BICLEANER AI WITH ES (es-zh)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-tico-19/v2020-10-28/tmx/es-zh.tmx.gz && zcat es-zh.tmx.gz > tico19.es-zh && rm es-zh.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/tico19.es-zh ./yaml_dir/tico19.es-zh.yaml es zh tmx parallel
* BICLEANER AI WITH ES, REVERSED (zh-es)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-tico-19/v2020-10-28/tmx/es-zh.tmx.gz && zcat es-zh.tmx.gz > tico19.zh-es && rm es-zh.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/tico19.zh-es ./yaml_dir/tico19.zh-es.yaml zh es tmx parallel
* BICLEANER AI MULTILINGUAL EN-XX (en-eo)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-Books/v1/tmx/en-eo.tmx.gz && zcat en-eo.tmx.gz > books.en-eo && rm en-eo.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/books.en-eo ./yaml_dir/books.en-eo.yaml en eo tmx parallel
* BICLEANER AI MULTILINGUAL XX-EN, REVERSED (eo-en)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-Books/v1/tmx/en-eo.tmx.gz && zcat en-eo.tmx.gz > books.eo-en && rm en-eo.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/books.eo-en ./yaml_dir/books.eo-en.yaml eo en tmx parallel
* NO BICLEANER (de-ru)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-KDEdoc/v1/tmx/de-ru.tmx.gz && zcat de-ru.tmx.gz > kde.de-ru && rm de-ru.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/kde.de-ru ./yaml_dir/kde.de-ru.yaml de ru tmx parallel
* NO BICLEANER (es-ru)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/tmx/es-ru.tmx.gz && zcat es-ru.tmx.gz > tilde.es-ru && rm es-ru.tmx.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/tilde.es-ru ./yaml_dir/tilde.es-ru.yaml es ru tmx parallel
* MONOCLEANER (en)
cd /work/uploaded_corpora/ && wget https://object.pouta.csc.fi/OPUS-ELRC-3077-wikipedia_health/v1/mono/en.txt.gz && zcat en.txt.gz > ELRC.health.en && rm en.txt.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ELRC.health.en ./yaml_dir/ELRC.health.en.yaml en - tsv mono
* MONOCLEANER (es)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-Ubuntu/v14.10/mono/es.txt.gz && zcat es.txt.gz > ubuntu.es && rm es.txt.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/ubuntu.es ./yaml_dir/ubuntu.es.yaml es - tsv mono
* NO MONOCLEANER (eo)
cd /work/uploaded_corpora && wget https://object.pouta.csc.fi/OPUS-Books/v1/mono/eo.txt.gz && zcat eo.txt.gz > books.eo && rm eo.txt.gz && cd /work
bash ./scripts/runstats.sh ./uploaded_corpora/books.eo ./yaml_dir/books.eo.yaml eo - tsv mono
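All runstats.sh calls above follow the same pattern: corpus path, YAML output path, source language, target language ("-" for mono), format, and parallel/mono. This reading of the arguments is inferred from the commands in this file, not from the script itself. A small sketch that prints the command for a given corpus name, which may help when adding new tests. `runstats_cmd` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not in the repo): print the runstats.sh command line
# for a corpus stored as ./uploaded_corpora/<name>.
# Arguments: <name> <src_lang> <trg_lang or -> <format> <parallel|mono>
# The argument order is inferred from the test commands in this file.
runstats_cmd() {
    printf 'bash ./scripts/runstats.sh ./uploaded_corpora/%s ./yaml_dir/%s.yaml %s %s %s %s\n' \
        "$1" "$1" "$2" "$3" "$4" "$5"
}
```

Possible usage: runstats_cmd kde.de-ru de ru tmx parallel
This only prints the command; to run it, copy the printed line into the shell.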