Update from Erick's changes for WMT19 #2

Open: wants to merge 22 commits into `master`
127 changes: 75 additions & 52 deletions README.md
Quality Estimation for NMT
======

This is an updated version of the WMT word-level quality estimation task (Bojar
words that can be related to target side errors and one or more consecutive
insertions after tercom alignment are indicated as a single gap (insertion)
error.

# Preprocessing tools

## Tokenization
Before generating alignment tags, it is necessary to tokenize and truecase the source, MT, and post-edited files. The __moses__ tokenizer is the default choice for most languages, but different tokenizers may be better suited to some languages.

We provide below a breakdown of the proposed tokenizer per language (the indicated tokenizers are used by default for the generation of the MLQE-PE 2021 data).
| Language  | Code | Tokenizer  |
|-----------|------|------------|
| Chinese   | zh   | jieba      |
| Czech     | cs   | moses      |
| English   | en   | moses      |
| Estonian  | et   | moses      |
| German    | de   | moses      |
| Japanese  | ja   | fugashi    |
| Khmer     | km   | nltk-khmer |
| Marathi   | mr   | indic_nlp  |
| Nepali    | ne   | indic_nlp  |
| Pashto    | ps   | moses      |
| Romanian  | ro   | moses      |
| Russian   | ru   | moses      |
| Sinhala   | si   | indic_nlp  |
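The mapping above can be expressed as a small dispatch table. This is an illustrative sketch only; the helper name and the moses fallback for unlisted languages are assumptions, not part of the repo:

```python
# Default tokenizer per language code, mirroring the table above.
TOKENIZERS = {
    "zh": "jieba", "cs": "moses", "en": "moses", "et": "moses",
    "de": "moses", "ja": "fugashi", "km": "nltk-khmer", "mr": "indic_nlp",
    "ne": "indic_nlp", "ps": "moses", "ro": "moses", "ru": "moses",
    "si": "indic_nlp",
}

def default_tokenizer(lang: str) -> str:
    # Assumption: fall back to moses for languages not in the table.
    return TOKENIZERS.get(lang, "moses")
```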

We provide installation and versioning information below for each of the proposed tokenizers:

### Moses Installation
There are various wrappers for the moses tokenizer, with small output discrepancies among them. For the WMT2021 QE shared task we use the perl mosestokenizer script, made available in the scripts of the mosesdecoder [github repo](https://github.com/moses-smt/mosesdecoder).

__Usage:__ Apart from specifying the language extension (en|de|cs etc.), we also use the `--no-escape` option, which prevents the automatic conversion of special characters such as `'` into HTML entities such as `&apos;`.

### Jieba Installation
The jieba tokenizer can be installed easily with:

    pip install jieba

For the WMT2021 QE shared task the jieba version used is jieba 0.42.1.

### Fugashi Installation
The fugashi tokenizer (used for Japanese) can be installed, together with a MeCab dictionary, with:

    pip install fugashi unidic-lite

For the WMT2021 QE shared task the fugashi version used is fugashi-1.1.1.

### Indic-NLP Installation
Requires the installation of the indic-nlp-library (for WMT21 the version used was indic-nlp-library-0.81):

    pip install indic-nlp-library

After installation it is necessary to create a directory for the Indic NLP resources and then export its path. The default setup is to have the directory in `external_tools`:

    export INDIC_RESOURCES_PATH='qe-corpus-builder/external_tools/indic_nlp_resources'


## True-casing

True-casing needs to precede the MT-PE alignments and HTER calculation. Moses was used to train and apply true-casing for all language pairs. New models can be trained with the perl script made available in the `mosesdecoder` repo.

    perl /path/to/moses/scripts/recaser/truecase.perl --model truecaser.model < text.tok.source > text.tok.tc.src
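Conceptually, the truecaser learns the most frequent surface casing of each word from training data and recases sentence-initial tokens accordingly. A toy Python illustration of that idea follows; it is not the moses implementation, and the model contents are invented:

```python
def truecase_tokens(tokens, model):
    """Recase the sentence-initial token using a learned casing model.

    model maps a lowercased word to its most frequent surface form;
    unknown words are left untouched.
    """
    if not tokens:
        return tokens
    first = model.get(tokens[0].lower(), tokens[0])
    return [first] + tokens[1:]

# Invented toy model: 'the' is usually lowercase, 'paris' usually capitalized.
model = {"the": "the", "paris": "Paris"}
print(truecase_tokens(["The", "man", "went", "home"], model))
```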


## Alignment
To obtain HTER scores and word tags, we need to align source-MT, source-PE and MT-PE.

To extract the source-MT/PE alignments we use __Simalign__.

### Simalign Installation
Install Simalign from source ([github repo](https://github.com/cisnlp/simalign)) or via pip:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign


### Usage
We use the multilingual XLM-Roberta (base) model as encoder, and follow the SimAlign paper to decide the matching mode based on the language pairs [[1]](#1).

__Notes__:
Previous versions of the corpus builder used fast-align to obtain the alignments. See [the previous github version](https://github.com/deep-spin/qe-corpus-builder) for more details.
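Word alignments of this kind are conventionally stored one sentence per line as Pharaoh-format `i-j` pairs of 0-based source/target token indices. Assuming that convention for the generated `.alignments` files, a small stdlib helper to read one line (the function name is illustrative):

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line, e.g. '0-0 1-2 2-1', into index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]

print(parse_pharaoh("0-0 1-2 2-1"))  # [(0, 0), (1, 2), (2, 1)]
```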

### TerCOM Installation
Tercom can be downloaded from:

    http://www.cs.umd.edu/~snover/tercom/

If you are successful, the following file should be available:

    ./external_tools/tercom-0.7.25/tercom.7.25.jar

# Steps

The corpus can be generated by calling `generate_tags_hter.py` from the `corpus_generation` folder. An example of running it for the en-de language pair follows:

    python3 generate_tags_hter.py --src /path/to/data/folder/and/src/file/dev.src --mt /path/to/data/folder/and/src/file/dev.mt --pe /path/to/data/folder/and/src/file/dev.pe --src_lang en --tgt_lang de --src_tc /path/to/trained/truecase/src/model/truecase.all.en.ms.model --tgt_tc /path/to/trained/truecase/tgt/model/truecase.all.de.ms.model --token --truecase --align

To convert the MLQE-PE data to the [WMT QE 2022](https://wmt-qe-task.github.io/subtasks/task1/) word-level tags (without gap tags and annotating deletions to the right) you can use:

    python3 generate_tags_hter.py --src /path/to/data/folder/and/src/file/dev.src --mt /path/to/data/folder/and/src/file/dev.mt --pe /path/to/data/folder/and/src/file/dev.pe --src_lang en --tgt_lang de --src_tc /path/to/trained/truecase/src/model/truecase.all.en.ms.model --tgt_tc /path/to/trained/truecase/tgt/model/truecase.all.de.ms.model --token --truecase --align --delete right
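The HTER score written out for each sentence is the TER of the MT against its post-edit, capped at 1 (the capping is what the script's awk post-processing does). A quick sketch of the computation, assuming the edit count and post-edit length are already known:

```python
def hter(num_edits, pe_len):
    """HTER: edit operations divided by post-edit length, capped at 1.0."""
    if pe_len == 0:
        return 1.0 if num_edits else 0.0
    return min(num_edits / pe_len, 1.0)

print(hter(3, 10))   # 0.3
print(hter(15, 10))  # 1.0 (capped)
```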

# References
<a id="1">[1]</a> Sabet, Masoud Jalili, et al. "SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
8 changes: 2 additions & 6 deletions corpus_generation/README.md
post-edited text. From the alignments, insertions and substitutions on MT are
marked as BAD tags over the words. Deletions are marked as BAD tags over the
gaps.

Using alignments between MT and PE (tercom) and source-PE (SimAlign), target
side BAD tags are propagated back to the source to signal words that are
related to the error. Three versions were considered in this first version of
the corpus. These can be selected inside the `get_tags*` scripts using the
variable `$fluency_rule`.
- `ignore-shift-set` if a BAD token appears also in PE do not propagate to source
- `missing-only` only propagate for missing words (deletions)

Default setting used was `normal`.

97 changes: 97 additions & 0 deletions corpus_generation/generate_tags_hter.py
import os
import subprocess
import sys
import argparse

sys.path.insert(0, 'tools')

from tools.preprocess import truecase
from tools.aligners import align_simaligner as align
from tools.generate_BAD_tags import generate_bad_ok_tags


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', help='Source', type=str)
    parser.add_argument('--mt', help='MT hypothesis (target)', type=str)
    parser.add_argument('--pe', help='Post edited translation', type=str)
    parser.add_argument('--src_lang', help='Source language abbreviation', type=str)
    parser.add_argument('--tgt_lang', help='Target language abbreviation', type=str)
    parser.add_argument('--src_tc', type=str, default='')
    parser.add_argument('--tgt_tc', type=str, default='')
    parser.add_argument('--token', dest='token', action='store_true')
    parser.add_argument('--truecase', dest='truecase', action='store_true')
    parser.add_argument('--gaps', dest='gaps', action='store_true')
    parser.add_argument('--delete', type=str, default='none')
    parser.add_argument('--align', dest='align', action='store_true')

    parser.set_defaults(token=False)
    parser.set_defaults(truecase=False)
    parser.set_defaults(gaps=False)
    parser.set_defaults(align=False)

    return parser.parse_args()


def main(args):
    print(args)
    src = args.src
    mt = args.mt
    pe = args.pe
    path = os.path.dirname(os.path.abspath(src))

    if args.token:
        from tools.alltokenisers import tokenize
        tok_src = tokenize(args.src_lang, src)
        tok_mt = tokenize(args.tgt_lang, mt)
        tok_pe = tokenize(args.tgt_lang, pe)
    else:
        tok_src = src
        tok_mt = mt
        tok_pe = pe

    if args.truecase:
        truecase_script = "../external_tools/mosesdecoder/scripts/recaser/truecase.perl"
        tok_tc_src = truecase(truecase_script, args.src_tc, tok_src)
        tok_tc_mt = truecase(truecase_script, args.tgt_tc, tok_mt)
        tok_tc_pe = truecase(truecase_script, args.tgt_tc, tok_pe)
    else:
        tok_tc_src = tok_src
        tok_tc_mt = tok_mt
        tok_tc_pe = tok_pe

    # Strip only the final extension (paths may contain other dots).
    stem = src.rsplit('.', 1)[0]
    src_mt_align = stem + '.src-mt.alignments'
    src_pe_align = stem + '.src-pe.alignments'
    mt_pe_align = stem + '.mt-pe.alignments'

    if args.align:
        align(tok_src, tok_mt, src_mt_align, [args.src_lang, args.tgt_lang])
        align(tok_src, tok_pe, src_pe_align, [args.src_lang, args.tgt_lang])

    tercom = "temp/tercom/"

    # First tercom pass: produce the MT-PE alignment needed for tagging.
    params = ["bash ./tools/tercom.sh " + tok_tc_mt + " " + tok_tc_pe + " " + tercom + " " + mt_pe_align + " false"]
    print(" ".join(params))
    p = subprocess.Popen(params, shell=True)
    p.wait()

    src_tags = stem + '.source_tags'
    tgt_tags = stem + '.target_tags'
    hter = stem + '.hter'
    print("ALIGNED MT PE")

    generate_bad_ok_tags(tok_tc_src, tok_tc_mt, tok_tc_pe, mt_pe_align, src_pe_align, 'normal', src_tags, tgt_tags, args.gaps, args.delete)

    # Second tercom pass: compute per-sentence TER scores.
    params = ["bash ./tools/tercom.sh " + tok_tc_mt + " " + tok_tc_pe + " " + tercom + " " + mt_pe_align + " true"]
    p2 = subprocess.Popen(params, shell=True)
    p2.wait()

    # Cap per-sentence TER at 1 to obtain the HTER file.
    params = ["tail -n +3 temp/tercom/out_tercom_file.ter | awk '{if ($4 > 1) hter=1; else hter=$4; printf \"%.6f\\n\",hter}' > " + hter]
    p3 = subprocess.Popen(params, shell=True)
    p3.wait()
    print('processed')


if __name__ == '__main__':
    args = parse_args()
    main(args)
67 changes: 0 additions & 67 deletions corpus_generation/get_tags_wmt2015.sh

This file was deleted.
