Update from Erick's changes for WMT19 #2

Open: wants to merge 22 commits into `master`
127 changes: 75 additions & 52 deletions README.md
Quality Estimation for NMT
======

This is an updated version of the WMT word-level quality estimation task (Bojar
words that can be related to target side errors and one or more consecutive
insertions after tercom alignment are indicated as a single gap (insertion)
error.

# Preprocessing tools

## Tokenization
Before generating alignment tags, it is necessary to tokenize and truecase the source, MT, and post-edited files. The __moses__ tokenizer is the default choice for most languages, but different tokenizers may be better suited to some languages.

We provide below a breakdown of the proposed tokenizer per language (the indicated tokenizers are used by default for the generation of the MLQE-PE 2021 data).
| Language  | Code | Tokenizer  |
|-----------|------|------------|
| Chinese   | zh   | jieba      |
| Czech     | cs   | moses      |
| English   | en   | moses      |
| Estonian  | et   | moses      |
| German    | de   | moses      |
| Japanese  | ja   | fugashi    |
| Khmer     | km   | nltk-khmer |
| Marathi   | mr   | indic_nlp  |
| Nepali    | ne   | indic_nlp  |
| Pashto    | ps   | moses      |
| Romanian  | ro   | moses      |
| Russian   | ru   | moses      |
| Sinhala   | si   | indic_nlp  |
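The mapping above can be expressed as a small dispatch table. This is an illustrative sketch only; the helper name and the moses fallback for unlisted languages are assumptions, not part of the repo:

```python
# Default tokenizer per language code, mirroring the table above.
TOKENIZERS = {
    "zh": "jieba", "cs": "moses", "en": "moses", "et": "moses",
    "de": "moses", "ja": "fugashi", "km": "nltk-khmer", "mr": "indic_nlp",
    "ne": "indic_nlp", "ps": "moses", "ro": "moses", "ru": "moses",
    "si": "indic_nlp",
}

def default_tokenizer(lang: str) -> str:
    # Assumption: fall back to moses for languages not in the table.
    return TOKENIZERS.get(lang, "moses")
```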

We provide installation and versioning information below for each of the proposed tokenizers:

### Moses Installation
There are various wrappers for the moses tokenizer, with small output discrepancies among them. For the WMT2021 QE shared task we use the perl mosestokenizer script, made available in the scripts of the mosesdecoder [github repo](https://github.com/moses-smt/mosesdecoder).

__Usage:__ Apart from specifying the language extension (en|de|cs etc.), we also use the `--no-escape` option, which prevents the automatic conversion of special characters such as `'` into HTML entities such as `&apos;`.

### Jieba Installation
The jieba tokenizer can be installed easily with:

    pip install jieba

For the WMT2021 QE shared task the jieba version used is jieba 0.42.1.

### Fugashi Installation
The fugashi tokenizer (used for Japanese) can be installed, together with a MeCab dictionary, with:

    pip install fugashi unidic-lite

For the WMT2021 QE shared task the fugashi version used is fugashi-1.1.1.

### Indic-NLP Installation
Requires the installation of the indic-nlp-library (for WMT21 the version used was indic-nlp-library-0.81):

    pip install indic-nlp-library

After installation it is necessary to create a directory for the Indic NLP resources and then export its path. The default setup is to have the directory in `external_tools`:

    export INDIC_RESOURCES_PATH='qe-corpus-builder/external_tools/indic_nlp_resources'


## True-casing

True-casing needs to precede the MT-PE alignments and HTER calculation. Moses was used to train and apply true-casing for all language pairs. New models can be trained with the perl script made available in the `mosesdecoder` repo.

    perl /path/to/moses/scripts/recaser/truecase.perl --model truecaser.model < text.tok.source > text.tok.tc.src
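Conceptually, the truecaser learns the most frequent surface casing of each word from training data and recases sentence-initial tokens accordingly. A toy Python illustration of that idea follows; it is not the moses implementation, and the model contents are invented:

```python
def truecase_tokens(tokens, model):
    """Recase the sentence-initial token using a learned casing model.

    model maps a lowercased word to its most frequent surface form;
    unknown words are left untouched.
    """
    if not tokens:
        return tokens
    first = model.get(tokens[0].lower(), tokens[0])
    return [first] + tokens[1:]

# Invented toy model: 'the' is usually lowercase, 'paris' usually capitalized.
model = {"the": "the", "paris": "Paris"}
print(truecase_tokens(["The", "man", "went", "home"], model))
```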


## Alignment
To obtain HTER scores and word tags, we need to align source-MT, source-PE and MT-PE.

To extract the source-MT/PE alignments we use __Simalign__.

### Simalign Installation
Install Simalign from source ([github repo](https://github.com/cisnlp/simalign)) or via pip:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign


### Usage
We use the multilingual XLM-Roberta (base) model as encoder, and follow the SimAlign paper to decide the matching mode based on the language pairs [[1]](#1).

__Notes__:
Previous versions of the corpus builder used fast-align to obtain the alignments. See [the previous github version](https://github.com/deep-spin/qe-corpus-builder) for more details.
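Word alignments of this kind are conventionally stored one sentence per line as Pharaoh-format `i-j` pairs of 0-based source/target token indices. Assuming that convention for the generated `.alignments` files, a small stdlib helper to read one line (the function name is illustrative):

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line, e.g. '0-0 1-2 2-1', into index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]

print(parse_pharaoh("0-0 1-2 2-1"))  # [(0, 0), (1, 2), (2, 1)]
```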

### TerCOM Installation
Tercom can be downloaded from:

    http://www.cs.umd.edu/~snover/tercom/

If you are successful, the following file should be available:

    ./external_tools/tercom-0.7.25/tercom.7.25.jar

# Steps

The corpus can be generated by calling `generate_tags_hter.py` from the `corpus_generation` folder. An example of running it for the en-de language pair follows:

    python3 generate_tags_hter.py --src /path/to/data/folder/and/src/file/dev.src --mt /path/to/data/folder/and/src/file/dev.mt --pe /path/to/data/folder/and/src/file/dev.pe --src_lang en --tgt_lang de --src_tc /path/to/trained/truecase/src/model/truecase.all.en.ms.model --tgt_tc /path/to/trained/truecase/tgt/model/truecase.all.de.ms.model --token --truecase --align

To convert the MLQE-PE data to the [WMT QE 2022](https://wmt-qe-task.github.io/subtasks/task1/) word-level tags (without gap tags and annotating deletions to the right) you can use:

    python3 generate_tags_hter.py --src /path/to/data/folder/and/src/file/dev.src --mt /path/to/data/folder/and/src/file/dev.mt --pe /path/to/data/folder/and/src/file/dev.pe --src_lang en --tgt_lang de --src_tc /path/to/trained/truecase/src/model/truecase.all.en.ms.model --tgt_tc /path/to/trained/truecase/tgt/model/truecase.all.de.ms.model --token --truecase --align --delete right
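The HTER score written out for each sentence is the TER of the MT against its post-edit, capped at 1 (the capping is what the script's awk post-processing does). A quick sketch of the computation, assuming the edit count and post-edit length are already known:

```python
def hter(num_edits, pe_len):
    """HTER: edit operations divided by post-edit length, capped at 1.0."""
    if pe_len == 0:
        return 1.0 if num_edits else 0.0
    return min(num_edits / pe_len, 1.0)

print(hter(3, 10))   # 0.3
print(hter(15, 10))  # 1.0 (capped)
```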

# References
<a id="1">[1]</a> Sabet, Masoud Jalili, et al. "SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
8 changes: 2 additions & 6 deletions corpus_generation/README.md
post-edited text. From the alignments, insertions and substitutions on MT are
marked as BAD tags over the words. Deletions are marked as BAD tags over the
gaps.

Using alignments between MT and PE (tercom) and source-PE (SimAlign), target
side BAD tags are propagated back to the source to signal words that are
related to the error. Three versions were considered in this first version of
the corpus. These can be selected inside the `get_tags*` scripts using the
variable `$fluency_rule`.
- `ignore-shift-set` if a BAD token appears also in PE do not propagate to source
- `missing-only` only propagate for missing words (deletions)

Default setting used was `normal`.

97 changes: 97 additions & 0 deletions corpus_generation/generate_tags_hter.py
import os
import subprocess
import sys
import argparse

sys.path.insert(0, 'tools')

from tools.preprocess import truecase
from tools.aligners import align_simaligner as align
from tools.generate_BAD_tags import generate_bad_ok_tags


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--src', help='Source', type=str)
    parser.add_argument('--mt', help='MT hypothesis (target)', type=str)
    parser.add_argument('--pe', help='Post edited translation', type=str)
    parser.add_argument('--src_lang', help='Source language abbreviation', type=str)
    parser.add_argument('--tgt_lang', help='Target language abbreviation', type=str)
    parser.add_argument('--src_tc', type=str, default='')
    parser.add_argument('--tgt_tc', type=str, default='')
    parser.add_argument('--token', dest='token', action='store_true')
    parser.add_argument('--truecase', dest='truecase', action='store_true')
    parser.add_argument('--gaps', dest='gaps', action='store_true')
    parser.add_argument('--delete', type=str, default='none')
    parser.add_argument('--align', dest='align', action='store_true')

    parser.set_defaults(token=False)
    parser.set_defaults(truecase=False)
    parser.set_defaults(gaps=False)
    parser.set_defaults(align=False)

    return parser.parse_args()


def main(args):
    print(args)
    src = args.src
    mt = args.mt
    pe = args.pe
    path = os.path.dirname(os.path.abspath(src))

    if args.token:
        from tools.alltokenisers import tokenize
        tok_src = tokenize(args.src_lang, src)
        tok_mt = tokenize(args.tgt_lang, mt)
        tok_pe = tokenize(args.tgt_lang, pe)
    else:
        tok_src = src
        tok_mt = mt
        tok_pe = pe

    if args.truecase:
        truecase_script = "../external_tools/mosesdecoder/scripts/recaser/truecase.perl"
        tok_tc_src = truecase(truecase_script, args.src_tc, tok_src)
        tok_tc_mt = truecase(truecase_script, args.tgt_tc, tok_mt)
        tok_tc_pe = truecase(truecase_script, args.tgt_tc, tok_pe)
    else:
        tok_tc_src = tok_src
        tok_tc_mt = tok_mt
        tok_tc_pe = tok_pe

    # Strip only the final extension (paths may contain other dots).
    stem = src.rsplit('.', 1)[0]
    src_mt_align = stem + '.src-mt.alignments'
    src_pe_align = stem + '.src-pe.alignments'
    mt_pe_align = stem + '.mt-pe.alignments'

    if args.align:
        align(tok_src, tok_mt, src_mt_align, [args.src_lang, args.tgt_lang])
        align(tok_src, tok_pe, src_pe_align, [args.src_lang, args.tgt_lang])

    tercom = "temp/tercom/"

    # First tercom pass: produce the MT-PE alignment needed for tagging.
    params = ["bash ./tools/tercom.sh " + tok_tc_mt + " " + tok_tc_pe + " " + tercom + " " + mt_pe_align + " false"]
    print(" ".join(params))
    p = subprocess.Popen(params, shell=True)
    p.wait()

    src_tags = stem + '.source_tags'
    tgt_tags = stem + '.target_tags'
    hter = stem + '.hter'
    print("ALIGNED MT PE")

    generate_bad_ok_tags(tok_tc_src, tok_tc_mt, tok_tc_pe, mt_pe_align, src_pe_align, 'normal', src_tags, tgt_tags, args.gaps, args.delete)

    # Second tercom pass: compute per-sentence TER scores.
    params = ["bash ./tools/tercom.sh " + tok_tc_mt + " " + tok_tc_pe + " " + tercom + " " + mt_pe_align + " true"]
    p2 = subprocess.Popen(params, shell=True)
    p2.wait()

    # Cap per-sentence TER at 1 to obtain the HTER file.
    params = ["tail -n +3 temp/tercom/out_tercom_file.ter | awk '{if ($4 > 1) hter=1; else hter=$4; printf \"%.6f\\n\",hter}' > " + hter]
    p3 = subprocess.Popen(params, shell=True)
    p3.wait()
    print('processed')


if __name__ == '__main__':
    args = parse_args()
    main(args)
67 changes: 0 additions & 67 deletions corpus_generation/get_tags_wmt2015.sh

This file was deleted.
