Added README

marian-nmt · May 27, 2023 · d3d4d55 · d3d4d55
1 parent e30a130
commit d3d4d55
Showing 1 changed file with 135 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1 +1,135 @@
-Coming soon (really).
+# DocMT code and data
+
+This repository contains code and data used for the experiments in the following paper:
+
+> Escaping the sentence-level paradigm in machine translation
+> Matt Post, Marcin Junczys-Dowmunt
+> https://arxiv.org/abs/2304.12959v1
+
+## Setup
+
+Install the submodules, which are edits we made to existing repositories:
+
+    git submodule init
+    git submodule update
+
+This will install our forks of five public repositories under `./ext`.
+These are:
+
+* `ext/ContraPro` ([ContraPro](https://github.com/ZurichNLP/ContraPro)
+* `ext/ContraPro-EN-FR` ([Large-contrastive-pronoun-testset-EN-FR](https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR))
+* `ext/ContraWSD` ([ContraWSD](https://github.com/ZurichNLP/ContraWSD)
+* `ext/GTWiC` ([GTWiC](https://github.com/lena-voita/good-translation-wrong-in-context))
+* `ext/discourse-mt-test-sets` ([discourse-mt-test-sets](https://github.com/rbawden/discourse-mt-test-sets))
+
+## Scripts
+
+Under `bin/` are two scripts:
+
+* `pack.py`: Takes a TSV stream of (sentence,docid) and assembles single-line documents with sentences joined by a delimiter (` <eos>` by default)
+* `extract_sent.py`: Takes a single-line assembled document and extracts the requested sentence.
+
+## ContraPro (English--German)
+
+**Generating the files**
+
+Using our modified ContraPro dataset, you first need to change to that directory and run the download script to download OpenSubs. Then run:
+
+    # maximum 250 tokens, en-de
+    ./ext/ContraPro/bin/json2text.py -m 250 --spm /path/to/spm/model --json-file ContraPro/contrapro.json > contrapro.en-de.tsv
+
+This will print the file to STDOUT: 36,031 lines of correct and contrastive lines. One of the fields has a value of "correct" or "contrastive"; for generative results, use only the former. You can output only these using the `--correct-only` flag.
+
+For French, you need to pass `-0` (since its JSON file uses 0-indexing), along with some other options.
+
+    # max 10 sents, en-fr
+    cd ext/ContraPro-EN-FR/OpenSubs
+    ../../ContraPro/bin/json2text.py --dir ./documents --json-file testset-en-fr.json -s en -t fr -m 250 -ms 10 --spm /path/to/spm/model -0 --correct-only > genpro.max250+10.en-fr.tsv
+
+There are many other options:
+  -h, --help            show this help message
+                        and exit
+  --source SOURCE, -s SOURCE
+  --target TARGET, -t TARGET
+  --dir DIR, -d DIR
+  --max-sents MAX_SENTS, -ms MAX_SENTS
+                        Maximum number of
+                        context sentences
+  --max-tokens MAX_TOKENS, -m MAX_TOKENS
+                        Maximum length in
+                        subword tokens
+  --sents-before SENTS_BEFORE, -sb SENTS_BEFORE
+                        Num sentences previous
+                        context
+  --tokens-before TOKENS_BEFORE, -tb TOKENS_BEFORE
+                        Num tokens in previous
+                        context
+  --separator SEPARATOR
+  --spm SPM
+  --zero, -0            indices are already
+                        zeroed (French)
+  --offset OFFSET       Add this number to each
+                        segment ID
+  --correct-only        only output correct
+                        lines
+  --json-file JSON_FILE, -j JSON_FILE
+
+**File format**
+
+Each file is a TSV with the following fields:
+- index (0-based) of the "payload" sentence
+- distance of anaphora
+- whether the sentence is a correct or contrastive variant
+- the correct pronoun
+- the complete source sentence with context
+- the complete reference sentence with context
+
+**Translating**
+
+You just want the fourth field.
+
+    cut -f 4 genpro.max250.en-de.tsv \
+    | your-decoder [args] \
+    > out-doc.genpro.max250.en-de.tsv
+
+To translate just sentences without context, use `./bin/extract_sent.py`, which extracts a sentence from a line, with `<eos>` tags as the delimiter:
+
+    # grab the source field, get the last sentence, translate
+    export PATH=$PATH:$(pwd)/bin
+    cut -f 4 genpro.max250.en-de.tsv \
+    | extract_sent.py -i -1 \
+    | your-decoder [args] \
+    > out-sent.genpro.max250.en-de.tsv
+
+**Evaluation**
+
+Now, to evaluate the GenPro accuracy, use `evaluate_tsv.py`, found in our fork of ContraPro.
+
+There are two main arguments to pay attention to: `-p`, which selects which pronouns to report accuracy over (a list, default "all", which reports on all of them), and `-d i j`, which selects the distances to use (inclusive). In the paper, we report accuracy on "all" with distance range 1..10:
+
+    # evaluate all pronouns at distances 1..10
+    paste genpro.max250.en-de.tsv out-doc.genpro.max250.en-de.tsv \
+    | ./ext/ContraPro/bin/evaluate_tsv.py -d 1 10 -p all
+
+    # accuracy of "sie" and "er", intrasentence:
+    paste genpro.max250.en-de.tsv out-doc.genpro.max250.en-de.tsv \
+    | ./ext/ContraPro/bin/evaluate_tsv.py -d 0 0 -p sie er
+
+What this script does: it grabs the correct pronoun, and then splits the system output on `<eos>`, looking in the last sentence for the correct pronoun, using whole-token, case-insensitive matching.
+
+## ContraWSD, GTWiC, discourse-mt-test-sets
+
+More details coming soon.
+
+## Citation
+
+If you make use of this code, please cite it as
+
+> @misc{post2023escaping,
+>      title={Escaping the sentence-level paradigm in machine translation},
+>      author={Matt Post and Marcin Junczys-Dowmunt},
+>      year={2023},
+>      eprint={2304.12959},
+>      archivePrefix={arXiv},
+>      primaryClass={cs.CL}
+>}