Skip to content

Commit

Permalink
Added README
Browse files Browse the repository at this point in the history
  • Loading branch information
mjpost committed May 27, 2023
1 parent e30a130 commit d3d4d55
Showing 1 changed file with 135 additions and 1 deletion.
136 changes: 135 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1 +1,135 @@
Coming soon (really).
# DocMT code and data

This repository contains code and data used for the experiments in the following paper:

> Escaping the sentence-level paradigm in machine translation
> Matt Post, Marcin Junczys-Dowmunt
> https://arxiv.org/abs/2304.12959v1
## Setup

Install the submodules, which are edits we made to existing repositories:

git submodule init
git submodule update

This will install our forks of five public repositories under `./ext`.
These are:

* `ext/ContraPro` ([ContraPro](https://github.com/ZurichNLP/ContraPro)
* `ext/ContraPro-EN-FR` ([Large-contrastive-pronoun-testset-EN-FR](https://github.com/rbawden/Large-contrastive-pronoun-testset-EN-FR))
* `ext/ContraWSD` ([ContraWSD](https://github.com/ZurichNLP/ContraWSD)
* `ext/GTWiC` ([GTWiC](https://github.com/lena-voita/good-translation-wrong-in-context))
* `ext/discourse-mt-test-sets` ([discourse-mt-test-sets](https://github.com/rbawden/discourse-mt-test-sets))

## Scripts

Under `bin/` are two scripts:

* `pack.py`: Takes a TSV stream of (sentence,docid) and assembles single-line documents with sentences joined by a delimiter (` <eos>` by default)
* `extract_sent.py`: Takes a single-line assembled document and extracts the requested sentence.

## ContraPro (English--German)

**Generating the files**

Using our modified ContraPro dataset, you first need to change to that directory and run the download script to download OpenSubs. Then run:

# maximum 250 tokens, en-de
./ext/ContraPro/bin/json2text.py -m 250 --spm /path/to/spm/model --json-file ContraPro/contrapro.json > contrapro.en-de.tsv

This will print the file to STDOUT: 36,031 lines of correct and contrastive lines. One of the fields has a value of "correct" or "contrastive"; for generative results, use only the former. You can output only these using the `--correct-only` flag.

For French, you need to pass `-0` (since its JSON file uses 0-indexing), along with some other options.

# max 10 sents, en-fr
cd ext/ContraPro-EN-FR/OpenSubs
../../ContraPro/bin/json2text.py --dir ./documents --json-file testset-en-fr.json -s en -t fr -m 250 -ms 10 --spm /path/to/spm/model -0 --correct-only > genpro.max250+10.en-fr.tsv

There are many other options:
-h, --help show this help message
and exit
--source SOURCE, -s SOURCE
--target TARGET, -t TARGET
--dir DIR, -d DIR
--max-sents MAX_SENTS, -ms MAX_SENTS
Maximum number of
context sentences
--max-tokens MAX_TOKENS, -m MAX_TOKENS
Maximum length in
subword tokens
--sents-before SENTS_BEFORE, -sb SENTS_BEFORE
Num sentences previous
context
--tokens-before TOKENS_BEFORE, -tb TOKENS_BEFORE
Num tokens in previous
context
--separator SEPARATOR
--spm SPM
--zero, -0 indices are already
zeroed (French)
--offset OFFSET Add this number to each
segment ID
--correct-only only output correct
lines
--json-file JSON_FILE, -j JSON_FILE

**File format**

Each file is a TSV with the following fields:
- index (0-based) of the "payload" sentence
- distance of anaphora
- whether the sentence is a correct or contrastive variant
- the correct pronoun
- the complete source sentence with context
- the complete reference sentence with context

**Translating**

You just want the fourth field.

cut -f 4 genpro.max250.en-de.tsv \
| your-decoder [args] \
> out-doc.genpro.max250.en-de.tsv

To translate just sentences without context, use `./bin/extract_sent.py`, which extracts a sentence from a line, with `<eos>` tags as the delimiter:

# grab the source field, get the last sentence, translate
export PATH=$PATH:$(pwd)/bin
cut -f 4 genpro.max250.en-de.tsv \
| extract_sent.py -i -1 \
| your-decoder [args] \
> out-sent.genpro.max250.en-de.tsv

**Evaluation**

Now, to evaluate the GenPro accuracy, use `evaluate_tsv.py`, found in our fork of ContraPro.

There are two main arguments to pay attention to: `-p`, which selects which pronouns to report accuracy over (a list, default "all", which reports on all of them), and `-d i j`, which selects the distances to use (inclusive). In the paper, we report accuracy on "all" with distance range 1..10:

# evaluate all pronouns at distances 1..10
paste genpro.max250.en-de.tsv out-doc.genpro.max250.en-de.tsv \
| ./ext/ContraPro/bin/evaluate_tsv.py -d 1 10 -p all

# accuracy of "sie" and "er", intrasentence:
paste genpro.max250.en-de.tsv out-doc.genpro.max250.en-de.tsv \
| ./ext/ContraPro/bin/evaluate_tsv.py -d 0 0 -p sie er

What this script does: it grabs the correct pronoun, and then splits the system output on `<eos>`, looking in the last sentence for the correct pronoun, using whole-token, case-insensitive matching.

## ContraWSD, GTWiC, discourse-mt-test-sets

More details coming soon.

## Citation

If you make use of this code, please cite it as

> @misc{post2023escaping,
> title={Escaping the sentence-level paradigm in machine translation},
> author={Matt Post and Marcin Junczys-Dowmunt},
> year={2023},
> eprint={2304.12959},
> archivePrefix={arXiv},
> primaryClass={cs.CL}
>}

0 comments on commit d3d4d55

Please sign in to comment.