Skip to content

Commit

Permalink
Add instructions to reproduce Understanding Back-translation at Scale (
Browse files Browse the repository at this point in the history
…facebookresearch#1021)

Summary: Pull Request resolved: fairinternal/fairseq-py#1021

Differential Revision: D20077161

Pulled By: myleott

fbshipit-source-id: da7f38dbac9551f29a88be3f421f8e38d9a81133
  • Loading branch information
myleott authored and facebook-github-bot committed Feb 25, 2020
1 parent 2728f9b commit b152183
Show file tree
Hide file tree
Showing 8 changed files with 680 additions and 8 deletions.
249 changes: 249 additions & 0 deletions examples/backtranslation/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,255 @@ en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```

## Training your own model (WMT'18 English-German)

The following instructions can be adapted to reproduce the models from the paper.


#### Step 1. Prepare parallel data and optionally train a baseline (English-German) model

First download and preprocess the data:
```bash
# Download and prepare the data
cd examples/backtranslation/
bash prepare-wmt18en2de.sh
cd ../..

# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
--joined-dictionary \
--source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
--workers 20

# Copy the BPE code into the data-bin directory for future use
cp examples/backtranslation/wmt18_en_de/code data-bin/wmt18_en_de/code
```

(Optionally) Train a baseline model (English-German) using just the parallel data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel
fairseq-train --fp16 \
data-bin/wmt18_en_de \
--source-lang en --target-lang de \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 30000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```

Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
--inputs $CHECKPOINT_DIR \
--num-epoch-checkpoints 10 \
--output $CHECKPOINT_DIR/checkpoint.avg10.pt
```

Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 29.57, 60.9/35.4/22.9/15.5 (BP=1.000, ratio=1.014, syslen=63049, reflen=62152)
# compare to 29.46 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu though:
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 29.0 60.6/34.7/22.4/14.9 (BP = 1.000 ratio = 1.013 hyp_len = 62099 ref_len = 61287)
```


#### Step 2. Back-translate monolingual German data

Train a reverse model (German-English) to do the back-translation:
```bash
CHECKPOINT_DIR=checkpoints_de_en_parallel
fairseq-train --fp16 \
data-bin/wmt18_en_de \
--source-lang de --target-lang en \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 30000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```

Let's evaluate the back-translation (BT) model to make sure it is well trained:
```bash
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
de-en \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint_best.py
# BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 34.9 66.9/41.8/28.5/19.9 (BP = 0.983 ratio = 0.984 hyp_len = 63342 ref_len = 64399)
# compare to the best system from WMT'17 which scored 35.1: http://matrix.statmt.org/matrix/systems_list/1868
```

Next prepare the monolingual data:
```bash
# Download and prepare the monolingual data
# By default the script samples 25M monolingual sentences, which after
# deduplication should be just over 24M sentences. These are split into 25
# shards, each with 1M sentences (except for the last shard).
cd examples/backtranslation/
bash prepare-de-monolingual.sh
cd ../..

# Binarize each shard of the monolingual data
TEXT=examples/backtranslation/wmt18_de_mono
for SHARD in $(seq -f "%02g" 0 24); do \
fairseq-preprocess \
--only-source \
--source-lang de --target-lang en \
--joined-dictionary \
--srcdict data-bin/wmt18_en_de/dict.de.txt \
--testpref $TEXT/bpe.monolingual.dedup.${SHARD} \
--destdir data-bin/wmt18_de_mono/shard${SHARD} \
--workers 20; \
cp data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_de_mono/shard${SHARD}/; \
done
```

Now we're ready to perform back-translation over the monolingual data. The
following command generates via sampling, but it's possible to use greedy
decoding (`--beam 1`), beam search (`--beam 5`),
top-k sampling (`--sampling --beam 1 --sampling-topk 10`), etc.:
```bash
mkdir backtranslation_output
for SHARD in $(seq -f "%02g" 0 24); do \
fairseq-generate --fp16 \
data-bin/wmt18_de_mono/shard${SHARD} \
--path $CHECKPOINT_DIR/checkpoint_best.pt \
--skip-invalid-size-inputs-valid-test \
--max-tokens 4096 \
--sampling --beam 1 \
> backtranslation_output/sampling.shard${SHARD}.out; \
done
```

After BT, use the `extract_bt_data.py` script to re-combine the shards, extract
the back-translations and apply length ratio filters:
```bash
python examples/backtranslation/extract_bt_data.py \
--minlen 1 --maxlen 250 --ratio 1.5 \
--output backtranslation_output/bt_data --srclang en --tgtlang de \
backtranslation_output/sampling.shard*.out

# Ensure lengths are the same:
# wc -l backtranslation_output/bt_data.{en,de}
# 21795614 backtranslation_output/bt_data.en
# 21795614 backtranslation_output/bt_data.de
# 43591228 total
```

Binarize the filtered BT data and combine it with the parallel data:
```bash
TEXT=backtranslation_output
fairseq-preprocess \
--source-lang en --target-lang de \
--joined-dictionary \
--srcdict data-bin/wmt18_en_de/dict.en.txt \
--trainpref $TEXT/bt_data \
--destdir data-bin/wmt18_en_de_bt \
--workers 20

# We want to train on the combined data, so we'll symlink the parallel + BT data
# in the wmt18_en_de_para_plus_bt directory. We link the parallel data as "train"
# and the BT data as "train1", so that fairseq will combine them automatically
# and so that we can use the `--upsample-primary` option to upsample the
# parallel data (if desired).
PARA_DATA=$(readlink -f data-bin/wmt18_en_de)
BT_DATA=$(readlink -f data-bin/wmt18_en_de_bt)
COMB_DATA=data-bin/wmt18_en_de_para_plus_bt
mkdir -p $COMB_DATA
for LANG in en de; do \
ln -s ${PARA_DATA}/dict.$LANG.txt ${COMB_DATA}/dict.$LANG.txt; \
for EXT in bin idx; do \
ln -s ${PARA_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train.en-de.$LANG.$EXT; \
ln -s ${BT_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train1.en-de.$LANG.$EXT; \
ln -s ${PARA_DATA}/valid.en-de.$LANG.$EXT ${COMB_DATA}/valid.en-de.$LANG.$EXT; \
ln -s ${PARA_DATA}/test.en-de.$LANG.$EXT ${COMB_DATA}/test.en-de.$LANG.$EXT; \
done; \
done
```


#### 3. Train an English-German model over the combined parallel + BT data

Finally we can train a model over the parallel + BT data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel_plus_bt
fairseq-train --fp16 \
data-bin/wmt18_en_de_para_plus_bt \
--upsample-primary 16 \
--source-lang en --target-lang de \
--arch transformer_wmt_en_de_big --share-all-embeddings \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--max-tokens 3584 --update-freq 16 \
--max-update 100000 \
--save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```

Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
--inputs $CHECKPOINT_DIR \
--num-epoch-checkpoints 10 \
--output $CHECKPOINT_DIR/checkpoint.avg10.pt
```

Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 32.35, 64.4/38.9/26.2/18.3 (BP=0.977, ratio=0.977, syslen=60729, reflen=62152)
# compare to 32.35 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu:
bash examples/backtranslation/sacrebleu.sh \
wmt17 \
en-de \
data-bin/wmt18_en_de \
data-bin/wmt18_en_de/code \
$CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 31.5 64.3/38.2/25.6/17.6 (BP = 0.971 ratio = 0.971 hyp_len = 59515 ref_len = 61287)
```


## Citation
```bibtex
@inproceedings{edunov2018backtranslation,
Expand Down
41 changes: 41 additions & 0 deletions examples/backtranslation/deduplicate_lines.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#!/usr/bin/python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import fileinput
import hashlib
from multiprocessing import Pool
import sys


def get_hashes_and_lines(raw_line):
hash = hashlib.md5(raw_line).hexdigest()
return hash, raw_line


def main():
parser = argparse.ArgumentParser()
parser.add_argument('--workers', type=int, default=10)
parser.add_argument('files', nargs='*', help='input files')
args = parser.parse_args()

seen = set()
with fileinput.input(args.files, mode='rb') as h:
pool = Pool(args.workers)
results = pool.imap_unordered(get_hashes_and_lines, h, 1000)
for i, (hash, raw_line) in enumerate(results):
if hash not in seen:
seen.add(hash)
sys.stdout.buffer.write(raw_line)
if i % 1000000 == 0:
print(i, file=sys.stderr, end="", flush=True)
elif i % 100000 == 0:
print(".", file=sys.stderr, end="", flush=True)
print(file=sys.stderr, flush=True)


if __name__ == '__main__':
main()
59 changes: 59 additions & 0 deletions examples/backtranslation/extract_bt_data.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import fileinput

from tqdm import tqdm


def main():
parser = argparse.ArgumentParser(description=(
'Extract back-translations from the stdout of fairseq-generate. '
'If there are multiply hypotheses for a source, we only keep the first one. '
))
parser.add_argument('--output', required=True, help='output prefix')
parser.add_argument('--srclang', required=True, help='source language (extracted from H-* lines)')
parser.add_argument('--tgtlang', required=True, help='target language (extracted from S-* lines)')
parser.add_argument('--minlen', type=int, help='min length filter')
parser.add_argument('--maxlen', type=int, help='max length filter')
parser.add_argument('--ratio', type=float, help='ratio filter')
parser.add_argument('files', nargs='*', help='input files')
args = parser.parse_args()

def validate(src, tgt):
srclen = len(src.split(' ')) if src != '' else 0
tgtlen = len(tgt.split(' ')) if tgt != '' else 0
if (
(args.minlen is not None and (srclen < args.minlen or tgtlen < args.minlen))
or (args.maxlen is not None and (srclen > args.maxlen or tgtlen > args.maxlen))
or (args.ratio is not None and (max(srclen, tgtlen) / float(min(srclen, tgtlen)) > args.ratio))
):
return False
return True

def safe_index(toks, index, default):
try:
return toks[index]
except IndexError:
return default

with open(args.output + '.' + args.srclang, 'w') as src_h, \
open(args.output + '.' + args.tgtlang, 'w') as tgt_h:
for line in tqdm(fileinput.input(args.files)):
if line.startswith('S-'):
tgt = safe_index(line.rstrip().split('\t'), 1, '')
elif line.startswith('H-'):
if tgt is not None:
src = safe_index(line.rstrip().split('\t'), 2, '')
if validate(src, tgt):
print(src, file=src_h)
print(tgt, file=tgt_h)
tgt = None


if __name__ == '__main__':
main()
Loading

0 comments on commit b152183

Please sign in to comment.