Code and data for the paper "Evaluating Morphological Generalisation in Machine Translation by Distribution-Based Compositionality Assessment"
Note: for updates, see https://github.com/aalto-speech/dbca
- scripts numbered 01-13 are meant to be run in succession
- run.sh provides examples of running the scripts
- exp/subset-d-1m/data contains the 1M-sentence-pair dataset
- exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz contain the data splits with different compound divergences and different random initialisations
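The id files are plain gzipped text with one sentence-pair id per line. A small stdlib-only sketch of reading them (the id format shown here is made up, not the repo's actual ids):

```python
import gzip
import os
import tempfile

def read_ids(path):
    """Read one sentence-pair id per line from a gzipped id file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demonstrate on a small stand-in file; the real files live under
# exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz.
tmp = os.path.join(tempfile.mkdtemp(), "ids_train.txt.gz")
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write("sent-000001\nsent-000002\n")
print(read_ids(tmp))
```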
- Data is from the Tatoeba Challenge data release (eng-fin set)
- Data filtering is done using OpusFilter
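For illustration, an OpusFilter configuration has the general shape below; the file names, filter choices, and thresholds are placeholders, not the filtering setup used in the paper.

```yaml
# Illustrative OpusFilter config (placeholder values, not the paper's setup)
steps:
  - type: filter
    parameters:
      inputs: [tatoeba.en.gz, tatoeba.fi.gz]
      outputs: [filtered.en.gz, filtered.fi.gz]
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LengthRatioFilter:
            unit: word
            threshold: 3
```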
- Morphological parsing is done using TNPP (the Turku neural parser pipeline); the resulting CoNLL-U output is read with a separate CoNLL-U parser
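CoNLL-U is a plain tab-separated format (ten fields per token line, blank lines between sentences, `#` comment lines), so the essentials of reading it can be sketched without the repo's actual parser; the example sentence and its analysis are only illustrative:

```python
# Minimal CoNLL-U reading sketch (not the parser used in the repo).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Split CoNLL-U text into sentences of {field: value} token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif not line.startswith("#"):
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences

sample = (
    "# text = Kissat juoksevat\n"
    "1\tKissat\tkissa\tNOUN\t_\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tjuoksevat\tjuosta\tVERB\t_\tNumber=Plur|Person=3\t0\troot\t_\t_\n"
)
parsed = parse_conllu(sample)
print(parsed[0][0]["lemma"])  # lemma of the first token
```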
- Data split algorithm uses PyTorch
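The split algorithm itself lives in the repo's PyTorch code; for illustration only, here is a dependency-free sketch of the divergence measure the DBCA framework is built on, where divergence between two frequency distributions is one minus a Chernoff coefficient (Keysers et al. use alpha = 0.5 for atom divergence and alpha = 0.1 for compound divergence). The toy counts below are made up:

```python
# D_alpha(P || Q) = 1 - sum_k p_k**alpha * q_k**(1 - alpha)
# (Chernoff-coefficient-based divergence, as in the DBCA framework;
# this is an illustration, not the repo's PyTorch implementation.)

def divergence(p_counts, q_counts, alpha):
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    coeff = sum((p_counts.get(k, 0) / p_total) ** alpha *
                (q_counts.get(k, 0) / q_total) ** (1 - alpha)
                for k in keys)
    return 1.0 - coeff

# Toy compound counts for a train and a test set
train = {"kissa+Nom": 5, "kissa+Gen": 3}
test = {"kissa+Nom": 2, "kissa+Par": 4}
print(divergence(train, test, alpha=0.1))
```

Identical distributions give divergence 0; distributions with disjoint support give divergence 1, and the small alpha used for compounds weights the test-side distribution more heavily.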
- Tokenisers are trained using sentencepiece
- Translation systems are trained with OpenNMT-py
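As a rough illustration of this step, an OpenNMT-py training configuration has the shape below; all paths and values are placeholders, not the settings used in the paper.

```yaml
# Illustrative OpenNMT-py config (placeholder paths and values)
save_data: run/example
src_vocab: run/vocab.src
tgt_vocab: run/vocab.tgt
data:
  corpus_1:
    path_src: data/train.sp.en
    path_tgt: data/train.sp.fi
  valid:
    path_src: data/valid.sp.en
    path_tgt: data/valid.sp.fi
encoder_type: transformer
decoder_type: transformer
train_steps: 100000
valid_steps: 5000
save_model: run/model
```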
- Evaluating translations is done with sacreBLEU