Code and data for the paper "Evaluating Morphological Generalisation in Machine Translation by Distribution-Based Compositionality Assessment"
Note: for updates, see https://github.com/aalto-speech/dbca
- scripts numbered 01-13 are meant to be run in succession
- run.sh provides examples of running the scripts
- exp/subset-d-1m/data contains the 1M-sentence-pair dataset
- exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz contain the data splits with different compound divergences and different random initialisations
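The id files are plain gzipped text with one sentence-pair id per line. A small stdlib-only sketch of reading them (the id format shown here is made up, not the repo's actual ids):

```python
import gzip
import os
import tempfile

def read_ids(path):
    """Read one sentence-pair id per line from a gzipped id file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demonstrate on a small stand-in file; the real files live under
# exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz.
tmp = os.path.join(tempfile.mkdtemp(), "ids_train.txt.gz")
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write("sent-000001\nsent-000002\n")
print(read_ids(tmp))
```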
- Data is from the Tatoeba Challenge data release (eng-fin set)
- Data filtering is done using OpusFilter
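For illustration, an OpusFilter configuration has the general shape below; the file names, filter choices, and thresholds are placeholders, not the filtering setup used in the paper.

```yaml
# Illustrative OpusFilter config (placeholder values, not the paper's setup)
steps:
  - type: filter
    parameters:
      inputs: [tatoeba.en.gz, tatoeba.fi.gz]
      outputs: [filtered.en.gz, filtered.fi.gz]
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LengthRatioFilter:
            unit: word
            threshold: 3
```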
- Morphological parsing is done using TNPP (the Turku neural parser pipeline); the resulting CoNLL-U output is read with a separate CoNLL-U parser
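CoNLL-U is a plain tab-separated format (ten fields per token line, blank lines between sentences, `#` comment lines), so the essentials of reading it can be sketched without the repo's actual parser; the example sentence and its analysis are only illustrative:

```python
# Minimal CoNLL-U reading sketch (not the parser used in the repo).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Split CoNLL-U text into sentences of {field: value} token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif not line.startswith("#"):
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences

sample = (
    "# text = Kissat juoksevat\n"
    "1\tKissat\tkissa\tNOUN\t_\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tjuoksevat\tjuosta\tVERB\t_\tNumber=Plur|Person=3\t0\troot\t_\t_\n"
)
parsed = parse_conllu(sample)
print(parsed[0][0]["lemma"])  # lemma of the first token
```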
- Data split algorithm uses PyTorch
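The split algorithm itself lives in the repo's PyTorch code; for illustration only, here is a dependency-free sketch of the divergence measure the DBCA framework is built on, where divergence between two frequency distributions is one minus a Chernoff coefficient (Keysers et al. use alpha = 0.5 for atom divergence and alpha = 0.1 for compound divergence). The toy counts below are made up:

```python
# D_alpha(P || Q) = 1 - sum_k p_k**alpha * q_k**(1 - alpha)
# (Chernoff-coefficient-based divergence, as in the DBCA framework;
# this is an illustration, not the repo's PyTorch implementation.)

def divergence(p_counts, q_counts, alpha):
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    coeff = sum((p_counts.get(k, 0) / p_total) ** alpha *
                (q_counts.get(k, 0) / q_total) ** (1 - alpha)
                for k in keys)
    return 1.0 - coeff

# Toy compound counts for a train and a test set
train = {"kissa+Nom": 5, "kissa+Gen": 3}
test = {"kissa+Nom": 2, "kissa+Par": 4}
print(divergence(train, test, alpha=0.1))
```

Identical distributions give divergence 0; distributions with disjoint support give divergence 1, and the small alpha used for compounds weights the test-side distribution more heavily.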
- Tokenisers are trained using sentencepiece
- Translation systems are trained with OpenNMT-py
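As a rough illustration of this step, an OpenNMT-py training configuration has the shape below; all paths and values are placeholders, not the settings used in the paper.

```yaml
# Illustrative OpenNMT-py config (placeholder paths and values)
save_data: run/example
src_vocab: run/vocab.src
tgt_vocab: run/vocab.tgt
data:
  corpus_1:
    path_src: data/train.sp.en
    path_tgt: data/train.sp.fi
  valid:
    path_src: data/valid.sp.en
    path_tgt: data/valid.sp.fi
encoder_type: transformer
decoder_type: transformer
train_steps: 100000
valid_steps: 5000
save_model: run/model
```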
- Evaluating translations is done with sacreBLEU