Figure-1: Shows a Cuneiform inscription, extracted from actual tablets.
Sumerian: pisan-dub-ba sza3-bi su-ga sag-nig2-gur11-ra u3 zi-ga lu2-kal-la i3-gal2 ...
English: Basket-of-tablets: therefroms, restitutions, debits, and credits, of Lukalla are here; ...

Sumerian-English Neural Machine Translation

As part of the MTAAC project at CDLI, we aim to build an end-to-end NMT pipeline that makes use of the extensive monolingual Sumerian data.

Previous models for English<-->Sumerian translation have used only the available parallel corpora. At present we have only about 50K extracted sentences for both languages in the parallel corpus, whereas the Sumerian monolingual corpus contains around 1.47M sentences.

This large amount of monolingual data can be used to improve the NMT system through techniques like Back Translation, Transfer Learning and Dual Learning, which have proved especially useful for low-resource languages like Sumerian that have only a limited amount of parallel data. We also plan to implement models like XLM and MASS for the same purpose.
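
To illustrate the idea behind Back Translation, here is a minimal sketch of the data augmentation step: monolingual Sumerian lines are translated by an existing model and the resulting synthetic pairs are appended to the parallel data. The function `augment_with_back_translation`, the `toy_translate` stand-in and its two-entry lexicon are hypothetical and only serve the example; in this repository the corresponding steps are handled by the scripts under `translation/backtranslation-onmt/`.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Sumerian transliteration, English)


def augment_with_back_translation(
    parallel: List[Pair],
    monolingual_sumerian: List[str],
    translate_sum_to_en: Callable[[str], str],
) -> List[Pair]:
    """Return the parallel data plus synthetic pairs built from monolingual Sumerian."""
    synthetic = []
    for sum_line in monolingual_sumerian:
        english = translate_sum_to_en(sum_line)
        # Skip lines for which the model produced nothing usable.
        if english.strip():
            synthetic.append((sum_line, english))
    return parallel + synthetic


# Hypothetical stand-in for a trained Sumerian-to-English model.
TOY_LEXICON = {"sze-ba": "barley rations", "ugula": "foreman"}


def toy_translate(line: str) -> str:
    return " ".join(TOY_LEXICON[tok] for tok in line.split() if tok in TOY_LEXICON)


parallel = [("ugula", "foreman")]
monolingual = ["sze-ba ugula", "kiszib3"]
augmented = augment_with_back_translation(parallel, monolingual, toy_translate)
for sumerian, english in augmented:
    print(sumerian, "->", english)
```

In an iterative setup, the model retrained on the augmented data would be used to translate the next shard of monolingual text.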

Requirements

- Python 3.5.2 or higher
- NumPy
- Pandas
- PyTorch
- Torch Text
- OpenNMT-py
- fairseq

Repository Structure

|__ translation/ --> all translation models used for Sumerian-English Translation 
        |__ transformer/ --> Supervised NMT using Vanilla Transformer
                |__ runTransformerSumEn.sh --> to perform training
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ backtranslation/ --> fairseq usage for Back Translation using Vanilla Transformers
        |__ backtranslation-onmt/ --> OpenNMT usage for Back Translation using Vanilla Transformers
                |__ backtranslateONMT.py --> to translate all Sumerian Text in a given shard using weights from the previous iteration
                |__ stack.py --> To stack the backtranslated sentences to the parallel corpora for training
                |__ runTransformerSumEn.sh --> To retrain the transformer model using the updated parallel data from the last step
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ XLM/ --> Unsupervised and Semi-Supervised NMT using Cross-Lingual Language Model Pretraining
                |__ XLM/ --> directory containing all model, data preparation and inference scripts
                |__ models.txt --> lists the possible commands and parameter combinations for XLM training and inference.
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-unmt/ --> Unsupervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-snmt/ --> Semi-Supervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.

|__ dataset/ --> All Sumerian Language related textual dataset by CDLI
        |__ README.md --> Gives detailed description of the dataset and the different sub-folders.
        |__ dataToUse/ --> Contains all the parallel data divided among training, test and dev sets, in 4 different categories
                |__ UrIIICompSents/ --> UrIII Admin Data with complete sentence translations
                |__ AllCompSents/ --> All kinds of Sumerian Data with complete sentence translations
                |__ UrIIILineByLine/ --> UrIII Admin Data with line by line translations
                |__ AllLineByLIne/ --> All kinds of Sumerian Data with line by line translations
        |__ cleaned/ --> Contains data after cleaning using the helper scripts, including the monolingual data. Divided in the same 4 categories.
        |__ original/ --> Contains all of the data before cleaning
        |__ oldFormat/ --> Contains data from last year, for comparison

Refer to the README of each folder and sub-folder for full details and for the steps to reproduce the translation models.

Results

Table-1: Sumerian-English Machine Translation.
All numeric values other than those in Human Evaluation represent the BLEU Score.
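
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, a single reference per sentence and no smoothing. It is not the evaluation script used to produce Table-1; reported BLEU scores depend on tokenization and tooling, which are not specified here.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, a single zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = ["barley rations of the female weavers"]
ref = ["barley rations of the female weavers under seal"]
score = corpus_bleu(hyp, ref)
print(round(score, 2))
```

Here every n-gram of the hypothesis appears in the reference, so the score is lowered only by the brevity penalty for the shorter output.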

Visualisations and Interpretations

Figure-2: Selected output tokens for the Sumerian input text "sze-ba geme2 usz-bar kiszib3 ur-dasznan ugula", which translates to "barley rations of the female weavers under seal of UrAnan the foreman".

Figure-3: Feature Ablation and Attention Attributions, respectively, for a span of input and output text through the Data Augmented XLM

Mentors:

Niko Schenk
Ravneet Punia

Tasks:

- Preparing the parallel and monolingual texts for final usage, using methods like BPE and BBPE to tokenize the text
- Implementing the Vanilla Transformer for Sumerian to English as well as English to Sumerian
- Back Translation using Sumerian monolingual data
- Transfer Learning from pre-trained models of other languages
- XLM for Unsupervised NMT
- XLM for Semi-Supervised NMT
- MASS for Unsupervised NMT
- MASS for Semi-Supervised NMT
- Pre-training using Augmented Data
- Interpretation of the NMT Models
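
The BPE tokenization mentioned in the first task can be sketched as follows. This is a minimal, pure-Python version of the merge-learning procedure on a toy corpus, assuming character-level initial symbols and an end-of-word marker `</w>`; the function names `learn_bpe`, `merge_pair` and `apply_bpe` are made up for the example, and the repository's own data preparation scripts may rely on an existing BPE tool instead.

```python
from collections import Counter
from typing import List, Tuple

Symbols = Tuple[str, ...]


def merge_pair(symbols: Symbols, pair: Tuple[str, str]) -> Symbols:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge operations from whitespace-separated text."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair and merge it everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_corpus = ["sze-ba geme2 usz-bar", "sze-ba ugula", "kiszib3 ugula"]
merges = learn_bpe(toy_corpus, num_merges=10)
print(apply_bpe("sze-ba", merges))
```

With more merges, frequent transliterated signs such as `sze-ba` end up as single units, while rare words stay split into smaller pieces.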

...

For an end-to-end translation pipeline making use of translation models from this repository, refer to the cdli-gh/Sumerian_Translation-Pipeline project, where you can give an ATF file containing Sumerian sentences as input and get an ATF file with corresponding English translations as the output.
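
As a rough illustration of the input side of such a pipeline, the sketch below pulls the transliteration lines out of ATF text. It assumes the common ATF convention that text lines start with a line number followed by a period (for example `1. pisan-dub-ba`), while lines starting with `&`, `@`, `#` or `$` carry metadata, structure or comments. The function name, the regular expression and the placeholder tablet header are assumptions for this example and are not taken from the pipeline project.

```python
import re
from typing import List

# A text line: a line number, an optional prime, a period, then the transliteration.
TEXT_LINE = re.compile(r"^\d+'?\.\s+(.*)$")


def extract_transliterations(atf_text: str) -> List[str]:
    """Return the transliteration part of every numbered text line."""
    lines = []
    for raw in atf_text.splitlines():
        match = TEXT_LINE.match(raw.strip())
        if match:
            lines.append(match.group(1).strip())
    return lines


# Placeholder ATF snippet; the header line is not a real catalogue entry.
sample_atf = """&P000000 = placeholder tablet
#atf: lang sux
@tablet
@obverse
1. pisan-dub-ba
2. sza3-bi su-ga
"""

sumerian_lines = extract_transliterations(sample_atf)
print(sumerian_lines)
```

The extracted lines are the kind of input a translation model from this repository would receive; how the pipeline writes translations back into the ATF file is not covered here.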
