Bilingual sentence similarity classifier based on optimising word alignments using TensorFlow.
This repo implements a sentence similarity classifier model using TensorFlow. Similarity classification is based on the ideas introduced by Carpuat et al., 2017, and is similar to Vyas et al., 2018, Schwenk, 2018 and Grégoire et al., 2018. The code borrows many of the concepts and the architecture presented in Legrand et al., 2016.
Details on the implementation and experiments are published in:
- MinhQuang Pham, Josep Crego, Jean Senellart and François Yvon. Fixing Translation Divergences in Parallel Corpora for Neural MT. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. October 31–November 4. Brussels, Belgium.
The next picture shows an example of similarity classification for the sentence pair:
What do you feel ? Not . ||| Que ressentez-vous ?
As can be seen, the model outputs:
- a matrix with alignment scores,
- word aggregation scores (shown next to each source/target word) and
- an overall sentence pair similarity score (+0.1201).
In the paper above we show that divergent sentences can be filtered out (using the sentence similarity score) and that some divergences can be fixed (following the alignment scores), in both cases leading to better accuracy than neural MT systems trained on the original corpora. For our experiments we used the English-French OpenSubtitles and the English-German Paracrawl corpora.
pip install -r requirements.txt
A Docker image integrating all requirements can also be built with:
docker build -t systran/similarity -f Dockerfile .
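The resulting image can then be used to run the scripts without installing the requirements locally. For instance (the mount point and command below are just an illustration, assuming python is available in the image):
docker run --rm -it -v $PWD:/work -w /work systran/similarity python -u src/similarity.py -h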
To learn our similarity model, it is best to first preprocess the training data with a tokenisation toolkit, mainly to reduce the vocabulary size. Any subtokenisation toolkit (such as BPE) can also be used. In our experiments we used the default tokenisation scheme implemented in OpenNMT, performing minimal tokenisation without subtokenisation. Any OpenNMT tokenisation can also be performed on the fly on the input data, given a JSON configuration file.
After tokenisation, the most frequent |Vs| source and |Vt| target words are considered to be part of the source and target vocabularies respectively. The remaining words are mapped to a special UNK token. In our experiments we used |Vs| = |Vt| = 50,000 words.
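As an illustration, assuming the tokenised source corpus is in corpus.tok.en and that the vocabulary file simply lists one word per line, the 50,000 most frequent source words could be extracted with:
cat corpus.tok.en | tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 50000 | awk '{print $2}' > vocab.en
(and similarly for the target side).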
Any initialisation of the source and target word embeddings can be used. In our experiments we initialised both source and target embeddings using fastText with |Es| = |Et| = 256 cells. Embeddings were further refined using MUSE. Note that pre-trained word embeddings are not required to learn the similarity model.
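For instance, fastText skip-gram embeddings with 256 cells could be trained on the tokenised corpora as follows (file names are illustrative; check that the resulting .vec files match the format expected by -src_emb and -tgt_emb):
fasttext skipgram -input corpus.tok.en -output emb.en -dim 256
fasttext skipgram -input corpus.tok.fr -output emb.fr -dim 256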
To generate some of the training examples we need word alignments and POS tags for the source sentences. In our experiments we used fast_align and Freeling to perform word alignment and English POS tagging respectively. Note that neither word alignments nor POS tags are needed to learn the similarity model itself.
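As an example, fast_align expects one sentence pair per line in 'source ||| target' format, so alignments could be produced along these lines (file names are illustrative):
paste corpus.tok.en corpus.tok.fr | awk -F'\t' '{print $1" ||| "$2}' > corpus.en-fr
fast_align -i corpus.en-fr -d -o -v > corpus.en-fr.align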
Once the parallel training corpus is preprocessed, we are ready to prepare our training examples:
python -u src/build_data.py
-seq_size INT : sentences larger than this number of src/tgt words are filtered out [50]
-max_sents INT : Consider this number of sentences per batch (0 for all) [0]
-seed INT : seed for randomness [1234]
-shuffle : shuffle data
-debug : debug mode
-h : this help
* -data FILE : training data
-mode STRING : how data examples are generated (p: parallel, u: uneven, i: insert, r: replace, d: delete) [p]
-replace FILE : equivalent sequences (needed when -mode contains r)
+ Options marked with * must be set. The rest have default values.
The input data file contains one sentence pair per line, with the following fields separated by TABs:
- source sentence
- target sentence
- source/target alignments
- source part-of-speeches
Why wait for the Euro ? Pourquoi attendre l' Euro ? 0-0 1-1 2-1 3-2 4-3 5-4 WRB VB IN DT NNP .
(The last two fields are optional.)
Alternatively, the -data path can consist of multiple files separated by commas.
Available modes:
- 'p': parallel sentences
Why wait for the Euro ? Pourquoi attendre l' Euro ? -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
Parallel sentences from the bitext.
- 'u': uneven sentences
Why wait for the Euro ? Cela peut donc se produire . 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Uneven sentences from the bitext.
- 'i': insert sentence
Why wait for the Euro ? Pourquoi attendre l' Euro ? Il existe un précédant . -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0 1.0 1.0
The sentence 'Il existe un précédant .' has been inserted at the end of the original target sentence.
- 'd': delete sequence
Why wait for the Euro ? l' Euro ? 1.0 1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
The sequence 'Pourquoi attendre' has been deleted from the original target sentence.
(needs word alignments in the -data FILE)
- 'r': replace sequence with equivalent part-of-speech
Where wait for the Euro ? Pourquoi attendre l ' Euro ? 1.0 -1.0 -1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 -1.0 -1.0
The sequence 'Why' in the original source sentence has been replaced by 'Where', which has the same POS tag.
(needs word alignments and source POS-tags in -data FILE and equivalent sequences in -replace FILE)
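As an illustration, a hypothetical invocation generating parallel, insert and replace examples could look as follows (the exact combination of mode letters and the way the generated examples are written out may differ in your setup):
python -u src/build_data.py -data corpus.en-fr.prep -mode pir -replace equivalents.txt -shuffle > train.examples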
python -u src/similarity.py
* -mdir FILE : directory to save/restore models
-seq_size INT : sentences larger than this number of src/tgt words are filtered out [50]
-batch_size INT : number of examples per batch [32]
-seed INT : seed for randomness [1234]
-debug : debug mode
[LEARNING OPTIONS]
* -trn FILE : training data
-dev FILE : validation data
-src_tok FILE : if provided, JSON file with OpenNMT tokenization options (it also points to the vocabulary file)
-src_voc FILE : vocabulary of src words (needed to initialize learning)
-tgt_tok FILE : if provided, JSON file with OpenNMT tokenization options (it also points to the vocabulary file)
-tgt_voc FILE : vocabulary of tgt words (needed to initialize learning)
-src_emb FILE : embeddings of src words (needed to initialize learning)
-tgt_emb FILE : embeddings of tgt words (needed to initialize learning)
-src_emb_size INT : size of src embeddings if -src_emb not used
-tgt_emb_size INT : size of tgt embeddings if -tgt_emb not used
-src_lstm_size INT : hidden units for src bi-lstm [256]
-tgt_lstm_size INT : hidden units for tgt bi-lstm [256]
-lr FLOAT : initial learning rate [1.0]
-lr_decay FLOAT : learning rate decay [0.9]
-lr_method STRING : GD method either: adam, adagrad, adadelta, sgd, rmsprop [adagrad]
-aggr TYPE : aggregation operation: sum, max, lse [lse]
-r FLOAT : r for lse [1.0]
-dropout FLOAT : dropout ratio [0.3]
-mode STRING : mode (alignment, sentence) [alignment]
-max_sents INT : Consider this number of sentences per batch (0 for all) [0]
-n_epochs INT : train for this number of epochs [1]
-report_every INT : report every this many batches [1000]
+ Options marked with * must be set. The rest have default values.
+ If -mdir exists in learning mode, learning continues after restoring the last model
+ Training data is shuffled at every epoch
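For instance, training could be launched as follows (all file names are hypothetical):
python -u src/similarity.py -mdir model_dir -trn train.examples -dev dev.examples -src_voc vocab.en -tgt_voc vocab.fr -src_emb emb.en.vec -tgt_emb emb.fr.vec -aggr lse -n_epochs 5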
python -u src/similarity.py
* -mdir FILE : directory to save/restore models
-batch_size INT : number of examples per batch [32]
-seed INT : seed for randomness [1234]
-debug : debug mode
[INFERENCE OPTIONS]
-epoch INT : epoch to use ([mdir]/epoch[epoch] must exist, by default the latest one in mdir)
* -tst FILE : testing data
-output FILE : output file [by default '-', i.e. STDOUT]
-q : quiet mode, just output similarity score
-show_matrix : output formatted alignment matrix (mode must be alignment)
-show_svg : output alignment matrix using svg-like html format (mode must be alignment)
-show_align : output source/target alignment matrix (mode must be alignment)
-show_last : output source/target last vectors
-show_aggr : output source/target aggr vectors
+ Options marked with * must be set. The rest have default values.
+ -show_last, -show_aggr and -show_align can be used at the same time
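For instance, sentence pairs could be scored and their alignments output as follows (file and directory names are hypothetical):
python -u src/similarity.py -mdir model_dir -tst test.en-fr -show_align > test.align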
If the files tokenization_src.json or tokenization_tgt.json are found in the model directory, the corresponding OpenNMT tokenization and sub-tokenization is performed on the fly, for instance:
{
"mode": "aggressive",
"vocabulary": "vocab.en"
}
python -u src/fix.py [-tau INT] [-nbest INT] [-max_sim FLOAT] [-use_punct] < FILE_WITH_ALIGNMENTS
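For example (the option values below are purely illustrative, and the input is assumed to have been produced by the inference step above with -show_align):
python -u src/fix.py -tau 0.5 -nbest 2 < test.align > test.fixed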