While developing the NMT system, we noticed that at translation time the byte pair encoding (BPE) segmentation tool can create sub-word units that are out-of-vocabulary for the NMT system, even though they are parts of longer words. This usually happens when users translate content in the wrong language or when something is given in the original language as an example (e.g., the writing of person names in brackets). To restrict the NMT model (i.e., to keep it from splitting unknown words into multiple parts), we adapted the AmuNMT decoder so that after byte pair encoding the split unknown words are merged back together. Our testing results show that this improves out-of-vocabulary word handling in the NMT model.
The AmuNMT decoder was modified to address the following two problems:
- The first modification deals with word alignment. The AmuNMT decoder can return a soft word alignment matrix, i.e., the probability of each target word being aligned to each source word. But since the source text is split into sub-word units before translation (using the BPE algorithm to divide words into smaller parts), the alignment matrix is also returned for source and target sub-word units. Obviously, a sub-word alignment matrix is no longer useful after the segments have been merged back into full words. Therefore, the AmuNMT decoder was modified to merge the rows and columns of the alignment matrix so that they correspond to full words instead of sub-word units. Each row holds the alignment probabilities of one target word and sums to one; therefore, when merging rows, the average is calculated. Columns hold the alignment probabilities of the source words; when merging source segments (i.e., two columns), the values are summed, as they jointly contribute to the total probability for each target word (a Python sketch of this merging follows the list).
- The second set of modifications deals with unknown words. A neural network can only be trained with a limited dictionary size. The BPE mechanism is used to deal with rarely seen words: they are split into smaller sub-word units, which easily fit within the limited dictionary size. In the worst case, words are split into individual letters (if the alphabet is known to the BPE model), and a given language usually has very few letters. But the BPE word-splitting algorithm cannot handle all unknown words. For example, someone could try to translate words in a foreign alphabet; since that alphabet was not used in the translation system's training data, it would also not be present in the system's sub-word unit dictionary. Internally, the AmuNMT decoder uses a specific UNK word placeholder for each unseen sub-word unit. But UNKs are rarely seen in training data and, when they are, they appear in very different contexts. This may result in the system returning quite random output when asked to translate unknown words. It would be much better for the end user if the translation system returned unknown words unchanged, in their original form (especially if the words are written in an alphabet different from the source language), instead of producing wrong translations.
  - The first modification for unknown words makes AmuNMT treat the whole word as an unknown word if any of its sub-word units is not found in the dictionary (see the second sketch after this list). It deals with the problem that a word in a foreign alphabet (or a word split using a BPE segment of the target language that does not exist in the source language) was split into segments that include one or more UNK segments.
  - The second modification replaces the AmuNMT unknown word placeholder (UNK) with a placeholder that is seen many times in the translation system's training data and is usually translated to the same placeholder without any changes. Our data processing workflow treats non-translatable tokens, for instance various identifiers found in texts, specially. Therefore, we used the placeholder βIDβ to also translate unknown words. After translation, the placeholders are replaced with the original unknown words; for this, we use the word alignment matrix to find which placeholder in the translation corresponds to which placeholder in the source sentence (see the third sketch after this list). Thus, if the system is trained to always translate a specific placeholder as the same placeholder, that placeholder can be used to copy unknown words from the source side to the target side without any changes.
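The row and column merging described in the first list item can be illustrated with a short Python sketch. This is not the actual AmuNMT C++ implementation; the function names and the conventional "@@" BPE continuation marker are assumptions made for the example:

```python
import numpy as np

def word_spans(subwords):
    """Group consecutive sub-word indices that belong to one full word."""
    spans, current = [], []
    for i, piece in enumerate(subwords):
        current.append(i)
        if not piece.endswith("@@"):   # "@@" marks a non-final BPE piece
            spans.append(current)
            current = []
    if current:                        # guard against a dangling split
        spans.append(current)
    return spans

def merge_alignment(matrix, trg_subwords, src_subwords):
    """matrix[t, s] = probability that target sub-word t aligns to source sub-word s."""
    # Target side (rows): averaging keeps each merged row summing to one.
    rows = np.stack([matrix[span].mean(axis=0)
                     for span in word_spans(trg_subwords)])
    # Source side (columns): summing pools the probability mass that the
    # pieces of one source word jointly contribute to each target word.
    return np.stack([rows[:, span].sum(axis=1)
                     for span in word_spans(src_subwords)], axis=1)

# "hou@@ se" (one source word in two pieces) aligned to one target word:
m = np.array([[0.6, 0.4]])
print(merge_alignment(m, ["maja"], ["hou@@", "se"]))  # [[1.]]
```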
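The whole-word rule of the first unknown-word modification can be sketched in the same spirit. Here `vocab` stands for the sub-word unit dictionary, and the returned map remembers the surface forms of unknown words by source word position for the later restoration step; all names are illustrative, not AmuNMT internals:

```python
def mark_unknown_words(subwords, vocab, unk="UNK"):
    """Emit a single UNK for any word whose sub-word units are not all
    in the dictionary; remember its surface form by source word index."""
    out, originals, word, widx = [], {}, [], 0

    def flush():
        nonlocal widx
        if not word:
            return
        if all(p in vocab for p in word):
            out.extend(word)           # every piece known: keep the BPE split
        else:                          # any piece unknown: collapse to one UNK
            originals[widx] = "".join(p[:-2] if p.endswith("@@") else p
                                      for p in word)
            out.append(unk)
        word.clear()
        widx += 1

    for piece in subwords:
        word.append(piece)
        if not piece.endswith("@@"):   # this piece closes the current word
            flush()
    flush()                            # handle any dangling split
    return out, originals

tokens, originals = mark_unknown_words(
    ["the", "Чер@@", "нов", "case"], vocab={"the", "case"})
print(tokens, originals)  # ['the', 'UNK', 'case'] {1: 'Чернов'}
```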
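Restoring the original words after translation can then be sketched as follows, assuming the word-level alignment matrix from the first sketch and the `originals` map from the second (in the actual workflow, UNK is substituted with βIDβ before decoding so that the model sees a familiar token):

```python
import numpy as np

def restore_placeholders(trg_words, alignment, originals, placeholder="βIDβ"):
    """Replace each placeholder in the translation with the original
    unknown word taken from the aligned source position."""
    restored = []
    for t, word in enumerate(trg_words):
        if word == placeholder:
            s = int(np.argmax(alignment[t]))  # most probable source word
            restored.append(originals.get(s, word))
        else:
            restored.append(word)
    return restored
```

Taking the argmax of each target row is one simple way to match placeholders when several occur in a sentence; the decoder's actual matching logic may differ.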
The changes to the original decoder can be found in the tilde-nlp fork of AmuNMT.