Thank you for publishing your code and data. It is well organized and easy to follow. 👍
I have trained a sentence-level Transformer on the context-agnostic training data and successfully reproduced the BLEU score (33.91 in the EMNLP 2019 paper) on the context-aware test set (removing BPE and '_eos', lowercasing, and treating each group of 4 segments as one long sentence).
However, I found that "test.dst" in the Docrepair dataset differs from "test.ru" in the context-aware dataset.
The first line of "test.dst" in the Docrepair dataset:
вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку . _eos да ... _eos я не верю в это . _eos она слишком свободная на мне , чувак .
The first line of "test.ru" in the context-aware dataset:
Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку . _eos Да ... _eos Я не верю в это . _eos Она слишком свободная на мне , чувак .
Apart from the lowercasing, "test.dst" in the Docrepair dataset contains many " `" marks that split some tokens (e.g., "уродскую" in the first line).
I would like to know:
Which reference is correct?
Does the Docrepair dataset use a different tokenization from the context-aware dataset?
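For what it's worth, the two first lines seem to coincide once case is ignored and the backticks are removed. Here is the quick check I ran; note that treating " `" as a subword-continuation marker is my own assumption, not something documented in the repo:

```python
# Sanity check: do test.dst and test.ru agree after normalization?
# Assumption (mine): " `" in test.dst marks a subword split and can be
# deleted to rejoin the token, e.g. "урод `скую" -> "уродскую".

def normalize(line):
    line = line.replace(" `", "")  # rejoin suspected subword splits
    return line.lower().strip()    # ignore case differences

dst = ("вчера ночью кто-то вломился в мой дом и украл эту урод `скую "
       "футболку . _eos да ... _eos я не верю в это . _eos она слишком "
       "свободная на мне , чувак .")
ru = ("Вчера ночью кто-то вломился в мой дом и украл эту уродскую "
      "футболку . _eos Да ... _eos Я не верю в это . _eos Она слишком "
      "свободная на мне , чувак .")

print(normalize(dst) == normalize(ru))  # True: only case and " `" differ
```

If this holds for the whole file, the two references would be equivalent up to tokenization, but it would be good to have the authors confirm which one matches the evaluation setup in the paper.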
Looking forward to your reply. :)