
Differences between test.dst of Context-aware dataset and Docrepair dataset #9

Open
xc-kiwiberry opened this issue Jul 29, 2020 · 0 comments


xc-kiwiberry commented Jul 29, 2020

Dear authors,

Thank you for publishing your code and data. It is well organized and easy to follow. 👍

I trained a sentence-level Transformer on the context-agnostic training data and successfully reproduced the BLEU score (33.91 in the EMNLP 2019 paper) on the context-aware test set (after removing BPE and '_eos', lowercasing, and treating each group of 4 segments as one long sentence).
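For reference, the preprocessing described above can be sketched roughly as follows. This is only an illustration, not the authors' script: the function name `prepare_for_bleu` is hypothetical, and the subword marker (a space followed by a backtick) is an assumption that should be adjusted to whatever BPE format the model output actually uses.

```python
def prepare_for_bleu(line: str, bpe_marker: str = " `", eos: str = "_eos") -> str:
    """Turn one multi-segment line into a single lowercased sentence."""
    # Rejoin subword pieces. The marker is an assumption; change it to
    # match the BPE format of your own files.
    line = line.replace(bpe_marker, "")
    # Drop the segment separators so the segments form one long sentence,
    # and collapse any repeated whitespace along the way.
    tokens = [tok for tok in line.split() if tok != eos]
    return " ".join(tokens).lower()


if __name__ == "__main__":
    print(prepare_for_bleu("Hello wor `ld . _eos Yes ... _eos I do not believe it ."))
```

The same function would be applied to both the hypothesis and the reference before scoring.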

However, I found that "test.dst" in the Docrepair dataset differs from "test.ru" in the context-aware dataset.

The first line of "test.dst" in the Docrepair dataset:

вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку . _eos да ... _eos я не верю в это . _eos она слишком свободная на мне , чувак .

The first line of "test.ru" in the context-aware dataset:

Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку . _eos Да ... _eos Я не верю в это . _eos Она слишком свободная на мне , чувак .

Besides being lowercased, "test.dst" in the Docrepair dataset contains many " `" markers that split tokens (e.g., "уродскую" in the first line).
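A quick way to check whether the two files differ only in casing and in the " `" marker is sketched below. The helper names `undo_splits` and `same_reference` are hypothetical, and the sketch assumes the marker is always a space followed by a backtick, which is only what the first line suggests.

```python
def undo_splits(line: str, marker: str = " `") -> str:
    """Rejoin tokens that were split by the marker (assumed format)."""
    return line.replace(marker, "")


def same_reference(docrepair_line: str, context_line: str) -> bool:
    """True if the lines match once splits are undone and case is ignored."""
    return undo_splits(docrepair_line).lower() == context_line.lower()


if __name__ == "__main__":
    # Fragments of the first lines quoted above.
    docrepair = "вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку ."
    context = "Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку ."
    print(same_reference(docrepair, context))
```

Running this over every line pair of the two files would show whether any line differs in more than casing and splitting.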

I would like to know:

  1. Which reference is correct?
  2. Does the Docrepair dataset use a different tokenization from the context-aware dataset?

Looking forward to your reply. :)
