
Could I get your 300 pairs to check whether I recovered the model exactly? #2

Open
li3cmz opened this issue Dec 6, 2019 · 10 comments

@li3cmz

li3cmz commented Dec 6, 2019

It may include sample_300.txt, sample_300_tgt.txt, and pred.txt.

Looking forward to your reply~

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 6, 2019

Yes, reproducing the performance of RUBER requires the human annotations.
But I'm sorry that I didn't save them.
You can try to annotate the responses yourself and check the correlation.
In my work, I asked three students from BFS (Beijing Foreign School) to annotate the responses.

But I can give you some suggestions:

  • BERT-RUBER is much better than other automatic evaluation metrics such as BLEU and ROUGE.
  • A correlation under 0.2 is questionable.
  • During training, make sure the accuracy on the dev and test datasets is at least 0.6.
  • RUBER's performance is very unstable (I attribute this issue to the Bi-GRU; replacing the RNN with the BERT embedding works much better, so I recommend you use BERT-RUBER instead of RUBER).
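
For checking the correlation with your own annotations, a minimal sketch along these lines may help. The file names (human_scores.txt, metric_scores.txt) are placeholders, not files from this repo; it only assumes one float score per line for the same set of responses.

```python
# Minimal sketch: correlate automatic metric scores with human annotations.
# File names are hypothetical placeholders, not part of this repository.
from scipy.stats import pearsonr, spearmanr

def load_scores(path):
    # One float score per line.
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]

human = load_scores('human_scores.txt')    # e.g. averaged annotator ratings for the 300 responses
metric = load_scores('metric_scores.txt')  # e.g. BERT-RUBER scores for the same responses

p, p_pval = pearsonr(human, metric)
s, s_pval = spearmanr(human, metric)
print(f'Pearson  r={p:.3f} (p={p_pval:.3g})')
print(f'Spearman r={s:.3f} (p={s_pval:.3g})')
```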

@li3cmz
Author

li3cmz commented Dec 6, 2019

Have you tried any other datasets? And are their correlations higher than 0.2?

@gmftbyGMFTBY
Owner

Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.

@gmftbyGMFTBY
Owner

Actually, you can try 100 samples to check the performance (100 or 300 samples are both appropriate).
I'm so sorry that I didn't save the logs of the annotations.

@li3cmz
Author

li3cmz commented Dec 7, 2019

Get it! Thank you for your help!

@gmftbyGMFTBY
Owner

Okay, feel free to raise issues when you run into trouble with this project. 😄

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 7, 2019

Oh, I forgot something.
The 0.4 correlation may not be very precise.
Due to the differences among the datasets, the performance of RUBER or BERT-RUBER is not very stable, which is why I run 10 times and report the averaged results as the final performance.

Actually, you can simply compare the performance with the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit that contains the other baseline metrics in a few days).

If the performance of BERT-RUBER is much better than theirs (around a 10% correlation improvement, I guess), you can be sure that you have reproduced this learning-based metric.
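
As one way to get such a word-overlap baseline before that commit lands, here is a sketch using sentence-level BLEU from NLTK and correlating it with human scores. The file names, the human_scores.txt annotation file, and the smoothing choice are illustrative assumptions, not this repo's setup.

```python
# Sketch of a word-overlap baseline for comparison: sentence-level BLEU via NLTK.
# File names and the smoothing function are assumptions for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

def read_lines(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

refs = read_lines('sample_300_tgt.txt')                      # ground-truth responses
hyps = read_lines('pred.txt')                                # generated responses
human = [float(x) for x in read_lines('human_scores.txt')]   # hypothetical annotation file

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([r.split()], h.split(), smoothing_function=smooth)
        for r, h in zip(refs, hyps)]

rho, _ = spearmanr(human, bleu)
print(f'BLEU vs. human Spearman correlation: {rho:.3f}')
# Compare this value against BERT-RUBER's correlation computed the same way.
```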

@li3cmz
Author

li3cmz commented Dec 7, 2019

OK, thanks for the detailed answers. And have you ever tried training on DailyDialog and directly testing on another dataset? How was its performance?

@li3cmz
Author

li3cmz commented Dec 7, 2019

And can the context only contain one speaker?

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 7, 2019

  1. I didn't verify the transfer-learning performance that is shown in RUBER. I will verify this aspect in the future. Actually, I think this experiment is not very essential: if the pretrained score model is released, I think its effectiveness will be proven.

  2. You want to make sure that the learning-based metrics can be applied to multi-turn or multi-party dialogue systems. I think it is very easy. You only need to find a good strategy for encoding the conversation context, and I think BERT is still useful (a sketch of both options follows below):

    • Simply feed the whole conversation context into BERT and obtain the sentence embedding.
    • Feed each utterance in the conversation context separately to obtain its embedding, then add them up.
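
A minimal sketch of those two context-encoding options, assuming the Hugging Face transformers library, bert-base-uncased, and mean pooling over token states; the exact model and pooling used in BERT-RUBER may differ.

```python
# Sketch of the two context-encoding options above (assumptions: transformers,
# bert-base-uncased, mean pooling; not necessarily the BERT-RUBER setup).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').eval()

def embed(text):
    """Mean-pooled BERT embedding for a single string."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # (768,)

context = ["How was your day?", "Pretty good, I went hiking.", "Where did you go?"]

# Option 1: feed the whole context as one sequence.
ctx_emb_joint = embed(' [SEP] '.join(context))

# Option 2: embed each utterance separately and add them up.
ctx_emb_sum = torch.stack([embed(u) for u in context]).sum(dim=0)
```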
