Could I get your 300 pairs to check whether I reproduce the model exactly? #2
Yes, reproducing the performance of RUBER needs human annotation. But I can give you some suggestions:
Have you tried it on any other dataset? And are their correlations higher than 0.2?
Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.
Actually, you can try 100 samples to check the performance (either 100 or 300 samples is appropriate).
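For reference, a minimal sketch of that sampling step, assuming the contexts, ground-truth responses, and generated responses live in parallel lists. The names (`contexts`, `references`, `predictions`, `sample_pairs`) are illustrative, not part of this repository's API:

```python
import random

def sample_pairs(contexts, references, predictions, n=300, seed=42):
    """Draw n aligned (context, reference, prediction) triples for human annotation."""
    rng = random.Random(seed)  # fixed seed so the same subset can be re-annotated
    indices = rng.sample(range(len(predictions)), n)
    return [(contexts[i], references[i], predictions[i]) for i in indices]
```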
Got it! Thank you for your help!
Okay, feel free to raise issues whenever you run into trouble with this project. 😄
Oh, I forgot something. Actually, you can simply compare the performance with the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit that contains the other baseline metrics in a few days). If the performance of BERT-RUBER is much better than theirs (around a 10% correlation improvement, I guess), you can be confident that you have reproduced this learning-based metric.
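The comparison above boils down to correlating each metric's scores with the human annotations. A hedged sketch using `scipy`, with placeholder score lists that would be computed elsewhere:

```python
from scipy.stats import pearsonr, spearmanr

def report_correlation(name, metric_scores, human_scores):
    """Print Pearson and Spearman correlations between a metric and human judgments."""
    pearson, _ = pearsonr(metric_scores, human_scores)
    spearman, _ = spearmanr(metric_scores, human_scores)
    print(f"{name}: Pearson = {pearson:.3f}, Spearman = {spearman:.3f}")

# Usage (one float per annotated sample; score lists are assumptions):
# report_correlation("BLEU-4", bleu_scores, human_scores)
# report_correlation("BERT-RUBER", ruber_scores, human_scores)
```

If BERT-RUBER's correlation is clearly higher than the word-overlap and embedding baselines, the reproduction is on track.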
OK, thanks for the detailed answers. Have you ever tried training on DailyDialog and testing directly on another dataset? How is its performance?
And can the context contain only one speaker?
It may include sample_300.txt, sample_300_tgt.txt, and pred.txt.
Looking forward to your reply~
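A minimal sketch of loading those three files, assuming one sample per line (the exact file format is not specified in this thread):

```python
def load_lines(path):
    """Read one sample per line, dropping the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

contexts = load_lines("sample_300.txt")        # dialogue contexts
references = load_lines("sample_300_tgt.txt")  # ground-truth responses
predictions = load_lines("pred.txt")           # model-generated responses
assert len(contexts) == len(references) == len(predictions)
```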