
Could I get your 300 pairs to check whether I recovered the model exactly? #2

Open
li3cmz opened this issue Dec 6, 2019 · 10 comments

@li3cmz

li3cmz commented Dec 6, 2019

It may include sample_300.txt, sample_300_tgt.txt, and pred.txt.

Looking forward to your reply~

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 6, 2019

Yes, reproducing the performance of RUBER requires the human annotations.
But I'm sorry that I didn't save them.
You can try to annotate the responses yourself and check the correlation.
In my work, I asked three students from BFS (Beijing Foreign School) to annotate the responses.

But I can give you some suggestions:

  • BERT-RUBER is much better than other automatic evaluation metrics such as BLEU and ROUGE.
  • A correlation under 0.2 is questionable.
  • During training, make sure the accuracy on the dev and test datasets is at least 0.6.
  • RUBER's performance is very unstable (I attribute this issue to the Bi-GRU; replacing the RNN with the BERT embedding works much better, so I recommend you use BERT-RUBER instead of RUBER).
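
For checking the correlation with your own annotations, a minimal sketch along these lines may help. The file names (human_scores.txt, metric_scores.txt) are placeholders, not files from this repo; it only assumes one float score per line for the same set of responses.

```python
# Minimal sketch: correlate automatic metric scores with human annotations.
# File names are hypothetical placeholders, not part of this repository.
from scipy.stats import pearsonr, spearmanr

def load_scores(path):
    # One float score per line.
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]

human = load_scores('human_scores.txt')    # e.g. averaged annotator ratings for the 300 responses
metric = load_scores('metric_scores.txt')  # e.g. BERT-RUBER scores for the same responses

p, p_pval = pearsonr(human, metric)
s, s_pval = spearmanr(human, metric)
print(f'Pearson  r={p:.3f} (p={p_pval:.3g})')
print(f'Spearman r={s:.3f} (p={s_pval:.3g})')
```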

@li3cmz
Author

li3cmz commented Dec 6, 2019

Have you tried any other datasets? And are their correlations higher than 0.2?

@gmftbyGMFTBY
Owner

Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.

@gmftbyGMFTBY
Owner

Actually, you can try 100 samples to check the performance (100 or 300 samples are both appropriate).
I'm so sorry that I didn't save the logs of the annotations.

@li3cmz
Author

li3cmz commented Dec 7, 2019

Get it! Thank you for your help!

@gmftbyGMFTBY
Owner

Okay, feel free to raise issues when you run into trouble with this project. 😄

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 7, 2019

Oh, I forgot something.
The 0.4 correlation may not be very precise.
Due to the differences among the datasets, the performance of RUBER or BERT-RUBER is not very stable, which is why I run 10 times and report the averaged results as the final performance.

Actually, you can simply compare the performance with the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit that contains the other baseline metrics in a few days).

If the performance of BERT-RUBER is much better than theirs (around a 10% correlation improvement, I guess), you can be sure that you have reproduced this learning-based metric.
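
As one way to get such a word-overlap baseline before that commit lands, here is a sketch using sentence-level BLEU from NLTK and correlating it with human scores. The file names, the human_scores.txt annotation file, and the smoothing choice are illustrative assumptions, not this repo's setup.

```python
# Sketch of a word-overlap baseline for comparison: sentence-level BLEU via NLTK.
# File names and the smoothing function are assumptions for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

def read_lines(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

refs = read_lines('sample_300_tgt.txt')                      # ground-truth responses
hyps = read_lines('pred.txt')                                # generated responses
human = [float(x) for x in read_lines('human_scores.txt')]   # hypothetical annotation file

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([r.split()], h.split(), smoothing_function=smooth)
        for r, h in zip(refs, hyps)]

rho, _ = spearmanr(human, bleu)
print(f'BLEU vs. human Spearman correlation: {rho:.3f}')
# Compare this value against BERT-RUBER's correlation computed the same way.
```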

@li3cmz
Author

li3cmz commented Dec 7, 2019

OK, thanks for the detailed answers. And have you ever tried training on DailyDialog and directly testing on another dataset? How was its performance?

@li3cmz
Author

li3cmz commented Dec 7, 2019

And can the context only contain one speaker?

@gmftbyGMFTBY
Owner

gmftbyGMFTBY commented Dec 7, 2019

  1. I didn't verify the transfer-learning performance that is shown in RUBER. I will verify this aspect in the future. Actually, I think this experiment is not very essential: if the pretrained score model is released, I think its effectiveness will be proven.

  2. You want to make sure that the learning-based metrics can be applied to multi-turn or multi-party dialogue systems. I think it is very easy. You only need to find a good strategy for encoding the conversation context, and I think BERT is still useful (a sketch of both options follows below):

    • Simply feed the whole conversation context into BERT and obtain the sentence embedding.
    • Feed each utterance in the conversation context separately to obtain its embedding, then add them up.
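
A minimal sketch of those two context-encoding options, assuming the Hugging Face transformers library, bert-base-uncased, and mean pooling over token states; the exact model and pooling used in BERT-RUBER may differ.

```python
# Sketch of the two context-encoding options above (assumptions: transformers,
# bert-base-uncased, mean pooling; not necessarily the BERT-RUBER setup).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').eval()

def embed(text):
    """Mean-pooled BERT embedding for a single string."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # (768,)

context = ["How was your day?", "Pretty good, I went hiking.", "Where did you go?"]

# Option 1: feed the whole context as one sequence.
ctx_emb_joint = embed(' [SEP] '.join(context))

# Option 2: embed each utterance separately and add them up.
ctx_emb_sum = torch.stack([embed(u) for u in context]).sum(dim=0)
```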
