diff --git a/index.md b/index.md index b8eeb10..17bc1b4 100644 --- a/index.md +++ b/index.md @@ -22,13 +22,14 @@ I (still) [teach NLP](https://github.com/yandexdataschool/nlp_course) at the [Ya ## News -* 03-06/2021 Invited talks at: [Stanford NLP Seminar](https://nlp.stanford.edu/seminar/), CornellNLP, [MT@UPC](https://mt.cs.upc.edu/seminars/), CambridgeNLP, [DeeLIO workshop at NAACL 2021](https://sites.google.com/view/deelio-ws/), ...TBU. -* 10-12/2020 Invited talks at: CMU, [USC ISI](https://nlg.isi.edu/nl-seminar/), ENS Paris, [ML Street Talk](https://www.youtube.com/watch?v=Q0kN_ZHHDQY). +* 06/2021 Our [Source and Target Contributions paper](https://arxiv.org/pdf/2010.10907.pdf) is _accepted to __ACL__ 2021_. +* 03-06/2021 Invited talks at: [Stanford NLP Seminar](https://nlp.stanford.edu/seminar/), CornellNLP, [MT@UPC](https://mt.cs.upc.edu/seminars/), CambridgeNLP, [DeeLIO workshop at NAACL 2021](https://sites.google.com/view/deelio-ws/). +* 10-12/2020 Invited talks at: CMU, [USC ISI](https://nlg.isi.edu/nl-seminar/), ENS Paris, [ML Street Talk](https://www.youtube.com/watch?v=Q0kN_ZHHDQY). * 09/2020 __2__ papers _accepted to __EMNLP__ 2020_. -* 06-08/2020 Invited talks at: MIT, DeepMind, [Grammarly AI](https://grammarly.ai/information-theoretic-probing-with-minimum-description-length/), Unbabel, [NLP with Friends](https://nlpwithfriends.com). +* 06-08/2020 Invited talks at: MIT, DeepMind, [Grammarly AI](https://grammarly.ai/information-theoretic-probing-with-minimum-description-length/), Unbabel, [NLP with Friends](https://nlpwithfriends.com). * 04/2020 Our [BPE-dropout](https://arxiv.org/pdf/1910.13267.pdf) is _accepted to __ACL__ 2020_. * 01/2020 I'm [awarded Facebook PhD Fellowship](https://research.fb.com/blog/2020/01/announcing-the-recipients-of-the-2020-facebook-fellowship-awards/). 
-* 01/2020 Invited talks at: [Rasa](https://www.meetup.com/ru-RU/Bots-Berlin-Build-better-conversational-interfaces-with-AI/events/267058207/), Google Research Berlin, [Naver Labs Europe](https://europe.naverlabs.com/research/seminars/analyzing-information-flow-in-transformers/), NLP track at [Applied Machine Learning Days at EPFL](https://appliedmldays.org/tracks/ai-nlp). +* 01/2020 Invited talks at: [Rasa](https://www.meetup.com/ru-RU/Bots-Berlin-Build-better-conversational-interfaces-with-AI/events/267058207/), Google Research Berlin, [Naver Labs Europe](https://europe.naverlabs.com/research/seminars/analyzing-information-flow-in-transformers/), NLP track at [Applied Machine Learning Days at EPFL](https://appliedmldays.org/tracks/ai-nlp). * 08-09/2019 __2__ papers _accepted to __EMNLP__ 2019_, one at __NeurIPS__ _2019_. * 05/2019 __2__ papers _accepted to __ACL__ 2019_, one is oral. diff --git a/posts.html b/posts.html index 9ec4977..90aadcd 100644 --- a/posts.html +++ b/posts.html @@ -2,7 +2,7 @@ layout: default title: Blog description: Intuitive explanations for some of my papers. -menu: no +menu: yes order: 1 --- diff --git a/posts/nmt_inside_out.html b/posts/nmt_inside_out.html index d4a0e90..ed53ba4 100644 --- a/posts/nmt_inside_out.html +++ b/posts/nmt_inside_out.html @@ -1048,21 +1048,18 @@
What will our model do: ignore the source or the prefix? Previous work shows that, in principle, our model can ignore either the source or the prefix.
-We see that at early generation steps, when the prefix is short, the model “recovers”. - It ignores the prefix: we see very high source contribution. - - But later, when the prefix is long, the model starts to ignore the source: the source contribution - drops down significantly. +
As the results show, the model tends to fall into hallucination mode even when a random prefix + is very short, e.g. + one token: we see a large drop in source influence at all positions. + This + behavior is what we would expect when a model is hallucinating, and the model shows no ability to recover on its own.
Overall, a model’s decision of which of these two contradicting parts to support - changes depending on the prefix length. If the prefix is short, it relies on the source; - if the prefix is long, it relies on the prefix. -
+What will our model do: ignore the source or the prefix? According to previous work, it can do either!
-As we see from the results, it depends on the prefix length. When a random prefix is short, - the model recovers: it ignores the prefix and bases its predictions mostly on the source. When a random prefix becomes longer, - the model's choice shifts towards ignoring the source: source contribution drops drastically. This - behavior is what we would expect when a model is hallucinating. +
+ As the results show, the model tends to fall into hallucination mode even when a random prefix + is very short, e.g. + one token: we see a large drop in source influence at all positions. + This + behavior is what we would expect when a model is hallucinating, and the model shows no ability to recover on its own.
Next, we see that with a random prefix, the entropy of contributions is very high and is roughly constant across @@ -421,11 +440,15 @@
We want to check how prone to hallucinations models are when they suffer from exposure bias to differing extents. - For this, we feed fluent but unrelated to source prefixes and look whether a model is likely to fall into a - language modeling regime, i.e., to what extent it ignores the source. + For this, we feed two types of prefixes, + prefixes of either model-generated translations or random sentences, and look at model behavior. + While conditioning on model-generated prefixes shows what happens in the standard setting at inference time, + random prefixes (fluent prefixes unrelated to the source) show + whether the model is likely to fall into a + language modeling regime, i.e., to what extent it ignores the source and hallucinates.
How: Feed Random Prefixes, Look at Contributions - + -The results confirm our hypothesis: +
The results for both types of prefixes confirm our hypothesis:
First, we see that, generally, models trained with more data use source more. - Surprisingly, this increase is not spread evenly across positions: - at approximately 80% of the target length, models trained with more data use - source more, but at the last positions, they switch to more actively using the prefix. -
-Second, with more training data, the model becomes more confident in the choice of important tokens: the entropy + Second, with more training data, the model becomes more confident in the choice of important tokens: the entropy of contributions becomes lower (in the paper, we also show entropy of target contributions).
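The two quantities discussed in this hunk, total source contribution and the entropy of contributions, can be illustrated with a minimal sketch. It assumes that per-token contributions for one generation step are already available as non-negative numbers (how they are obtained is not shown here), and the function names `source_contribution` and `contribution_entropy` are hypothetical, chosen only for this illustration.

```python
import math

def source_contribution(src_contribs, prefix_contribs):
    """Share of the total contribution that comes from source tokens.

    Assumption: both lists hold non-negative per-token contributions
    for a single generation step.
    """
    total = sum(src_contribs) + sum(prefix_contribs)
    return sum(src_contribs) / total

def contribution_entropy(contribs):
    """Shannon entropy (in nats) of contributions normalized to sum to 1.

    Low entropy: a few tokens dominate (a confident choice of important tokens).
    High entropy: contributions are spread evenly across tokens.
    """
    total = sum(contribs)
    probs = [c / total for c in contribs if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under these assumptions, a drop in `source_contribution` across positions corresponds to the model relying more on the prefix, and a lower `contribution_entropy` corresponds to more focused contributions.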