NER: Ensure zero-cost sequence with sentence split in entity #12465

danieldk · 2023-03-24T14:40:38Z

Description

If we use a sentence splitter as one of the annotating components during training, an entity can become split in the predicted Doc. Before this change, training would fail, because no zero-cost transition sequence could be found.

This fixes two scenarios:

When the gold action is B and a split occurs after the current token, the BEGIN action is invalid. However, this was the only possible zero-cost action. This change makes OUT a zero-cost action in this case.
When the gold action is I and a split occurs after the current token, the IN action is invalid, removing the only zero-cost action. This change makes LAST a zero-cost action, so that the entity can be properly closed.

Types of change

Bugfix

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

If we use a sentence splitter as one of the annotating components during training, an entity can become split in the predicted `Doc`. Before this change, training would fail, because no zero-cost transition sequence could be found. This fixes two scenarios: 1. When the gold action is `B` and a split occurs after the current token, the `BEGIN` action is invalid. However, this was the only possible zero-cost action. This change makes `OUT` a zero-cost action in this case. 2. When the gold action is `I` and a split occurs after the current token, the `IN` action is invalid, removing the only zero-cost action. This change makes `LAST` a zero-cost action, so that the entity can be properly closed.

danieldk · 2023-03-24T14:41:17Z

One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

honnibal · 2023-06-27T10:55:04Z

One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

U would introduce a new incorrect prediction, so it should have cost 1 as well. The cost is false_negatives + false_positives. So if we can't recover a correct entity we should give up on it and not return an entity that has partial overlap within it. And once we've started an entity that's incorrect, we can end it whenever --- so long as we're not making a gold entity unrecoverable.

danieldk added bug Bugs and behaviour differing from documentation feat / ner Feature: Named Entity Recognizer labels Mar 24, 2023

danieldk added 2 commits March 24, 2023 15:42

Typo

ff2fce0

Consistency

2b4f0c6

svlandeg changed the base branch from master to main January 29, 2024 09:50

svlandeg changed the base branch from v4 to v5 May 2, 2024 14:51

svlandeg added the 🔜 v5.0 Related to upcoming v5.0 label May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NER: Ensure zero-cost sequence with sentence split in entity #12465

NER: Ensure zero-cost sequence with sentence split in entity #12465

danieldk commented Mar 24, 2023

danieldk commented Mar 24, 2023

honnibal commented Jun 27, 2023

NER: Ensure zero-cost sequence with sentence split in entity #12465

Are you sure you want to change the base?

NER: Ensure zero-cost sequence with sentence split in entity #12465

Conversation

danieldk commented Mar 24, 2023

Description

Types of change

Checklist

danieldk commented Mar 24, 2023

honnibal commented Jun 27, 2023