Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NER: Ensure zero-cost sequence with sentence split in entity #12465

Draft
wants to merge 3 commits into
base: v5
Choose a base branch
from

Conversation

danieldk
Copy link
Contributor

Description

If we use a sentence splitter as one of the annotating components during training, an entity can become split in the predicted Doc. Before this change, training would fail, because no zero-cost transition sequence could be found.

This fixes two scenarios:

  1. When the gold action is B and a split occurs after the current token, the BEGIN action is invalid. However, this was the only possible zero-cost action. This change makes OUT a zero-cost action in this case.
  2. When the gold action is I and a split occurs after the current token, the IN action is invalid, removing the only zero-cost action. This change makes LAST a zero-cost action, so that the entity can be properly closed.

Types of change

Bugfix

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

If we use a sentence splitter as one of the annotating components during
training, an entity can become split in the predicted `Doc`. Before this
change, training would fail, because no zero-cost transition sequence
could be found.

This fixes two scenarios:

1. When the gold action is `B` and a split occurs after the current
   token, the `BEGIN` action is invalid. However, this was the only
   possible zero-cost action. This change makes `OUT` a zero-cost
   action in this case.
2. When the gold action is `I` and a split occurs after the current
   token, the `IN` action is invalid, removing the only zero-cost
   action. This change makes `LAST` a zero-cost action, so that the
   entity can be properly closed.
@danieldk
Copy link
Contributor Author

One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

@danieldk danieldk added bug Bugs and behaviour differing from documentation feat / ner Feature: Named Entity Recognizer labels Mar 24, 2023
@honnibal
Copy link
Member

One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

U would introduce a new incorrect prediction, so it should have cost 1 as well. The cost is false_negatives + false_positives. So if we can't recover a correct entity we should give up on it and not return an entity that has partial overlap within it. And once we've started an entity that's incorrect, we can end it whenever --- so long as we're not making a gold entity unrecoverable.

@svlandeg svlandeg changed the base branch from master to main January 29, 2024 09:50
@svlandeg svlandeg changed the base branch from v4 to v5 May 2, 2024 14:51
@svlandeg svlandeg added the 🔜 v5.0 Related to upcoming v5.0 label May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation feat / ner Feature: Named Entity Recognizer 🔜 v5.0 Related to upcoming v5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants