
HyperscanTokenizer erroneously introduces duplicate reporter edition candidates for citations without page numbers #137

Closed
mattdahl opened this issue Jan 19, 2023 · 1 comment

@mattdahl (Contributor)

Just encountered a rather subtle bug in how HyperscanTokenizer handles citations without page numbers. (The feature was introduced in #116, so I am to blame!) A minimal reproduction of the problem:

# Setup
from eyecite.find import get_citations
from eyecite.tokenizers import HyperscanTokenizer

TOKENIZER = HyperscanTokenizer(cache_dir='.hyperscan')
text = "foo, 574 U. S. ___, bar"

# With the hyperscan tokenizer, edition_guess returns None
citations = get_citations(text, tokenizer=TOKENIZER)
citations[0].edition_guess

# With the default tokenizer, edition_guess returns U.S. (correctly)
citations = get_citations(text)
citations[0].edition_guess

The problem stems from how HyperscanTokenizer handles citation tokens that are matched by multiple regexes: https://github.com/freelawproject/eyecite/blob/main/eyecite/tokenizers.py#L309. Rather than discarding one match or the other, it attempts to merge them: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L561. However, when the citations' editions are merged, duplicates are not removed, and edition_guess is only set when there is exactly one candidate edition: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L247. So the citation is erroneously recorded as having two candidate editions, even when those two candidates are the same.
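
For what it's worth, here is a minimal sketch of the deduplication idea in plain Python. The names are illustrative only, not eyecite's actual internals:

# Illustrative sketch only -- merge_editions/guess_edition are hypothetical names,
# not eyecite's real API. Two regexes match the same "574 U. S. ___," span and each
# contributes the same candidate edition, so naive concatenation yields two
# candidates and the guess is abandoned.
def merge_editions(a: tuple, b: tuple) -> tuple:
    """Merge candidate editions from two matched tokens, dropping duplicates
    while preserving order."""
    merged = []
    for edition in a + b:
        if edition not in merged:
            merged.append(edition)
    return tuple(merged)

def guess_edition(editions: tuple):
    """Only guess an edition when exactly one candidate remains."""
    return editions[0] if len(editions) == 1 else None

assert guess_edition(("U.S.",) + ("U.S.",)) is None                   # current behavior: no guess
assert guess_edition(merge_editions(("U.S.",), ("U.S.",))) == "U.S."  # with dedup: correct guess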

PR to follow...

@mattdahl (Contributor, Author)

After closer review, it appears that the trailing comma in the string (e.g., 574 U. S. ___,) is essential for this bug. I have not fully investigated the root cause, but I suspect the trailing comma triggers a match with an additional regex, e.g., the short form regex. (The key to this bug is that the citation token is matched by multiple regexes, which is not itself a problem but currently causes the buggy downstream behavior.)
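
If it helps, the comma's role can be checked from the public API alone. This just re-runs get_citations on the two variants; no internals are assumed, and the expected outputs follow the observation above:

from eyecite.find import get_citations
from eyecite.tokenizers import HyperscanTokenizer

TOKENIZER = HyperscanTokenizer(cache_dir='.hyperscan')

# With the trailing comma, the hyperscan tokenizer drops the edition guess
print(get_citations("foo, 574 U. S. ___, bar", tokenizer=TOKENIZER)[0].edition_guess)  # None

# Without the trailing comma, the guess comes back as expected
print(get_citations("foo, 574 U. S. ___ bar", tokenizer=TOKENIZER)[0].edition_guess)  # U.S.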
