Just encountered a rather subtle bug in how HyperscanTokenizer handles citations without page numbers. (Feature introduced in #116, so I am to blame!) Minimally reproducible version of the problem:
# Setup
from eyecite.find import get_citations
from eyecite.tokenizers import HyperscanTokenizer
TOKENIZER = HyperscanTokenizer(cache_dir='.hyperscan')
text = "foo, 574 U. S. ___, bar"
# With HyperscanTokenizer: edition_guess is None (the bug)
citations = get_citations(text, tokenizer=TOKENIZER)
citations[0].edition_guess
# With the default tokenizer: edition_guess is U.S. (correct)
citations = get_citations(text)
citations[0].edition_guess
After closer review, it appears that the trailing comma in the string (e.g., 574 U. S. ___,) is essential for this bug. I have not fully investigated the root cause, but I think the trailing comma triggers a match with an additional regex, e.g., the short form regex. (The key to this bug is that the citation token is matched by multiple regexes. That is not itself a problem, but it currently causes the buggy downstream behavior.)
mattdahl added a commit to mattdahl/eyecite that referenced this issue (Jan 19, 2023)
The problem occurs because of how HyperscanTokenizer deals with citation tokens that are matched by multiple regexes: https://github.com/freelawproject/eyecite/blob/main/eyecite/tokenizers.py#L309. Rather than discarding one match or the other, it attempts to merge them: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L561. However, when the citations' editions are merged, duplicates are not removed, and we only set edition_guess if there is exactly one candidate edition: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L247. So it erroneously records the citation as having two candidate editions, even when those two candidates are the same.

PR to follow...
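To make the failure mode concrete, here is a minimal sketch that does not use eyecite at all. ToyCitation, its editions field, and both merge methods are hypothetical stand-ins invented for illustration; they only mimic the behavior described above (merge by concatenation, guess only when exactly one candidate remains) and are not the library's actual classes or the eventual fix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyCitation:
    """Hypothetical stand-in for a citation token; NOT the real eyecite class."""

    editions: List[str] = field(default_factory=list)

    def merge_naive(self, other: "ToyCitation") -> "ToyCitation":
        # Mimics the described bug: candidate editions are concatenated,
        # so the same edition can appear twice after a merge.
        return ToyCitation(self.editions + other.editions)

    def merge_deduped(self, other: "ToyCitation") -> "ToyCitation":
        # One possible remedy (an assumption, not necessarily the actual PR):
        # drop duplicates while preserving the original order.
        return ToyCitation(list(dict.fromkeys(self.editions + other.editions)))

    def edition_guess(self) -> Optional[str]:
        # A guess is only made when exactly one candidate edition remains.
        if len(self.editions) == 1:
            return self.editions[0]
        return None


# The same token matched by two regexes, each proposing the same edition.
first_match = ToyCitation(["U.S."])
second_match = ToyCitation(["U.S."])

print(first_match.merge_naive(second_match).edition_guess())
print(first_match.merge_deduped(second_match).edition_guess())
```

In this toy model, the naive merge leaves two identical candidates and so yields no guess, while the deduplicating merge collapses them to one candidate and recovers the edition.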