
HyperscanTokenizer erroneously introduces duplicate reporter edition candidates for citations without page numbers #137

Closed
mattdahl opened this issue Jan 19, 2023 · 1 comment

@mattdahl (Contributor)

Just encountered a rather subtle bug in how HyperscanTokenizer handles citations without page numbers. (The feature was introduced in #116, so I am to blame!) A minimal reproduction of the problem:

# Setup
from eyecite.find import get_citations
from eyecite.tokenizers import HyperscanTokenizer

TOKENIZER = HyperscanTokenizer(cache_dir='.hyperscan')
text = "foo, 574 U. S. ___, bar"

# With the hyperscan tokenizer, edition_guess returns None
citations = get_citations(text, tokenizer=TOKENIZER)
citations[0].edition_guess

# With the default tokenizer, edition_guess returns U.S. (correctly)
citations = get_citations(text)
citations[0].edition_guess

The problem stems from how HyperscanTokenizer handles citation tokens that are matched by multiple regexes: https://github.com/freelawproject/eyecite/blob/main/eyecite/tokenizers.py#L309. Rather than discarding one match or the other, it attempts to merge them: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L561. However, when the citations' editions are merged, duplicates are not removed, and edition_guess is only set when there is exactly one candidate edition: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L247. So the citation is erroneously recorded as having two candidate editions, even when those two candidates are the same.
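
For what it's worth, here is a minimal sketch of the deduplication idea in plain Python. The names are illustrative only, not eyecite's actual internals:

# Illustrative sketch only -- merge_editions/guess_edition are hypothetical names,
# not eyecite's real API. Two regexes match the same "574 U. S. ___," span and each
# contributes the same candidate edition, so naive concatenation yields two
# candidates and the guess is abandoned.
def merge_editions(a: tuple, b: tuple) -> tuple:
    """Merge candidate editions from two matched tokens, dropping duplicates
    while preserving order."""
    merged = []
    for edition in a + b:
        if edition not in merged:
            merged.append(edition)
    return tuple(merged)

def guess_edition(editions: tuple):
    """Only guess an edition when exactly one candidate remains."""
    return editions[0] if len(editions) == 1 else None

assert guess_edition(("U.S.",) + ("U.S.",)) is None                   # current behavior: no guess
assert guess_edition(merge_editions(("U.S.",), ("U.S.",))) == "U.S."  # with dedup: correct guess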

PR to follow...

@mattdahl (Contributor, Author)

After closer review, it appears that the trailing comma in the string (e.g., 574 U. S. ___,) is essential for this bug. I have not fully investigated the root cause, but I suspect the trailing comma triggers a match with an additional regex, e.g., the short form regex. (The key to this bug is that the citation token is matched by multiple regexes, which is not itself a problem but currently causes the buggy downstream behavior.)
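
If it helps, the comma's role can be checked from the public API alone. This just re-runs get_citations on the two variants; no internals are assumed, and the expected outputs follow the observation above:

from eyecite.find import get_citations
from eyecite.tokenizers import HyperscanTokenizer

TOKENIZER = HyperscanTokenizer(cache_dir='.hyperscan')

# With the trailing comma, the hyperscan tokenizer drops the edition guess
print(get_citations("foo, 574 U. S. ___, bar", tokenizer=TOKENIZER)[0].edition_guess)  # None

# Without the trailing comma, the guess comes back as expected
print(get_citations("foo, 574 U. S. ___ bar", tokenizer=TOKENIZER)[0].edition_guess)  # U.S.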
