Improved quote detection and attribution #382
base: main
Conversation
hey @afriedman412 , thanks for your patience on this. i've left comments requesting a handful of minor changes. there's also a consistent formatting issue that makes diffing a bit hard; could you run `black` over the changed modules, so that the code formatting is standard / consistent with the rest of textacy?
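(note: from the repo root, something like the command below should do it; the paths are just the files touched in this PR, and `black` accepts file paths directly.)

```sh
python -m black src/textacy/extract/triples.py src/textacy/constants.py tests/extract/test_triples.py
```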
src/textacy/extract/triples.py
Outdated
```diff
 )
 from spacy.tokens import Doc, Span, Token
+import regex as re
```
textacy doesn't currently have `regex` as a dependency. is it possible to use the stdlib `re` module here instead?
probably? I've had some issues using `re` when working with finicky regular expressions, so I kind of use it by default now, but I can test it.
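(for reference, the only pattern visible in this diff is the literal `r"\n"`, which behaves identically under the stdlib; `regex` mostly adds extras like `\p{...}` Unicode properties and variable-length lookbehind, none of which are used here. a quick sketch with made-up token texts:)

```python
import re  # stdlib, not the third-party regex package

# a literal-newline match works the same in both libraries
tok_texts = ["\u201c", "\n", "said"]
print([t for t in tok_texts if re.match(r"\n", t)])  # -> ["\n"]
```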
```diff
@@ -9,12 +9,12 @@

 import collections
 from operator import attrgetter
-from typing import Iterable, Mapping, Optional, Pattern
+from typing import Iterable, Mapping, Optional, Pattern, Literal
```
Suggested change:

```diff
-from typing import Iterable, Mapping, Optional, Pattern, Literal
+from typing import Iterable, Literal, Mapping, Optional, Pattern
```
```python
content = doc[qtok_start_idx : qtok_end_idx + 1]
# pairs up quotation-like characters based on acceptable start/end combos
# see constants for more info
qtoks = [tok for tok in doc if tok.is_quote or (re.match(r"\n", tok.text))]
```
why do we consider tokens with `"\n"` in them to be quotation-like?
some formatting dictates that if you start a new paragraph while quoting someone, you start the next paragraph with a quotation mark even though the original quotation mark is never closed. in that case the linebreak functions as a closing quotation mark.
i added a test for it, but it's not a great example -- I'll find a better one.
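(to illustrate the convention, with made-up text rather than the PR's test case: each paragraph of a multi-paragraph quote re-opens with a quotation mark, but only the final paragraph closes it, so the newline is the only signal that the first paragraph's quote has ended.)

```python
# made-up example of the multi-paragraph quoting convention: note the
# unclosed quote at the end of the first paragraph
text = (
    '"We will keep going, no matter what.\n'
    '"And we will finish what we started," she said.'
)
```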
src/textacy/extract/triples.py
Outdated
```diff
@@ -27,9 +27,10 @@
 nsubjpass,
 obj,
 pobj,
-xcomp,
+xcomp
```
this comma was here for a reason -- `black` put it there automatically :)

Suggested change:

```diff
-xcomp
+xcomp,
```
src/textacy/constants.py
Outdated
```diff
@@ -21,6 +21,51 @@
 OBJ_DEPS: set[str] = {"attr", "dobj", "dative", "oprd"}
 AUX_DEPS: set[str] = {"aux", "auxpass", "neg"}

+MIN_QUOTE_LENGTH: int=4
```
i'm not sure this is a "constant" value, seems more like a parameter with a default value that should go in the `direct_quotations` extraction function. what do you think?
yeah that makes sense
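(a minimal sketch of the agreed change, with the exact signature assumed rather than copied from the PR: the threshold moves out of `constants.py` and becomes a keyword argument with the same default.)

```python
# hypothetical signature -- parameter name and default taken from the
# discussion above, everything else assumed
def direct_quotations(doc, min_quote_length: int = 4):
    """Extract direct quotations from ``doc``, skipping candidate
    quotes shorter than ``min_quote_length`` tokens."""
    ...
```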
```python
    and q.i > qtok_idx_pairs[-1][1]
):
    for q_ in qtoks[n+1:]:
        if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
```
why do we store -- and compare against -- the `ord` values instead of just the "raw" text quotation marks?
less room for error
could you elaborate on that?
there are lots of ways something that is supposed to be a raw text quotation mark gets tokenized incorrectly, when you start dealing with encoding, decoding, pulling text from html, escape character issues, the `whitespace_` issue above, etc etc etc. as the edge cases piled up, I realized the `ord` value was consistent no matter what. so I decided to use that instead.
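(concretely, pairing by code point looks something like the sketch below; the pair values here are an assumption about what `constants.QUOTATION_MARK_PAIRS` might contain, not the PR's actual contents.)

```python
# hypothetical pair set -- curly and straight quotes can look identical in
# an editor, but their code points are unambiguous
QUOTATION_MARK_PAIRS = {
    (34, 34),      # '"' ... '"'  straight double quotes
    (8220, 8221),  # '\u201c' ... '\u201d'  curly double quotes
    (39, 39),      # "'" ... "'"  straight single quotes
    (8216, 8217),  # '\u2018' ... '\u2019'  curly single quotes
}

def is_matching_pair(opener: str, closer: str) -> bool:
    # single-character token texts assumed, as in the snippet above
    return (ord(opener), ord(closer)) in QUOTATION_MARK_PAIRS

assert is_matching_pair("\u201c", "\u201d")
assert not is_matching_pair("\u201c", '"')
```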
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. i think at a later point i may revisit some of this logic, under the assumption that the user has dealt with bad text encodings, etc. before attempting quote detection. but it's probably fine for now :)
```diff
@@ -305,15 +304,105 @@ def expand_noun(tok: Token) -> list[Token]:
         child
         for tc in tok_and_conjuncts
         for child in tc.children
-        # TODO: why doesn't compound import from spacy.symbols?
```
just wondering why this line was deleted? it's a comment, for me! :)
i thought it was my comment lol
tests/extract/test_triples.py
Outdated
```diff
 ),
 )
 ],
 )
+
```
looks like your code editor is making some spurious changes (here and elsewhere) that aren't PEP-compliant / black-enforced. we'll want to fix these before any merge.
tests/extract/test_triples.py
Outdated
```diff
 )
 ]
 )
+
```
…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes
Description
Motivation and Context
This is part of a larger project to create a package to combine quote detection and attribution with coreference resolution, which will be used for the analysis of several thousand newspaper articles.
How Has This Been Tested?
A/B testing with random samples of said articles, test creation after major changes.
(New tests added as well.)