
Improved quote detection and attribution #382

Open · wants to merge 8 commits into main

Conversation

afriedman412

Description

  • Pairwise, incremental quote detection looks for specific pairs of quotation characters and no longer requires an even number of quotation marks to work (see the sketch after this list).
  • The attribution window has been expanded and adjusted to improve accuracy and prevent some false positives.
  • Code has been added to prep/standardize text for quote detection.
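A hedged sketch of the pairwise, incremental idea described in the first bullet; names and details are illustrative, not the PR's actual implementation:

```python
# Illustrative only: pair up quote-like marks by scanning left to right and
# matching each candidate opener with the first acceptable closer after it.
# An unmatched (odd) mark is simply skipped rather than breaking detection.
PAIRS = {(34, 34), (8216, 8217), (8220, 8221)}  # ord pairs: "...", ‘...’, “...”

def pair_quote_marks(marks: list[str]) -> list[tuple[int, int]]:
    out: list[tuple[int, int]] = []
    last_end = -1
    for i, mark in enumerate(marks):
        if i <= last_end:
            continue  # already consumed as part of an earlier pair
        for j in range(i + 1, len(marks)):
            if (ord(mark), ord(marks[j])) in PAIRS:
                out.append((i, j))
                last_end = j
                break
    return out

print(pair_quote_marks(["\u201c", "\u201d", '"']))  # [(0, 1)]; the odd '"' is ignored
```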

Motivation and Context

This is part of a larger project to create a package combining quote detection and attribution with coreference resolution, which will be used to analyze several thousand newspaper articles.

How Has This Been Tested?

A/B testing with random samples of said articles, along with test creation after major changes.

(New tests added as well.)

Collaborator @bdewilde left a comment

hey @afriedman412, thanks for your patience on this. i've left comments requesting a handful of minor changes. there's also a consistent formatting issue that makes diffing a bit hard; could you run black over the changed modules, so that the code formatting is standard / consistent with the rest of textacy?

)
from spacy.tokens import Doc, Span, Token
import regex as re
Collaborator

textacy doesn't currently have regex as a dependency. is it possible to use the stdlib re module here instead?

Author

Probably? I've had some issues using re when working with finicky regular expressions, so I use regex by default now, but I can test it.
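For what it's worth, a minimal sketch of the stdlib swap in question; the only usage assumed here is the pattern shown in the diff above:

```python
import re  # stdlib, instead of the third-party "regex" package

# The pattern in the diff is simple enough for stdlib re: it only has to
# check whether a token's text starts with a newline character.
tok_text = "\n"
print(bool(re.match(r"\n", tok_text)))  # True

# "regex" is mainly needed for extras like \p{...} Unicode properties or
# variable-length lookbehind, none of which appear in this diff.
```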

@@ -9,12 +9,12 @@

import collections
from operator import attrgetter
from typing import Iterable, Mapping, Optional, Pattern
from typing import Iterable, Mapping, Optional, Pattern, Literal
Collaborator

Suggested change
from typing import Iterable, Mapping, Optional, Pattern, Literal
from typing import Iterable, Literal, Mapping, Optional, Pattern

content = doc[qtok_start_idx : qtok_end_idx + 1]
# pairs up quotation-like characters based on acceptable start/end combos
# see constants for more info
qtoks = [tok for tok in doc if tok.is_quote or (re.match(r"\n", tok.text))]
Collaborator

why do we consider tokens with "\n" in them to be quotation-like?

Author

Some formatting conventions dictate that if you start a new paragraph while quoting someone, you open the next paragraph with a quotation mark even though the original quotation mark is never closed. In that case, the linebreak functions as a closing quotation mark.

I actually added a test for it, but it's not a great example -- I'll find a better one.
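For readers unfamiliar with the convention, a small illustration (hypothetical text, not the PR's actual test case):

```python
# Hypothetical example of the multi-paragraph quoting convention described
# above: each paragraph re-opens the quote, but only the last one closes it,
# so the linebreak stands in for the missing closing mark.
text = (
    '"This is the first paragraph of the quote, and it has no closing mark.\n'
    '"This is the second paragraph, which finally closes the quote," she said.'
)
```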

@@ -27,9 +27,10 @@
nsubjpass,
obj,
pobj,
xcomp,
xcomp
Collaborator

this comma was here for a reason -- black put it there automatically :)

Suggested change
xcomp
xcomp,

@@ -21,6 +21,51 @@
OBJ_DEPS: set[str] = {"attr", "dobj", "dative", "oprd"}
AUX_DEPS: set[str] = {"aux", "auxpass", "neg"}

MIN_QUOTE_LENGTH: int=4
Collaborator

i'm not sure this is a "constant" value, seems more like a parameter with a default value that should go in the direct_quotations extraction function. what do you think?

Author
yeah that makes sense
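A sketch of what that refactor might look like; the `_candidate_quotes` helper is hypothetical shorthand, though per the later commit message, `min_quote_length` did become a `direct_quotations` parameter:

```python
from typing import Iterator
from spacy.tokens import Doc, Span

def _candidate_quotes(doc: Doc) -> Iterator[Span]:
    # Hypothetical stand-in for the real pairwise detection logic.
    ...

def direct_quotations(doc: Doc, min_quote_length: int = 4) -> Iterator[Span]:
    # min_quote_length moves out of constants.py and into the signature,
    # with the old MIN_QUOTE_LENGTH value (4) as its default.
    for span in _candidate_quotes(doc):
        if len(span) >= min_quote_length:
            yield span
```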

src/textacy/extract/triples.py (thread resolved)
and q.i > qtok_idx_pairs[-1][1]
):
for q_ in qtoks[n+1:]:
if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
Collaborator

why do we store -- and compare against -- the ord values instead of just the "raw" text quotation marks?

Author

less room for error

Collaborator

could you elaborate on that?

Author

There are lots of ways a raw-text quotation mark can get tokenized incorrectly once you start dealing with encoding, decoding, pulling text from HTML, escape-character issues, the whitespace_ issue above, etc. As the edge cases piled up, I realized the ord value was consistent no matter what, so I decided to use that instead.
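A tiny illustration of the point (my example, not PR code): visually similar quote-like marks have distinct, stable code points, which ord() makes explicit.

```python
# After messy encoding, decoding, or HTML extraction, several visually
# similar characters can appear in text; ord() pins down exactly which one
# a token actually contains.
for ch in ['"', "\u201c", "\u201d", "\u2033", "\u00ab"]:
    print(repr(ch), ord(ch))
# '"' 34
# '“' 8220
# '”' 8221
# '″' 8243
# '«' 171
```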

Collaborator

got it. i think at a later point i may revisit some of this logic, under the assumption that the user has dealt with bad text encodings, etc. before attempting quote detection. but it's probably fine for now :)
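For reference, textacy already ships a normalization helper that handles part of this upstream; a usage sketch of how a user might clean text before quote detection:

```python
from textacy import preprocessing

# Normalize "fancy" quotation marks to their ASCII equivalents before
# running quote detection, so downstream logic sees consistent characters.
raw = "\u201cHello,\u201d she said."
clean = preprocessing.normalize.quotation_marks(raw)
print(clean)  # "Hello," she said.
```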

@@ -305,15 +304,105 @@ def expand_noun(tok: Token) -> list[Token]:
child
for tc in tok_and_conjuncts
for child in tc.children
# TODO: why doesn't compound import from spacy.symbols?
Collaborator

just wondering why this line was deleted? it's a comment, for me! :)

Author

i thought it was my comment lol

Comment on lines 95 to 98
),
)
],
)

Collaborator

looks like your code editor is making some spurious changes (here and elsewhere) that aren't PEP-compliant / black-enforced. we'll want to fix these before any merge.

)
]
)

Collaborator

Suggested change

…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes