Add nounphrase_consolidate() ? #134

Closed
kbenoit opened this issue Nov 3, 2018 · 1 comment
Comments

@kbenoit
Collaborator

kbenoit commented Nov 3, 2018

PR #119 adds the ability to extract noun phrases. Should we allow that list of noun phrases to be applied to a spacyr_parsed object, so that the tokens of each phrase are consolidated the same way entity_consolidate() does for entities? It seems like it would make sense, and it would provide an alternative to entity_consolidate().

We could call this nounphrase_consolidate(). It would provide this workflow:

  1. sp <- spacy_parse() on input text to return a data.frame, classed as spacyr_parsed.
  2. np <- spacy_extract_nounphrases() on the input text to get the noun phrases.
  3. nounphrase_consolidate(x = sp, nounphrases = np) to return something that looks like a spacyr_parsed object but with each noun phrase sequence combined into a single token, similar to the operation of entity_consolidate().

Or: Is there a (more efficient) way to do this in one step, when calling spacy_parse()?
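
As a rough illustration of what step 3 might do, here is a minimal base R sketch that runs on toy data without spaCy. It is not the spacyr implementation: the function name sketch_nounphrase_consolidate(), the start_id and length columns of the noun phrase table, the "_" concatenator, and the "nounphrase" value written to pos are all assumptions made for this example.

```r
# Hypothetical sketch, NOT the spacyr implementation.
# Assumed inputs:
#   x           data.frame with doc_id, token_id (position within doc), token, pos,
#               sorted by doc_id and token_id
#   nounphrases data.frame with doc_id, start_id (first token_id), length (in tokens)
sketch_nounphrase_consolidate <- function(x, nounphrases, concatenator = "_") {
  x$group <- seq_len(nrow(x))  # every token starts as its own group
  x$is_np <- FALSE
  for (i in seq_len(nrow(nounphrases))) {
    np <- nounphrases[i, ]
    idx <- which(x$doc_id == np$doc_id &
                 x$token_id >= np$start_id &
                 x$token_id < np$start_id + np$length)
    if (length(idx) > 0) {
      x$group[idx] <- idx[1]   # all tokens of the phrase share one group
      x$is_np[idx] <- TRUE
    }
  }
  # collapse each group into a single row
  pieces <- split(x, x$group)
  out <- do.call(rbind, lapply(pieces, function(g) {
    data.frame(
      doc_id = g$doc_id[1],
      token = paste(g$token, collapse = concatenator),
      pos = if (g$is_np[1]) "nounphrase" else g$pos[1],
      stringsAsFactors = FALSE
    )
  }))
  # renumber tokens within each document
  out$token_id <- ave(seq_len(nrow(out)), out$doc_id, FUN = seq_along)
  rownames(out) <- NULL
  out[, c("doc_id", "token_id", "token", "pos")]
}

# Toy stand-ins for the outputs of steps 1 and 2
parsed <- data.frame(
  doc_id = "text1",
  token_id = 1:6,
  token = c("The", "quick", "fox", "saw", "a", "dog"),
  pos = c("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"),
  stringsAsFactors = FALSE
)
nps <- data.frame(
  doc_id = "text1",
  start_id = c(1, 5),
  length = c(3, 2),
  stringsAsFactors = FALSE
)

consolidated <- sketch_nounphrase_consolidate(parsed, nps)
print(consolidated)
```

On the toy input, the six tokens collapse to three rows, with the two noun phrases each becoming one token. A one-step variant inside spacy_parse() would avoid parsing the text twice, but that depends on how the parser exposes phrase boundaries and is not sketched here.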

@amatsuo
Collaborator

amatsuo commented Nov 6, 2018

I have implemented nounphrase_extract() and nounphrase_consolidate(). See
https://github.com/quanteda/spacyr/blob/noun-phrase-v2/tests/misc/test_nounphrase_extraction.html

To mirror the entity_* functions, nounphrase_consolidate() works independently of nounphrase_extract().

To try it out, please install the noun-phrase-v2 branch.

There are a few points to do/consider:

  • tests
  • we need to make sure that once one of entity_consolidate() or nounphrase_consolidate() has been applied, the other cannot be applied. My suggestions are:
    1. add another class to the output (e.g. entity_consolidated), or
    2. drop the field for the other (e.g. when nounphrase_consolidate() is executed, the entity field is removed)
  • I am not sure I have NULL-ified all the relevant variable names used with data.table objects inside the functions
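
Suggestion 1 above could be sketched roughly as follows. The helper mark_consolidated() and the nounphrase_consolidated class name are hypothetical, made up for this illustration; only the entity_consolidated class name comes from the suggestion itself.

```r
# Hypothetical helper illustrating suggestion 1 (an extra class as a guard).
# Not part of spacyr; the names are assumptions for this sketch.
mark_consolidated <- function(x, what = c("entity", "nounphrase")) {
  what <- match.arg(what)
  other <- setdiff(c("entity", "nounphrase"), what)
  # refuse to run if the other consolidation has already been applied
  if (inherits(x, paste0(other, "_consolidated"))) {
    stop("x has already been through ", other, "_consolidate(); ",
         what, "_consolidate() cannot be applied afterwards", call. = FALSE)
  }
  # tag the output so that later calls can detect it
  class(x) <- unique(c(paste0(what, "_consolidated"), class(x)))
  x
}

# Toy object standing in for the output of spacy_parse()
sp <- data.frame(token = c("New", "York"), stringsAsFactors = FALSE)
class(sp) <- c("spacyr_parsed", class(sp))

sp_np <- mark_consolidated(sp, "nounphrase")

# A subsequent entity consolidation is rejected; capture the message
blocked <- tryCatch(
  mark_consolidated(sp_np, "entity"),
  error = function(e) conditionMessage(e)
)
```

Compared with suggestion 2, the class marker keeps the entity field in the output, at the cost of every consolidate function having to check for the marker.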
