list dependency for an apparent appositive #536

Closed
AngledLuffa opened this issue Jun 25, 2024 · 7 comments


@AngledLuffa
Contributor

# sent_id = email-enronsent17_01-0014
# text = 7. Stephen Covey, author, The Seven Habits of Highly Effective People
1       7       7       NUM     LS      NumForm=Digit|NumType=Card      3       discourse       3:discourse     SpaceAfter=No
2       .       .       PUNCT   .       _       1       punct   1:punct _
3       Stephen Stephen PROPN   NNP     Number=Sing     0       root    0:root  _
4       Covey   Covey   PROPN   NNP     Number=Sing     3       flat    3:flat  SpaceAfter=No
5       ,       ,       PUNCT   ,       _       6       punct   6:punct _
6       author  author  NOUN    NN      Number=Sing     3       list    3:list  SpaceAfter=No    <----
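
For anyone who wants to check how often this pattern occurs, here is a minimal sketch that reads the rows quoted above and reports which tokens carry a given relation. It assumes whitespace-separated columns as they appear in this paste (real CoNLL-U files are tab-separated), and the helper names `parse_rows` and `find_deprel` are made up for illustration, not part of any UD tool.

```python
# Rows copied from the excerpt above (only tokens 1-6 were quoted).
# Columns are whitespace-separated here; real CoNLL-U files use tabs.
ROWS = """\
# sent_id = email-enronsent17_01-0014
1 7 7 NUM LS NumForm=Digit|NumType=Card 3 discourse 3:discourse SpaceAfter=No
2 . . PUNCT . _ 1 punct 1:punct _
3 Stephen Stephen PROPN NNP Number=Sing 0 root 0:root _
4 Covey Covey PROPN NNP Number=Sing 3 flat 3:flat SpaceAfter=No
5 , , PUNCT , _ 6 punct 6:punct _
6 author author NOUN NN Number=Sing 3 list 3:list SpaceAfter=No
"""

# The ten standard CoNLL-U columns, lowercased.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_rows(text):
    """Turn token lines into dicts, skipping blank and comment lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        tokens.append(dict(zip(FIELDS, cols[:10])))
    return tokens


def find_deprel(tokens, deprel):
    """Return (dependent form, head form) pairs for the given relation."""
    by_id = {t["id"]: t for t in tokens}
    return [
        (t["form"], by_id[t["head"]]["form"])
        for t in tokens
        if t["deprel"] == deprel and t["head"] in by_id
    ]


tokens = parse_rows(ROWS)
list_edges = find_deprel(tokens, "list")
print(list_edges)  # [('author', 'Stephen')]
```

Running the same check with `"appos"` on this excerpt returns an empty list, which is the situation this issue is asking about.
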
@nschneid
Contributor

The document has a bunch of "NAME, JOB TITLE" combos. I'm not sure if appos works because it requires the two nominals to be reversible.

@nschneid
Contributor

The relevant part of the document:

Here are the top ten most requested eSpeakers.
10. Jack Welch, CEO, General Motors
9. Scott McNeally, CEO, Sun Microsystems
8. Satisfied Enron Customers
7. Stephen Covey, author, The Seven Habits of Highly Effective People
6. Oprah Winfrey, talkshow host

and so on.

I think list is defensible here. These are not really sentences, but structured data with values separated by commas.
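
To make the "structured data" reading concrete, here is a rough sketch that treats each line as a rank followed by comma-separated values. The regular expression and the `split_item` helper are assumptions made up for this illustration; a naive comma split like this would break on a value that itself contains a comma.

```python
import re

# Lines copied from the document excerpt above.
LINES = [
    "10. Jack Welch, CEO, General Motors",
    "9. Scott McNeally, CEO, Sun Microsystems",
    "8. Satisfied Enron Customers",
    "7. Stephen Covey, author, The Seven Habits of Highly Effective People",
    "6. Oprah Winfrey, talkshow host",
]

# A list marker (digits plus a period) followed by the rest of the line.
ITEM = re.compile(r"^(\d+)\.\s+(.*)$")


def split_item(line):
    """Split one list line into its rank and its comma-separated values."""
    match = ITEM.match(line)
    if match is None:
        raise ValueError(f"not a numbered list item: {line!r}")
    rank = int(match.group(1))
    values = [value.strip() for value in match.group(2).split(",")]
    return rank, values


records = [split_item(line) for line in LINES]
for rank, values in records:
    print(rank, values)
```

The number of values varies per line (one for item 8, two for item 6, three for the others), so the rows behave like loosely structured records rather than like a fixed "NAME, TITLE" template.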

@amir-zeldes
Contributor

Why is "7." tokenized apart? It doesn't actually contain a period, right? I thought it was just a list marker as a whole.

@nschneid
Contributor

Punctuation in list item markers is tokenized off in EWT.

@amir-zeldes
Contributor

Hm, not sure if we have the energy to standardize this, but it does seem jarring to me, since it really doesn't mean anything. In ON they are mostly untokenized, though I see there are quite a few exceptions. GUM-style corpora are 100% untokenized as well.
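
The two conventions mentioned so far (marker punctuation split off versus the marker kept whole) can be contrasted with a small sketch. This is only an illustration under assumed rules, not the tokenizer any of these corpora actually use; `tokenize_marker` and its `split_punct` flag are hypothetical names.

```python
import re

# Digits followed by a period, at the start of the line,
# and followed by whitespace or end of line.
MARKER = re.compile(r"^(\d+)(\.)(?=\s|$)")


def tokenize_marker(line, split_punct=True):
    """Return (marker tokens, remaining text) for a numbered list line.

    split_punct=True  splits the period off the number.
    split_punct=False keeps the marker as a single token.
    """
    match = MARKER.match(line)
    if match is None:
        return [], line
    rest = line[match.end():].lstrip()
    if split_punct:
        return [match.group(1), match.group(2)], rest
    return [match.group(0)], rest


line = "7. Stephen Covey, author"
print(tokenize_marker(line, split_punct=True))
print(tokenize_marker(line, split_punct=False))
```

With the split convention the marker yields two tokens (`7` and `.`), as in the annotation quoted at the top of this issue; with the other it stays one token (`7.`).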

@nschneid
Contributor

Moved the tokenization discussion to a new issue: #543

The question for this issue is whether we need to change list to appos. I don't see a clear justification for that.

@amir-zeldes
Contributor

> The question for this issue is whether we need to change list to appos. I don't see a clear justification for that.

Oh, certainly, wasn't trying to argue about that, I just noticed the LS thing. Thanks for opening the other issue!
