tokenization of list item markers #543

Open
nschneid opened this issue Jul 15, 2024 · 1 comment

nschneid commented Jul 15, 2024

The tokenization of markers like "3." and "(a)" is not consistent across English treebanks.

I think we've agreed to leave it alone (previous discussion), but for posterity, here is some info that I collected back when we were discussing the new policy for list item markers:

  • OntoNotes Tokenization: In newer OntoNotes documents, LS generally includes periods and hyphens after the letter/number (“1.”, “1-”). But in OntoNotes-WSJ (and PTB Revised), the LS is strictly the letter or number—no associated punctuation characters. Parentheses are always tokenized separately.
  • In UD-EWT, LS tokens are treated as NUM (even if they are letters), with all punctuation characters tokenized separately. This seems to follow the original Penn EWT & WSJ tokenization. I am reluctant to mess with UD-EWT tokenization because changing it would produce misalignments with the Penn trees. (Also, “(” as the superficial head of a goeswith would be awkward.)
  • In UD-GUM and GENTLE, LS tokens are always kept as a single token, including parentheses and any other characters.
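
For illustration only, the three conventions summarized above can be sketched in Python. This is a minimal sketch, not the tokenizer used by any of these treebanks: the `MARKER` pattern, the function `split_marker`, and the style names `split_all`, `attach_period`, and `single_token` are all hypothetical, and the pattern only covers simple markers such as "3.", "1-", "1)", and "(a)".

```python
import re

# Hypothetical pattern for simple list item markers: an optional "(",
# a number or single letter, and an optional trailing ".", "-" or ")".
MARKER = re.compile(r"^(\()?([0-9]+|[A-Za-z])([.\-)])?$")


def split_marker(marker: str, style: str) -> list[str]:
    """Tokenize one list item marker under one of three assumed styles.

    split_all     -- every punctuation character is its own token
                     (as described for OntoNotes-WSJ, PTB Revised, UD-EWT)
    attach_period -- "." and "-" stay on the label, parentheses are split
                     (as described for newer OntoNotes documents)
    single_token  -- the whole marker is one token
                     (as described for UD-GUM and GENTLE)
    """
    match = MARKER.match(marker)
    if match is None:
        raise ValueError(f"not a simple list item marker: {marker!r}")
    open_paren, label, trailer = match.groups()

    if style == "single_token":
        return [marker]
    if style not in ("split_all", "attach_period"):
        raise ValueError(f"unknown style: {style!r}")

    tokens = []
    if open_paren:
        tokens.append(open_paren)
    if style == "attach_period" and trailer in (".", "-"):
        # Period or hyphen stays attached to the letter/number.
        tokens.append(label + trailer)
    else:
        tokens.append(label)
        if trailer:
            tokens.append(trailer)
    return tokens


for style in ("split_all", "attach_period", "single_token"):
    print(style, [split_marker(m, style) for m in ("3.", "1-", "(a)", "1)")])
```

Under these assumptions, "3." is two tokens only in the `split_all` style, while "(a)" is three tokens in both styles that split parentheses.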

amir-zeldes commented

Thanks for the synopsis, that's helpful!

I'm not very attached to the X upos, though I do find NUM for things like "A" a bit strange. Regarding tokenization, it seems odd to split up the existing tokens only to connect them again with a trivial relation, and I don't think it makes sense to say that "1." contains a separate period token; for me "1" and "1." are basically interchangeable. I will add that the "1." cases are much more numerous in ON (OntoNotes) than the two-token cases, so presumably tokenizers trained on ON favor not splitting. By contrast, splitting brackets is indeed the preference in ON, which I don't love because you can get unmatched brackets for things like "1)"... It also doesn't gel well with the strong ON preference not to split periods IMO.
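
To make the unmatched-bracket concern concrete, here is a small sketch of a check over a token list. The function name `unmatched_brackets` and the example sentences are hypothetical; the check only looks at tokens that are exactly "(" or ")".

```python
def unmatched_brackets(tokens: list[str]) -> list[int]:
    """Return the indices of "(" and ")" tokens that have no partner."""
    open_stack = []  # indices of "(" tokens still waiting for a ")"
    unmatched = []   # indices of ")" tokens with no preceding "("
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_stack.append(i)
        elif tok == ")":
            if open_stack:
                open_stack.pop()
            else:
                unmatched.append(i)
    # Any "(" left on the stack was never closed.
    return sorted(unmatched + open_stack)


# Splitting the bracket off "1)" leaves a lone ")" token.
print(unmatched_brackets(["1", ")", "Open", "the", "file"]))
# Keeping "1)" as one token leaves no bracket tokens at all.
print(unmatched_brackets(["1)", "Open", "the", "file"]))
# A fully parenthesized marker stays balanced even when split.
print(unmatched_brackets(["(", "a", ")", "Open", "the", "file"]))
```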

In any case, even if we don't change any of the English datasets, it might be nice to get a statement of principle from UD about what the core group's recommendation is for new languages.
