tokenization of list item markers #543

nschneid · 2024-07-15T18:19:36Z

The tokenization of markers like "3." and "(a)" is not consistent across English treebanks.

I think we've agreed to leave it alone (previous discussion), but for posterity, here is some info that I collected back when we were discussing the new policy for list item markers:

- OntoNotes tokenization: In newer OntoNotes documents, LS generally includes periods and hyphens after the letter/number (“1.”, “1-”). But in OntoNotes-WSJ (and PTB Revised), the LS is strictly the letter or number, with no associated punctuation characters. Parentheses are always tokenized separately.
- In UD-EWT, LS tokens are treated as NUM (even if they are letters), with all punctuation characters tokenized separately. This seems to follow the original Penn EWT & WSJ tokenization. I am reluctant to mess with UD-EWT tokenization because it would produce misalignments with the Penn trees. (Also, “(” as the superficial head of a goeswith would be awkward.)
- In UD-GUM and GENTLE, LS tokens are always tokenized as a single token, including parentheses and any other characters.
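
To make the contrast concrete, here is a rough Python sketch of the three conventions as summarized above. It is only an illustration, not any treebank's actual tokenizer: the function name `tokenize_marker`, the scheme labels (`"attach"`, `"split"`, `"single"`), and the marker regex are all made up for this example, and real data will contain marker shapes the regex does not cover.

```python
import re

# Hypothetical pattern for a list item marker: an optional "(", a number or a
# single letter, then an optional trailing ".", "-", or ")".
MARKER = re.compile(r"^(\()?([0-9]+|[A-Za-z])([.\-)])?$")


def tokenize_marker(marker: str, scheme: str) -> list[str]:
    """Split a list item marker under one of three illustrative schemes.

    "attach": period/hyphen stays on the marker, parentheses split off
              (roughly the newer OntoNotes pattern described above).
    "split":  all punctuation split off
              (roughly OntoNotes-WSJ / PTB Revised / UD-EWT).
    "single": the whole marker is one token (roughly UD-GUM / GENTLE).
    """
    if scheme not in ("attach", "split", "single"):
        raise ValueError(f"unknown scheme: {scheme!r}")
    m = MARKER.match(marker)
    if m is None:
        raise ValueError(f"not a recognized list marker: {marker!r}")
    open_paren, core, trailer = m.groups()

    if scheme == "single":
        return [marker]

    tokens = []
    if open_paren:
        tokens.append(open_paren)
    if scheme == "attach" and trailer in (".", "-"):
        # Keep the period or hyphen attached to the letter/number.
        tokens.append(core + trailer)
    else:
        tokens.append(core)
        if trailer:
            tokens.append(trailer)
    return tokens


for scheme in ("attach", "split", "single"):
    print(scheme, [tokenize_marker(m, scheme) for m in ("3.", "(a)", "1)")])
```

Note that under both splitting schemes "1)" yields a closing parenthesis with no opening partner.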

amir-zeldes · 2024-07-15T18:33:33Z

Thanks for the synopsis, that's helpful!

I'm not very attached to the X upos, though I do find NUM for things like "A" a bit strange. Regarding tokenization, it seems odd to split up the existing tokens only to connect them again with a trivial relation, and I don't think it makes sense to say that "1." contains a separate period token; for me, "1" and "1." are basically interchangeable. I will add that the "1." cases are much more numerous in ON than the two-token cases, so presumably tokenizers trained on ON favor not splitting. By contrast, splitting brackets is indeed the preference in ON, which I don't love because you can get unmatched brackets for things like "1)"... It also doesn't gel well with the strong ON preference not to split periods IMO.

In any case, even if we don't change any of the English datasets, it might be nice to get a statement of principle from UD about what is the core group's recommendation for new languages.

nschneid mentioned this issue Jul 15, 2024

list dependency for an apparent appositive #536

Closed
