NumType for a few random cases #462
Comments
This looks to me like more evidence that maybe we don't need |
@amir-zeldes @rhdunn thoughts on this? I believe GUM has |
These are not all strictly cardinal numbers, which are defined as the counting/natural numbers [1] [2]. The ordinal numbers are equivalent for positions/ordering. We have 3 general types/groups here:
Note also that one of the treebanks -- I can't recall which -- has a case of It would be helpful if the data could differentiate these types of number. They are separate morphological features. For example, the cardinals would have a lemma that removes the dots and commas from their form, and the fractional and section numbers don't as in the fractional case the dot is important. With the section case, I've also been meaning to raise an issue around the grouping of single letter abbreviations, like in [1] https://www.merriam-webster.com/dictionary/cardinal%20number |
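A minimal sketch of the lemma rule described above, assuming a purely shape-based heuristic (the function name and patterns are hypothetical, not part of any UD tooling): grouping separators are stripped only when the form looks like a cardinal with thousands groups, and everything else is left untouched. Dot-grouped forms need at least two groups here, because a single group such as "5.100" cannot be told apart from a decimal by shape alone.

```python
import re

# Hypothetical patterns, for illustration only.
# Comma grouping: 1-3 digits, then one or more ",ddd" groups (e.g. "1,000").
_COMMA_GROUPED = re.compile(r"\d{1,3}(?:,\d{3})+")
# Dot grouping: 1-3 digits, then at least two ".ddd" groups (e.g. "10.000.000").
_DOT_GROUPED = re.compile(r"\d{1,3}(?:\.\d{3}){2,}")


def cardinal_lemma(form: str) -> str:
    """Strip grouping separators from cardinal-looking forms; keep others as-is."""
    if _COMMA_GROUPED.fullmatch(form) or _DOT_GROUPED.fullmatch(form):
        return re.sub(r"[.,]", "", form)
    # Fractional values, section numbers, and phone numbers keep their dots.
    return form
```

With this heuristic, "10.000.000" and "1,000" lose their separators, while "5.1", "5.1.1", and "713.853.3102" are returned unchanged.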
https://github.com/UniversalDependencies/docs/issues would be a good place for discussion of the inventory of Tokenization tends to be language-specific. In general, for English, I would expect to separate things that are either (1) sentence-organizing punctuation (commas, quotation marks, etc.), (2) clitics, (3) hyphenated linguistic words, or (4) units often/usually written with a space between them. (Or other things where there's a well-established history of separating them in tokenizers, such as currency symbols with numbers.) Of course there will be difficult cases, but in general I do not see tokenization or UD syntax as a way to express the full "grammar" of subsystems like numerical dates or numerical section-subsection notation. |
It's a long-standing UD feature, so I would keep it. I don't think it's very difficult to recognize in practice. Even done fully automatically it would have fewer errors than many other things we have going on. |
@amir-zeldes are you saying "713.853.3102" as a telephone number and "5.1" as a subsection number should be |
Yes, that sounds right to me. If UD has a "Frac" number type, then it should apply only to things that are actually fractions. Section numbers can have even more hierarchy, and I think we'd all agree that "5.1.1" is not a fraction. It's a coincidence that section numbers can be homographs of fractions, but there are all sorts of homographs out there that have to be tagged differently, and we still stick to the same basic distinctions the tagset makes, so I don't see why this would be different. |
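One way to act on this distinction is to flag candidates by surface shape before assigning a number type. The sketch below is only an illustration: the function name and the three labels are hypothetical, not UD feature values. It encodes the point that a form with two or more dots (such as "5.1.1") cannot be a fraction, while a form with a single dot (such as "5.1") is a homograph that needs context.

```python
import re

# Digits separated by one or more dots, e.g. "5.1", "5.1.1", "713.853.3102".
_DOTTED = re.compile(r"\d+(?:\.\d+)+")


def dotted_shape(form: str) -> str:
    """Return a hypothetical review label based only on the token's shape."""
    if not _DOTTED.fullmatch(form):
        return "not-dotted"
    parts = form.split(".")
    # Two parts could be a decimal or a section number; more cannot be a fraction.
    return "needs-context" if len(parts) == 2 else "not-a-fraction"
```

Tokens labelled "needs-context" would still have to be resolved by a human or by the surrounding text, since shape alone cannot separate "5.1" as a value from "5.1" as a subsection.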
This is written with European-style "." between thousands, which is different from the rest of EWT. Generally commas are removed in the lemmas, so I suppose this should have a lemma of 10000000.
Phone numbers are unusual:
There's also section numbers...
whereas the change I just submitted to update NumType=Frac for a bunch of numbers changed section numbers with 2 numbers to Frac, so I think there's some room for improvement there.