Specification of the format for the values of the @unit attribute #39
Comments
We need to use the Then if a unit is missing, it currently has to be declared in |
We could actually verify the |
Regarding the Btw we would also need to update |
Actually we need to clarify/remember what we wanted to do with this attribute's value. I think it was introduced to disambiguate the annotated unit, so normally the value here should correspond to the one used in the text - so 'minute' makes sense when minute is annotated, and We could then use this annotation for the unit parser, its training and evaluation I think. |
I agree with you on the usage. However, if we use the value that is in the text, what is the point of having the attribute at all? If we want to use this for the unit parser, then it is better to have an already (semi)normalised version:
What do you think? |
The attribute is used to build training data, but its value is not used for the moment. I would agree with you: the value could be not really a "normalized" form but a "valid" form of the unit as it appears in the text, which might be degraded because it comes from the PDF. Some examples:
(these cleanings/"transliterations" are not always obvious) The advantage is that we stick to what appears in the text, and the annotator does not need to use a reference list of units and normal forms. We could also choose a more advanced/normalized form (like put I will look again at the unit parser, which most likely requires a new iteration, and see what kind of information could be most needed. |
To know what form a @unit attribute's value must take (for example unit="min" or unit="minute"?), we can use this page: http://cdsarc.u-strasbg.fr/cgi-bin/Unit?%3f
Is that ok?
Are there other sources we could use?
edit: the page mentioned above is not always the answer, since there are sometimes several symbols for a unit (e.g. 'year' may be 'a' or 'yr', and in grobid-quantities it comes out as unit="year").
How should we proceed?
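To make the "several symbols for one unit" problem concrete, here is a minimal sketch of one possible approach: a lookup table that maps each accepted surface form to a single value for the attribute. Everything in it is an assumption for illustration only. The names UNIT_SYNONYMS and canonical_unit are hypothetical and are not part of grobid-quantities, and the choice of 'year' and 'minute' as the retained values is just one of the options discussed in this thread.

```python
# Hypothetical table: the keys are the values we would write in the unit
# attribute, the sets are surface forms accepted for that unit.
# None of these names exist in grobid-quantities.
UNIT_SYNONYMS = {
    "year": {"year", "years", "yr", "a"},
    "minute": {"minute", "minutes", "min"},
}

# Invert the table once so that each surface form points to its retained value.
CANONICAL_BY_FORM = {
    form: canonical
    for canonical, forms in UNIT_SYNONYMS.items()
    for form in forms
}


def canonical_unit(raw):
    """Return the retained attribute value for a surface form, or None if unknown.

    Matching is case-sensitive on purpose: unit symbols are case-sensitive
    ('a' can stand for year, 'A' is ampere).
    """
    key = " ".join(raw.split())  # trim and collapse stray whitespace
    return CANONICAL_BY_FORM.get(key)


if __name__ == "__main__":
    for form in ["yr", " a ", "min", "A"]:
        print(repr(form), "->", canonical_unit(form))
```

With such a table the annotator could keep writing what appears in the text, and the mapping to one form would happen afterwards. An unknown form returns None, so missing units can be flagged rather than guessed.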