Typo=Yes should not apply just because of an extra space #39

Open
nschneid opened this issue Nov 4, 2021 · 9 comments

@nschneid

nschneid commented Nov 4, 2021

I believe 4 of these 5 instances are incorrect because the extra space is already accounted for by goeswith.

@amir-zeldes
Contributor

Is that a universal guideline? Personally I would consider extra spaces to be a kind of Typo, and it also explains why the tokens around a goeswith are mangled. If I were searching for all typos in a corpus I think I'd want to find these too. Is there anything that speaks against having goeswith as well as Typo for these tokens?

@nschneid
Author

nschneid commented Nov 4, 2021

See the guidelines on typos. I think the function of Typo=Yes is to signal that some part of the tokenized wordform contains incorrect, incorrectly ordered, or missing non-space characters. In English we have things like "any were" for "anywhere", so it applies to the second "word" but not the first.

@nschneid
Author

nschneid commented Nov 4, 2021

> If I were searching for all typos in a corpus I think I'd want to find these too.

Missing spaces are also typos in a broader sense but not flagged with Typo=Yes (Which word would that be on? Both? What if it is a missing space before or after punctuation? etc.). The convention is to use CorrectSpaceAfter=Yes|SpaceAfter=No.
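As a rough illustration of the spacing convention just described, the sketch below rebuilds a sentence both as written and with the missing space restored. It assumes tokens are modelled as (form, MISC dict) pairs; the function name `render` and the sample sentence are made up for this example and are not part of any UD tool.

```python
# Hypothetical helper: rebuild text from (form, MISC) pairs.
# SpaceAfter=No means no space followed the token in the source text;
# CorrectSpaceAfter=Yes means a space should have been there.
def render(tokens, corrected=False):
    out = []
    for form, misc in tokens:
        out.append(form)
        no_space = misc.get("SpaceAfter") == "No"
        if corrected and misc.get("CorrectSpaceAfter") == "Yes":
            no_space = False  # restore the space the writer left out
        if not no_space:
            out.append(" ")
    return "".join(out).rstrip()

# Invented sample: "Thanksfor coming." with a missing space after "Thanks".
tokens = [
    ("Thanks", {"SpaceAfter": "No", "CorrectSpaceAfter": "Yes"}),
    ("for", {}),
    ("coming", {"SpaceAfter": "No"}),
    (".", {}),
]

print(render(tokens))                  # Thanksfor coming.
print(render(tokens, corrected=True))  # Thanks for coming.
```

Note that neither token needs Typo=Yes here: the spacing error is recoverable from MISC alone.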

@amir-zeldes
Contributor

OK, it's not my intuition but I can live with it either way.

@amir-zeldes
Contributor

So, looking at this more closely, would you say in this example:

  • based on it is price (for "its")

The first token should still carry Typo=Yes, no? Or do we assume that "goeswith" covers the statement "this is broken"? If not, then we can't just tell the validator to disallow Typo on goeswith.

@nschneid
Author

nschneid commented Nov 5, 2021

I would put CorrectForm=s|Typo=Yes on the second token because its form is "is" rather than "s". In cases where the two forms connected by goeswith concatenate to the correctly spelled word, no Typo or CorrectForm feature is needed.

@amir-zeldes
Contributor

I see what you're saying, but I thought in goeswith, the second token basically doesn't exist in terms of features, so I would have expected token1 to carry the Typo: goeswith is saying "lose the space", and the first token then carries everything the merged token has to say, including "'itis' is a typo for 'its'"

@nschneid
Author

nschneid commented Nov 5, 2021

What the docs say:

> The head should also bear the part-of-speech tag and morphological annotation of the entire word. It is not necessary to add the Typo feature and CorrectForm in MISC, unless there is a “normal” typo too, i.e. if simple concatenation of the parts does not yield the correct form. Example:

I guess it leaves ambiguous where the Typo/CorrectForm should be if there is a normal typo too.
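The rule quoted from the docs reduces to a single comparison, sketched below. The function name `needs_typo_feature` is hypothetical, and the "with out" case is an invented example of a split that concatenates cleanly. The sketch only says whether a Typo/CorrectForm annotation is called for; it does not settle which token should carry it.

```python
# Hypothetical check: a goeswith group needs Typo/CorrectForm only when
# simply concatenating its parts does not give the intended word.
def needs_typo_feature(parts, correct_form):
    return "".join(parts) != correct_form

print(needs_typo_feature(["any", "were"], "anywhere"))  # True  ("anywere")
print(needs_typo_feature(["it", "is"], "its"))          # True  ("itis")
print(needs_typo_feature(["with", "out"], "without"))   # False (invented example)
```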

Maybe I should draft a document spelling out the formal constraints at play and pseudocode for producing a canonical representation without typos/misuse of spaces/repair.
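One way such pseudocode could start is sketched below for a single goeswith group. It is an assumption-laden illustration, not the proposed document: parts are modelled as (form, CorrectForm or None) pairs, the function name `canonical_word` and the `typo_on` switch are invented, and the two branches simply encode the two readings discussed in this thread.

```python
# Hypothetical sketch: recover the intended word from a goeswith group.
# parts = [(form, correct_form_or_None), ...], head first.
def canonical_word(parts, typo_on="part"):
    if typo_on == "part":
        # Per-token reading: CorrectForm replaces that token's own form,
        # then the pieces are concatenated.
        return "".join(correct or form for form, correct in parts)
    # Head reading: CorrectForm on the first token names the whole merged
    # word; without it, plain concatenation is already correct.
    _, head_correct = parts[0]
    return head_correct or "".join(form for form, _ in parts)

# "it is" for "its" under each reading:
print(canonical_word([("it", None), ("is", "s")], typo_on="part"))    # its
print(canonical_word([("it", "its"), ("is", None)], typo_on="head"))  # its
```

Both readings recover the same word; they differ only in where the annotation sits and therefore in what a validator would have to allow on goeswith tokens.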

@amir-zeldes
Contributor

Forgot to answer: yes, that would be great! Upon thinking about it, I think I'd prefer for typo to be on the first token in goeswith, since the second/third/subsequent parts of a broken token don't really have an 'expected' spelling IMO - it's the merged token which has a standard spelling, and that is what is being deviated from (plus it's easier to say anything with deprel goeswith can't have Typo, or any other meaningful FEAT really)
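If the "no Typo on goeswith" constraint were adopted, a check for it could look roughly like the sketch below. This is not the UD validator; the function name `goeswith_typo_violations` and the sample rows are invented, and the only assumption about the data is the standard ten-column, tab-separated CoNLL-U layout (FEATS in column 6, DEPREL in column 8).

```python
# Hypothetical check: report IDs of tokens that are attached via goeswith
# but still carry Typo=Yes in FEATS.
def goeswith_typo_violations(conllu):
    bad = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        feats = cols[5].split("|")
        if cols[7] == "goeswith" and "Typo=Yes" in feats:
            bad.append(cols[0])
    return bad

# Invented rows for "it is price" (for "its price"), deliberately
# annotated with Typo=Yes on the goeswith token to trigger the check.
sample = "\n".join([
    "# text = it is price",
    "1\tit\tits\tPRON\t_\tTypo=Yes\t3\tnmod:poss\t_\tCorrectForm=its",
    "2\tis\t_\tX\t_\tTypo=Yes\t1\tgoeswith\t_\t_",
    "3\tprice\tprice\tNOUN\t_\tNumber=Sing\t0\troot\t_\t_",
])

print(goeswith_typo_violations(sample))  # ['2']
```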
