This repository has been archived by the owner on Nov 16, 2020. It is now read-only.

ATranscriptionWidget & Virtual Keyboard: warn on input of surrogate chars #277

Open
kahlep opened this issue Apr 15, 2019 · 1 comment
@kahlep
Contributor

kahlep commented Apr 15, 2019

CITlab HTR(+) training drops all lines that contain Unicode surrogates (see Transkribus/TranskribusAppServerModules#59). For each dropped line, a JobError is stored and shown in the job overview.

Users should be warned about this restriction when such a character is entered in the transcription widget (via copy-paste?) and possibly via the virtual keyboard (if it allows mapping surrogate chars).
The check can be done with Character.isSurrogate(char ch).
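
A minimal sketch of such a check, assuming the widget can hand over the entered or pasted text as a String. The class and method names (SurrogateCheck, surrogateIndices, containsSurrogates) are made up for illustration and are not part of Transkribus; only Character.isSurrogate(char) is the real JDK call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing Transkribus class.
final class SurrogateCheck {

    /** Returns the UTF-16 indices of all surrogate code units in the text. */
    static List<Integer> surrogateIndices(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // True for both high and low surrogates (U+D800..U+DFFF).
            if (Character.isSurrogate(text.charAt(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** True if the widget should warn about this input. */
    static boolean containsSurrogates(String text) {
        return !surrogateIndices(text).isEmpty();
    }

    public static void main(String[] args) {
        // U+1D504 is stored in a Java String as the surrogate pair D835 DD04.
        String pasted = "abc\uD835\uDD04def";
        System.out.println(surrogateIndices(pasted)); // prints [3, 4]
    }
}
```

Returning the indices rather than a plain boolean would let the widget highlight the offending characters in the warning.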

@kahlep kahlep added the feature label Apr 15, 2019
@kahlep
Contributor Author

kahlep commented Jul 11, 2019

This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.
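
With plain JDK means, a check covering both categories could look roughly like the sketch below. CategoryCheck and isProblematic are hypothetical names. Note that Character.getType follows the Unicode version bundled with the running JDK, so the result for "unassigned" may differ from what the tokenizer decides.

```java
// Hypothetical helper, not an existing Transkribus class.
final class CategoryCheck {

    /** True if the char is a surrogate or unassigned according to the JDK's Unicode tables. */
    static boolean isProblematic(char c) {
        int type = Character.getType(c);
        return type == Character.SURROGATE || type == Character.UNASSIGNED;
    }

    /** True if any UTF-16 code unit of the text is problematic. */
    static boolean containsProblematicChars(String text) {
        // chars() iterates code units, so each half of a surrogate pair is inspected on its own.
        return text.chars().anyMatch(c -> isProblematic((char) c));
    }
}
```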

CITlabTokenizer 1.0 used the categorization built into Java, e.g. Character.isSurrogate(char ch), which implements the Unicode 6.2 specification.
CITlabTokenizer 1.1.0 relies on an internal lookup from a text file to support Unicode 12.1, where character assignments were added but also changed in some cases.

Tokenizer 1.1.0 is now included as a dependency of TranskribusCore and can be used for the checks described initially: de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c) throws a RuntimeException on illegal chars.
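
A rough sketch of how the exception could be turned into a warning list. The tokenizer is not on the classpath here, so CharCategorizer is a hypothetical stand-in for getCategory(char c); its return type, and whether the real method is static or needs an instance, are assumptions. In TranskribusCore one would pass the real method instead.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not an existing Transkribus class.
final class IllegalCharFinder {

    /** Stand-in for CategorizerWordMergeGroups::getCategory(char c); the return type is assumed. */
    interface CharCategorizer {
        Object getCategory(char c);
    }

    /** Collects every char the categorizer rejects, in order of first occurrence. */
    static Set<Character> findIllegalChars(String text, CharCategorizer categorizer) {
        Set<Character> illegal = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            try {
                categorizer.getCategory(c);
            } catch (RuntimeException e) {
                // Per the issue, a RuntimeException marks the char as illegal.
                illegal.add(c);
            }
        }
        return illegal;
    }
}
```

Collecting the rejected chars instead of failing on the first one would let the warning name all of them at once.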
