This repository has been archived by the owner on Nov 16, 2020. It is now read-only.
CITlab HTR(+) training will drop all lines that contain Unicode surrogates (see Transkribus/TranskribusAppServerModules#59). For each dropped line a JobError is stored and shown in the job overview.
Users should be warned about this restriction when such a character is entered in the transcription widget (copy-paste?) and possibly via the virtual keyboard (if it allows surrogate chars to be mapped). The check can be done with Character.isSurrogate(char ch).
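A minimal sketch of such a check, using only the standard `Character.isSurrogate(char)` call mentioned above. The class and method names (`SurrogateCheck`, `indexOfSurrogate`) are made up for illustration and are not part of Transkribus; where the check would be hooked into the transcription widget is left open.

```java
final class SurrogateCheck {

    /**
     * Returns the index of the first surrogate char in the given line,
     * or -1 if the line contains none.
     */
    static int indexOfSurrogate(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (Character.isSurrogate(line.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String plain = "Anno domini 1523";
        // U+1F600 is stored as the surrogate pair D83D DE00 in a Java String
        String pasted = "Anno \uD83D\uDE00 1523";
        System.out.println(indexOfSurrogate(plain));  // -1
        System.out.println(indexOfSurrogate(pasted)); // 5
    }
}
```

Returning the index rather than a boolean would let a UI point the user at the offending position in the line.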
This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.
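For illustration, a check that also covers the "unassigned" category could look roughly like the sketch below. It uses the JDK's own `Character.getType(int)`, so which code points count as unassigned depends on the Unicode version of the running JDK and may differ from what the tokenizer accepts. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointCheck {

    /**
     * Collects code points that are likely to be rejected: unassigned
     * code points, unpaired surrogates, and supplementary code points
     * (which a Java String stores as a surrogate pair).
     */
    static List<Integer> findSuspectCodePoints(String line) {
        List<Integer> suspects = new ArrayList<>();
        line.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean unassigned = type == Character.UNASSIGNED;
            boolean loneSurrogate = type == Character.SURROGATE;
            boolean surrogatePair = Character.isSupplementaryCodePoint(cp);
            if (unassigned || loneSurrogate || surrogatePair) {
                suspects.add(cp);
            }
        });
        return suspects;
    }
}
```

Iterating over code points instead of chars keeps a valid surrogate pair together, so it is reported once as a single supplementary character.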
The CITlabTokenizer 1.0 used the categorization built into Java, e.g. Character.isSurrogate(char ch), which implements the Unicode 6.2 specification.
CITlabTokenizer 1.1.0 relies on an internal lookup from a text file to support Unicode 12.1, where character assignments were added and, in some cases, changed.
The tokenizer 1.1.0 is now included as a dependency of TranskribusCore and can be used to perform the checks described initially: de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c) will throw a RuntimeException on illegal chars.
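A rough sketch of how that behaviour could drive a validation step. Only the method name `getCategory(char c)` and the thrown `RuntimeException` are taken from the comment above. The `CharCategorizer` interface is a hypothetical stand-in so the example compiles without the tokenizer on the classpath, and its `Object` return type is an assumption; how a `CategorizerWordMergeGroups` instance is obtained is not shown here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class IllegalCharCollector {

    /** Hypothetical stand-in for the tokenizer's categorizer (assumption). */
    interface CharCategorizer {
        Object getCategory(char c);
    }

    /**
     * Returns the distinct chars of the line for which the categorizer
     * throws a RuntimeException, in order of first occurrence.
     */
    static Set<Character> illegalChars(String line, CharCategorizer categorizer) {
        Set<Character> illegal = new LinkedHashSet<>();
        for (char c : line.toCharArray()) {
            try {
                categorizer.getCategory(c);
            } catch (RuntimeException e) {
                // The categorizer rejected this char; remember it for the warning
                illegal.add(c);
            }
        }
        return illegal;
    }
}
```

With the real dependency available, the stand-in could presumably be replaced by a method reference to the tokenizer's `getCategory`, provided its signature matches.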