ATranscriptionWidget & Virtual Keyboard: warn on input of surrogate chars #277

kahlep · 2019-04-15T07:36:06Z

CITlab HTR(+) training will drop all lines that contain Unicode surrogates (see Transkribus/TranskribusAppServerModules#59). For each dropped line, a JobError is stored and shown in the job overview.

The user should be warned about this restriction when such a character is entered in the transcription widget (copy-paste?) and possibly via the virtual keyboard (if it allows mapping surrogate chars).
~~The check can be done with Character.isSurrogate(char ch).~~
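For illustration only, here is a minimal sketch of what the struck-out suggestion would look like, using nothing but the JDK's `Character.isSurrogate(char)`. The class and method names (`SurrogateCheck`, `firstSurrogateIndex`) are hypothetical and not part of Transkribus; the later comment in this thread explains why the tokenizer's own categorizer is preferred for the actual check.

```java
// Hypothetical helper, not part of Transkribus: scans text for UTF-16 surrogate code units.
final class SurrogateCheck {

    private SurrogateCheck() {
    }

    /**
     * Returns the index of the first surrogate code unit (U+D800..U+DFFF)
     * in the given text, or -1 if the text contains none.
     */
    static int firstSurrogateIndex(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isSurrogate(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    /** Convenience wrapper: true if a warning should be shown for this text. */
    static boolean containsSurrogate(String text) {
        return firstSurrogateIndex(text) >= 0;
    }
}
```

A widget could call `SurrogateCheck.containsSurrogate(pastedText)` after a paste or key event and show a warning when it returns `true`; where and how the warning is displayed is left open here.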


kahlep · 2019-07-11T11:23:28Z

This issue also affects other Unicode categories besides surrogates, e.g. the "unassigned" category.

CITlabTokenizer 1.0 used the categorization built into Java, e.g. Character.isSurrogate(char ch), which implements the Unicode 6.2 specification.
CITlabTokenizer 1.1.0 relies on an internal lookup from a text file to support Unicode 12.1, in which character assignments were added and, in some cases, changed.

Tokenizer 1.1.0 is now included as a dependency of TranskribusCore and can be used to do the checks described initially: de.uros.citlab.tokenizer.categorizer.CategorizerWordMergeGroups::getCategory(char c) will throw a RuntimeException on illegal chars.
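A rough sketch of how such a check might be wired up follows. Only the "throws a RuntimeException on illegal chars" behaviour comes from the comment above; the exact way `getCategory(char c)` is invoked (static or instance, return type) is not confirmed here, so the code hides it behind a hypothetical `CharCategorizer` interface. `IllegalCharScan`, `illegalIndices` and `JDK_STAND_IN` are likewise made-up names, and the stand-in uses the JDK's own Unicode data rather than the tokenizer's Unicode 12.1 table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Transkribus or CITlabTokenizer.
final class IllegalCharScan {

    private IllegalCharScan() {
    }

    /**
     * Hypothetical adapter around a categorizer. Implementations are assumed
     * to throw a RuntimeException when the char is not allowed.
     */
    interface CharCategorizer {
        void categorize(char c);
    }

    /** Collects the indices of all chars that the categorizer rejects. */
    static List<Integer> illegalIndices(String line, CharCategorizer categorizer) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            try {
                categorizer.categorize(line.charAt(i));
            } catch (RuntimeException e) {
                indices.add(i);
            }
        }
        return indices;
    }

    /**
     * Stand-in for demonstration: rejects surrogate and unassigned chars
     * according to the running JDK's Unicode version, which may differ
     * from the table the tokenizer uses.
     */
    static final CharCategorizer JDK_STAND_IN = c -> {
        int type = Character.getType(c);
        if (type == Character.SURROGATE || type == Character.UNASSIGNED) {
            throw new IllegalArgumentException(
                    "illegal char U+" + Integer.toHexString(c).toUpperCase());
        }
    };
}
```

In the client, the stand-in would be replaced by an adapter that forwards each char to the tokenizer's `getCategory`; a non-empty result from `illegalIndices` would then trigger the warning and could also be used to highlight the offending positions.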

kahlep added the feature label Apr 15, 2019
