Tesseract language training Windows GUI v5.12 for Tesseract 4+. Both Windows executable and source AutoHotKey script files are provided.
Please don't forget that this fork is a Windows GUI implementation developed by only one developer (so far). The Tesseract and Tesstrain projects for which this GUI was created are much larger open source projects.
The GUI executable is portable. You can copy the tesstrain_gui.exe file to any directory and execute it.
You will need version 4 or newer of the Tesseract executables (including the training tools and matching Leptonica bindings). I recommend downloading the executables from the Tesseract at UB Mannheim repository.
You will also need a copy of 'traineddata', which you can find for example on the official Tesseract website. Make sure you download the model marked as 'best' if you want to use it as a 'Start model' for your new model (the 'fast' one cannot be used as a 'Start model').
If you prefer, you can also build and install binaries on your own. More information can be found in the Tesseract User Manual.
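As a quick sanity check before the first run, a small script like the following can confirm that the Tesseract binary and the LSTM training tools are reachable on PATH. This is only a sketch and not part of the GUI; the tool names (tesseract, lstmtraining, combine_tessdata) are assumed from a standard Tesseract 4+ install with training tools included.

```python
# Minimal sanity check (not part of Tesstrain GUI): verify that Tesseract
# and its training tools can be found on PATH before starting a training run.
import shutil
import subprocess

# Tool names assumed from a standard Tesseract 4+ install with training tools.
REQUIRED_TOOLS = ["tesseract", "lstmtraining", "combine_tessdata"]

def check_tools() -> None:
    missing = [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]
    if missing:
        raise SystemExit(f"Missing executables on PATH: {', '.join(missing)}")
    # Print the Tesseract version line so you can confirm it is 4.x or newer.
    result = subprocess.run(["tesseract", "--version"],
                            capture_output=True, text=True)
    print((result.stdout or result.stderr).splitlines()[0])

if __name__ == "__main__":
    check_tools()
```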
You need a recent version of Python 3.x. The Python library Pillow is used for image processing.
If you don't have a global installation, the GUI will try to install Pillow and other required Python modules on the first run.
The 'python' or 'python3' command must work from the project's directory (Python's executable folder should be in your PATH environment variable).
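For reference, that first-run check is roughly equivalent to the following sketch (my own approximation, not the GUI's actual code): verify the interpreter version and install Pillow with pip if the module cannot be imported.

```python
# Rough approximation of a first-run dependency check: confirm the Python
# version and install Pillow via pip if it is not importable.
import importlib.util
import subprocess
import sys

if sys.version_info < (3, 6):  # "recent Python 3.x" is assumed to mean 3.6+
    raise SystemExit("A recent Python 3.x interpreter is required.")

if importlib.util.find_spec("PIL") is None:
    # Pillow is imported under the name 'PIL'; install it for this interpreter.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "Pillow"])

from PIL import Image  # noqa: E402  (imported after the optional install)
print("Pillow is available:", Image.__name__)
```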
Tesseract expects some configuration data (the file radical-stroke.txt). The GUI will download it automatically from this address when needed and place it in the configurable "Output data directory".
Tesstrain GUI will ask you for a name for your model. By convention, Tesseract stack models including language-specific resources use (lowercase) three-letter codes defined in ISO 639, with additional information separated by underscores: e.g. chi_tra_vert for traditional Chinese with vertical typesetting. Language-independent (i.e. script-specific) models use the capitalized name of the script type as identifier: e.g. Hangul_vert for Hangul script with vertical typesetting. In the following, the model name is referenced as MODEL_NAME.
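Purely as an illustration of that naming convention (the GUI does not enforce these patterns, and the regular expressions below are my own assumption), a candidate model name could be checked like this:

```python
# Illustrative check of the naming convention; the patterns are assumptions.
# Language-specific: lowercase ISO 639 three-letter code plus optional
# underscore-separated suffixes (e.g. chi_tra_vert).
# Script-specific: capitalized script name plus optional suffixes
# (e.g. Hangul_vert).
import re

def is_language_model_name(name: str) -> bool:
    return re.fullmatch(r"[a-z]{3}(_[a-z0-9]+)*", name) is not None

def is_script_model_name(name: str) -> bool:
    return re.fullmatch(r"[A-Z][A-Za-z]*(_[a-z0-9]+)*", name) is not None

for candidate in ("chi_tra_vert", "Hangul_vert", "frk", "my model"):
    ok = is_language_model_name(candidate) or is_script_model_name(candidate)
    print(f"{candidate!r}: {'looks valid' if ok else 'does not match the convention'}")
```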
Place ground truth consisting of line images and transcriptions in a folder of your choice (default: data/MODEL_NAME-ground-truth). The GUI will generate a list of those files and split it into training and evaluation data; the ratio can be defined in the GUI.
Images must be in .tif, .png, .bin.png, .nrm.png or .bmp format.
Transcriptions must be single-line plain text and have the same name as the line image, but with the image extension replaced by .gt.txt. If a supported image file has no corresponding .gt.txt file, you will be asked for its content at the start of training, and it will be saved in the proper file.
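The pairing of images and transcriptions can be sketched as follows, assuming the default data/MODEL_NAME-ground-truth folder: the script lists the line images that still lack a .gt.txt file (the ones you would be prompted about at the start of training).

```python
# Sketch of the image/transcription pairing in the ground-truth folder.
# Longer extensions (.bin.png, .nrm.png) are checked first so that the whole
# image extension is replaced by .gt.txt, as described above.
from pathlib import Path

GROUND_TRUTH_DIR = Path("data/MODEL_NAME-ground-truth")  # adjust to your model name
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".tif", ".png", ".bmp")

missing = []
for entry in sorted(GROUND_TRUTH_DIR.iterdir()):
    for suffix in IMAGE_SUFFIXES:
        if entry.name.endswith(suffix):
            gt_file = entry.with_name(entry.name[: -len(suffix)] + ".gt.txt")
            if not gt_file.exists():
                missing.append(entry)
            break

print(f"{len(missing)} line image(s) without a .gt.txt transcription")
for image in missing:
    print(" ", image.name)
```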
The repository contains a folder with sample ground truth, see ocrd-testset.
NOTE: If you want to generate line images for transcription from a full page, see tips in issue 7 and in particular @Shreeshrii's shell script.
Execute tesstrain_gui.exe and follow the displayed instructions.
Software is provided under the terms of the Apache 2.0 license.
Sample training data provided by Deutsches Textarchiv is in the public domain.