Source text

Train a printed-Hebrew Kraken model, according to the instructions here. The goal is to quickly test Kraken on Hebrew (i.e. not HTR of a manuscript). Since Kraken requires at least 800 lines transcribed text, and the process provided by Kraken is manual, an automation script was developed, to run all the preparations of the lines image-text pairs. After the training set is ready, the Kraken model training can be run.

Source text

The training text used is Genesis book, taken from Sefaria. The text is in Creative Commons license, taken from (http://www.tanach.us/Tanach.xml)

Requirements

On a Debian Linux (Ubuntu 16.04 works well), install Kraken along with the Python installation and Vagrant image required for the training. Instruction here. The script requires:

Beautiful Soup 4 Python module:

pip install beautifulsoup4

ImageMagic to convert text into images:

sudo apt-get install imagemagick

Training set preparation

Run the script src/train_kraken_print.py The script does the following:

Breaks the Genesis JSON file into chapters, and for each chapter:
- Extracts the verses into text file, verse per line.
- Generates an image, big enough for Kraken - 5000x8000px, using ImageMagic. Since ImageMagic doesn't seem to support LTR, the Hebrew words are reversed. It's simplistic, but works for pure-Hebrew text. For mixed text, bidirectional support is required.
- Runs the Kraken ketos utility, to generate the HTML file, intended for manual transcription
- Fills the HTML transcription file automatically, with the text from the matching text file.

Running the script generates HTML files of the transcribed images to ./work/ketos_filled_*.html, along with byproducts.

Extract lines image-text pairs

Now that the transcription file are ready, run the Kraken command: ketos extract --reorder --output training --normalization NFD work/ketos_filled_* The files will be generated into training/ directory.

Kraken training

Get and run the Kraken Vagrant virtual machine. Make sure to run Vagrant in the same directory in which the train_kraken_script.py script was run: vagrant init openphilology/kraken vagrant up vagrant ssh In the Kraken VM, run the training command: ./train.sh /vagrant/training heb_print It will take the training set, and train a model called heb_print.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
Genesis - he - Tanach with Text Only.json		Genesis - he - Tanach with Text Only.json
README.md		README.md
Vagrantfile		Vagrantfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Source text

Requirements

Beautiful Soup 4 Python module:

ImageMagic to convert text into images:

Training set preparation

Extract lines image-text pairs

Kraken training

About

Releases

Packages

Languages

uschor/training_kraken_hebrew_print

Folders and files

Latest commit

History

Repository files navigation

Source text

Requirements

Beautiful Soup 4 Python module:

ImageMagic to convert text into images:

Training set preparation

Extract lines image-text pairs

Kraken training

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages