ocr_errors_simulator

Functions and scripts for estimating OCR error probabilities and simulating OCR errors.

To build the charset, first run JSONL_reading.py to preprocess the ECCO file; it splits the compressed data into chunks stored in separate files. Then run charset.py to create the charset, which is written to a text file. To build the JSONL probabilities file, run CSV_convert.py to convert the CSV files into the format expected by OCR_errors_JSON_generator.py, then run that script. Once these steps have completed correctly, run OCR_noise.py to generate OCR noise.
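As a rough illustration of the charset step, here is a minimal sketch that collects the distinct characters found in JSONL records. It assumes each line is a JSON object with a hypothetical "text" field; the function name build_charset is also made up for this example, and the actual charset.py may work differently.

```python
import json
from collections import Counter


def build_charset(jsonl_lines, text_key="text"):
    """Return the sorted list of distinct characters seen in the records.

    Assumption: every non-empty line is a JSON object whose text lives
    under `text_key` (a hypothetical field name, not taken from the repo).
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts.update(record.get(text_key, ""))  # counts each character
    return sorted(counts)


sample = ['{"text": "abca"}', '{"text": "cd"}']
print("".join(build_charset(sample)))  # prints "abcd"
```

Writing the joined string to a file would give a plain-text charset comparable to the one passed to --charset.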

To run OCR_noise.py, use the following command, adapting the parts in braces (the seed option and the input and output paths) to your setup:

python3 OCR_noise.py {--seed 0} --charset data_files/ecco_charset.txt --charset-probs data_files/ecco_i/ecco_i_probs.jsonl {clean/texts/path.jsonl} > {noised/texts/path.jsonl}
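To give an idea of what seeded, probability-driven noise can look like, here is a minimal sketch of character-level substitution. It is not the implementation in OCR_noise.py: the function add_noise, the shape of the probs mapping, and the sample probabilities are all assumptions made for illustration.

```python
import random


def add_noise(text, probs, rng):
    """Replace each character by sampling from its substitution distribution.

    `probs` maps a character to {replacement: probability} (an assumed
    format, not the repo's JSONL schema). Characters without an entry are
    kept unchanged; an empty-string replacement would model a deletion.
    """
    out = []
    for ch in text:
        dist = probs.get(ch)
        if not dist:
            out.append(ch)
            continue
        choices = list(dist)
        weights = [dist[c] for c in choices]
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(out)


# Hypothetical probabilities, invented for this example.
probs = {"e": {"e": 0.9, "c": 0.1}, "l": {"l": 0.8, "1": 0.2}}
rng = random.Random(0)  # a fixed seed makes the noise reproducible
print(add_noise("hello world", probs, rng))
```

Passing a dedicated random.Random instance rather than relying on the global generator keeps runs with the same seed reproducible, which is presumably the purpose of the --seed option.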