I OCR German Fraktur newspapers. A PDF of 1-2 GB contains approx. 1000 pages. OCR runs overnight (i5-3570), which is fine. But the search+replace of 3 specific Fraktur chars (Fraktur s, Fraktur hyphen and Fraktur long hyphen) with the corresponding normal chars takes forever: after 4 hours I aborted the program.
It seems that the algorithm used isn't efficient enough for long files and could/should be improved. BTW, if I use a bypass (exporting the hOCR, replacing the 3 chars e.g. with a Perl script, and re-importing the corrected hOCR), it is done in minutes.
BTW: any chance to implement a PDF export format in which the image format remains unchanged? Any format change results in much larger files.
I did some tests to work around the problem. This was possible by first exporting the hOCR file, running a Perl script that does the search & replace in the hOCR file, and then re-importing the resulting hOCR file. This procedure took only a very small fraction of the time compared with the existing search+replace routine, AND the exported PDF contained the replaced chars/strings!
I didn't check which algorithm is implemented in gImageReader. But, at least for huge numbers of images, I'd suggest trying to implement an algorithm along the lines described above.
BTW, this workaround shows that gImageReader is already able (obviously thanks to the podofo library) to create readable PDFs from an hOCR file and images, a task many programs fail at, e.g. hocr-tools and many others. I'd suggest promoting this feature after thorough testing.
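For reference, the replacement script is roughly along these lines; the exact Unicode code points below are assumptions (adjust them to whatever glyphs Tesseract actually emits in your hOCR):

```perl
#!/usr/bin/perl
# Sketch of the hOCR search & replace workaround. The three mappings are
# assumptions standing in for the Fraktur s / hyphen / long hyphen glyphs.
use strict;
use warnings;
use open qw(:std :utf8);   # read/write the hOCR stream as UTF-8

while (my $line = <STDIN>) {
    $line =~ s/\x{017F}/s/g;   # long s (ſ) -> s
    $line =~ s/\x{2E17}/-/g;   # double oblique hyphen (⸗) -> -
    $line =~ s/\x{2014}/-/g;   # long hyphen/dash -> - (assumed third glyph)
    print $line;
}
```

Usage would be something like `perl fix_fraktur.pl < exported.hocr > corrected.hocr` (file names are just placeholders), then re-import the corrected hOCR in gImageReader and export the PDF from there.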