Tesseract OCR -- downsample original TIFF before OCR if too large #2825

jrochkind · 2024-12-17T16:44:45Z

Images that are too large cause tesseract on our heroku infrastructure to crash by using too much RAM.

Many of these images are 600 dpi. What if we downsampled to 300 dpi before OCR? That would probably bring very many of them within limits again. Heck, even if the original is 300 dpi, a 150 dpi version is probably still OCR'able.

Based on experience (see #2820), images start causing problems at around 200MB or around 70 million pixels (height x width > 70 million).

The pixel boundary is probably more reliable. If more than 70 million pixels, downsample by 50% before OCR'ing?
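A rough sketch of that check, in Python for illustration only (not the app's actual code). The function name `ocr_dimensions` and its parameters are hypothetical. It assumes "downsample by 50%" means halving each dimension, which matches going from 600 dpi to 300 dpi and cuts the pixel count to a quarter.

```python
from typing import Tuple

# Threshold from the discussion above: more than 70 million pixels is "too large".
MAX_PIXELS = 70_000_000

# Assumption: "downsample by 50%" is applied to each dimension (600 dpi -> 300 dpi).
SCALE = 0.5


def ocr_dimensions(width: int, height: int,
                   max_pixels: int = MAX_PIXELS,
                   scale: float = SCALE) -> Tuple[int, int]:
    """Return the (width, height) the image should have before OCR.

    Images at or under the pixel threshold are returned unchanged;
    larger ones have both dimensions multiplied by `scale`.
    """
    if width * height <= max_pixels:
        return (width, height)
    # Never return a zero-sized dimension, even for odd inputs.
    return (max(1, round(width * scale)), max(1, round(height * scale)))
```

The actual resize would then be done by whatever image tool is already in the pipeline, using the dimensions returned here. A single 50% pass may not be enough for extremely large originals; whether to repeat it is an open question this sketch does not answer.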

jrochkind added zenhub-inbox and removed zenhubinbox labels Dec 18, 2024
