Tesseract OCR -- downsample original TIFF before OCR if too large #2825

Open
jrochkind opened this issue Dec 17, 2024 · 0 comments

Very large images cause Tesseract to crash on our Heroku infrastructure by using too much RAM.

Many of these images are 600 dpi. What if we downsampled them to 300 dpi before OCR? That would probably bring very many of them back within limits. Heck, even if the original is 300 dpi, a 150 dpi version is probably still OCR'able.

Based on experience (see #2820), images start causing problems at around 200 MB, or around 70 million pixels (height x width > 70 million).

The pixel boundary is probably the more reliable of the two. If an image has more than 70 million pixels, downsample by 50% before OCR'ing?
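A rough sketch of that check, in Python only for illustration (not the project's actual code). The name `ocr_dimensions` and its parameters are hypothetical, and it assumes "downsample by 50%" means halving each linear dimension, as in 600 dpi to 300 dpi, which cuts the pixel count to about a quarter. It only computes target dimensions; the actual resize would be done by whatever image tool is already in use.

```python
# Approximate threshold from the observations above; a tunable assumption.
MAX_OCR_PIXELS = 70_000_000


def ocr_dimensions(width, height, max_pixels=MAX_OCR_PIXELS, factor=0.5):
    """Return the (width, height) an image should have before OCR.

    Images at or under max_pixels are left alone. Larger images have each
    dimension scaled by `factor` (0.5 halves the dpi, e.g. 600 -> 300).
    """
    if width * height <= max_pixels:
        return width, height
    # Scale both dimensions equally so the aspect ratio is preserved.
    return max(1, round(width * factor)), max(1, round(height * factor))


# Hypothetical example: a 17 x 22 inch page scanned at 600 dpi.
# 10200 * 13200 = 134,640,000 pixels, which is over the threshold.
print(ocr_dimensions(10200, 13200))  # (5100, 6600), i.e. 33,660,000 pixels
```

A single halving may not be enough for extremely large originals (a 20000 x 20000 image would still be at 100 million pixels afterwards), so a real implementation might repeat the step or compute the scale factor from the threshold instead.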
