docker/install: build Tesseract from source #197

Merged
merged 13 commits into from
Feb 14, 2024

joschrew
Contributor

@joschrew joschrew commented Jan 31, 2024

This PR is part of a series that offers individual ocrd modules as Docker containers (ocrd slim containers) for use with the OCR-D network.

This Dockerfile currently doesn't work in all cases and still needs updates. I created the PR anyway because I need it for my tests. EDIT: it works now. (This basically migrates all the install-tesseract rules from ocrd_all's Makefile here, where they actually belong.)

My idea was to possibly build the Tesseract container with ocrd_all:

cd ocrd_all
git submodule update --init tesserocr/ core/ tesseract/ ocrd_tesserocr/
docker build --build-arg="OCRD_MODULES=core ocrd_tesserocr tesseract tesserocr " --no-cache -t my-ocrd-slim-container .

codecov bot commented Jan 31, 2024

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

@stweil
Contributor

stweil commented Feb 6, 2024

I wonder whether there are still reasons for building the tesseract binary.

Using the package from a recent Linux distribution is simpler and would save significant build time.

Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

@bertsky
Collaborator

bertsky commented Feb 6, 2024

> I wonder whether there are still reasons for building the tesseract binary.
>
> Using the package from a recent Linux distribution is simpler and would save significant build time.

Because most of the time, we cannot use Tesseract from a Linux distribution: our base distro is usually older than the current one, and we have no control over Tesseract features that we actually need. The same goes for PPA.

We had good reasons to pin to a specific Tesseract version via source build in subrepo. No reason to give that up now.

> Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

Much simpler: conda

@joschrew
Contributor Author

joschrew commented Feb 8, 2024

@kba: Your changes resolved all the errors with my test workspace. I added a resmgr call to the Docker image to add the eng traineddata; without it, I get an error when trying to process.

Edit: Maybe equ.traineddata and osd.traineddata should be added as well; I am not sure.
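
A minimal sketch of what such a resmgr step could look like, not the exact call added to the Docker image. It assumes the models are registered resources of the `ocrd-tesserocr-recognize` processor; the model list and the `DRY_RUN` switch are hypothetical and exist only for this illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: fetch Tesseract models through the OCR-D resource manager.
# Assumptions (not taken from the PR): the processor name
# ocrd-tesserocr-recognize and the model list below.
# DRY_RUN is a hypothetical switch; 1 (default) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
MODELS="eng equ osd"

resmgr_cmd() {
    # Print the resmgr invocation for one model name (without extension).
    printf 'ocrd resmgr download ocrd-tesserocr-recognize %s.traineddata\n' "$1"
}

for model in $MODELS; do
    cmd=$(resmgr_cmd "$model")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Intentionally unquoted so the string splits into command and arguments.
        $cmd
    fi
done
```

Running it with `DRY_RUN=0` inside the image would perform the downloads, provided `ocrd` is on the `PATH`.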

Collaborator

@bertsky bertsky left a comment

Great that this is working now.

Some cosmetic change requests below. Adapting CircleCI config should follow.

@bertsky
Collaborator

bertsky commented Feb 9, 2024

> Adapting CircleCI config should follow.

In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.

So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.
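
The guess about TESSDATA_PREFIX can be checked with a few lines of shell. This is only a sketch: it assumes `TESSDATA_PREFIX` points directly at the directory holding the `*.traineddata` files, and `has_model` is a hypothetical helper, not part of OCR-D or Tesseract.

```shell
#!/bin/sh
# Sketch: report whether a Tesseract model is visible under TESSDATA_PREFIX.
# Assumption: TESSDATA_PREFIX names the directory that contains *.traineddata.
# has_model is a hypothetical helper introduced for this illustration.
has_model() {
    # $1 = tessdata directory, $2 = model name without extension
    [ -f "$1/$2.traineddata" ]
}

tessdata_dir="${TESSDATA_PREFIX:-}"
if [ -z "$tessdata_dir" ]; then
    echo "TESSDATA_PREFIX is not set"
elif has_model "$tessdata_dir" eng; then
    echo "eng.traineddata found in $tessdata_dir"
else
    echo "eng.traineddata missing from $tessdata_dir"
fi
```

If the resmgr location and `TESSDATA_PREFIX` disagree, the last branch would be the one that fires in CI.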

@bertsky
Collaborator

bertsky commented Feb 12, 2024

> In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.
>
> So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.

Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).
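
The install order described here can be sketched as a small script. The make targets are the ones named in this thread; the `run` and `ci_install` helpers and the `DRY_RUN` switch are hypothetical, and the real CircleCI config may structure these steps differently.

```shell
#!/bin/sh
# Sketch of the install order for CI: build Tesseract and tesserocr first,
# then install the package itself.
# run, ci_install and DRY_RUN are hypothetical names for this illustration.
set -e
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

ci_install() {
    run make install-tesseract install-tesserocr
    run make install
}

ci_install
```

With `set -e`, a failing Tesseract build stops the script before `make install` is attempted.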

@bertsky
Collaborator

bertsky commented Feb 13, 2024

> Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).

@joschrew do you want me to make that change (on your fork's writable branch)?

make deps-ubuntu no longer fetches Tesseract via PPA, so we need to make install-tesseract

also, drop unsupported Python 3.6
(since normal Circleci `checkout` creates empty submodule directories)
using VIRTUAL_ENV from PYENV_ROOT
@bertsky bertsky self-requested a review February 14, 2024 10:53
@bertsky bertsky marked this pull request as ready for review February 14, 2024 10:53
Collaborator

@bertsky bertsky left a comment

At last!

@bertsky bertsky requested a review from kba February 14, 2024 10:54
@bertsky bertsky changed the title from "Update dockerfile" to "docker/install: build Tesseract from source" Feb 14, 2024
@bertsky
Collaborator

bertsky commented Feb 14, 2024

Oh, maybe we should also migrate make install tesseract-training here? (Once we remove these rules from ocrd_all, there would be no more way to compile lstmtraining, combine_tessdata etc.)

@bertsky bertsky merged commit bf29777 into OCR-D:master Feb 14, 2024
3 checks passed
@bertsky bertsky mentioned this pull request Feb 22, 2024