Upgrading Dangerzone 0.5.0 to 0.6.0 on Fedora 38 may / will break the OCR component #737

Closed
deeplow opened this issue Mar 4, 2024 · 3 comments · Fixed by #741

Comments

@deeplow
Contributor

deeplow commented Mar 4, 2024

Upgrading Dangerzone 0.5.0 to 0.6.0 on Fedora 38 may / will break the OCR component. This can be easily fixed by appending the line

    export TESSDATA_PREFIX=/usr/share/tesseract/tessdata

to the file .bash_profile in the disposable template used for the Dangerzone dispVM.
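For anyone scripting this workaround, the append can be made idempotent so that re-running it does not add the line twice. This is only a minimal sketch: `ensure_line` is a hypothetical helper (not part of Dangerzone), and the path you pass in is whatever `.bash_profile` your disposable template actually uses.

```python
from pathlib import Path

# The line suggested in the report above.
EXPORT_LINE = "export TESSDATA_PREFIX=/usr/share/tesseract/tessdata"


def ensure_line(profile: Path, line: str = EXPORT_LINE) -> bool:
    """Append `line` to `profile` unless an identical line is already there.

    Returns True if the file was modified. Hypothetical helper, for
    illustration only.
    """
    existing = profile.read_text() if profile.exists() else ""
    if line in existing.splitlines():
        return False
    # If the file does not end with a newline, add one first so the
    # export statement starts on its own line.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with profile.open("a") as f:
        f.write(f"{prefix}{line}\n")
    return True
```

Calling `ensure_line(Path.home() / ".bash_profile")` inside the template would apply the workaround once and do nothing on later runs.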

Originally posted by @GWeck in #704 (comment)

@deeplow
Contributor Author

deeplow commented Mar 4, 2024

Great catch! On Qubes, we had tested a dev build for Fedora 38 templates and a production build for Fedora 39 templates, and yet we missed it 😬. There are two reasons we missed it:

  1. The PyMuPDF version on Fedora 39 is 1.23.3, which accepts the Tesseract data path as a separate argument. Our code checks for the PyMuPDF version and does pass the correct path:

    def get_tessdata_dir() -> str:
        if running_on_qubes():
            return "/usr/share/tesseract/tessdata/"
        else:
            return "/usr/share/tessdata/"

    if int(fitz.version[2]) >= 20230621000001:
        page_pdf_bytes = pixmap.pdfocr_tobytes(
            compress=True,
            language=ocr_lang,
            tessdata=get_tessdata_dir(),
        )
    else:
        # XXX method signature changed in v1.22.5 to add tessdata arg
        # TODO remove after oldest distro has PyMuPDF >= v1.22.5
        page_pdf_bytes = pixmap.pdfocr_tobytes(
            compress=True,
            language=ocr_lang,
        )

  2. The dev script on Qubes has this line, so that's why the issue did not manifest on our local tests:

    # XXX workaround lack of tessdata path arg for PyMuPDF < v1.22.5
    # for context see https://github.com/freedomofpress/dangerzone/issues/682
    os.environ["TESSDATA_PREFIX"] = os.environ.get("TESSDATA_PREFIX", "/usr/share/tesseract/tessdata")

Originally posted by @apyrgio in #704 (comment)
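To make the gap described in the two points above easier to see, here is a small self-contained sketch of the same version gate, written so it runs without PyMuPDF installed. `plan_ocr_call` and its boolean return value are hypothetical names introduced for illustration; the threshold and the two paths are copied from the snippet quoted above, and the build stamps in the usage note are made-up inputs, not real PyMuPDF values.

```python
from typing import Any, Dict, Tuple

# Build stamp used in the version check quoted above.
TESSDATA_ARG_MIN_STAMP = 20230621000001


def get_tessdata_dir(on_qubes: bool) -> str:
    # Same two paths as in the quoted code; the Qubes check is passed in
    # as a flag here so the sketch has no external dependencies.
    return "/usr/share/tesseract/tessdata/" if on_qubes else "/usr/share/tessdata/"


def plan_ocr_call(
    version_stamp: int, ocr_lang: str, on_qubes: bool
) -> Tuple[Dict[str, Any], bool]:
    """Return the keyword arguments for the OCR call, plus a flag that is
    True when the data path must reach the library some other way
    (for example via TESSDATA_PREFIX)."""
    kwargs: Dict[str, Any] = {"compress": True, "language": ocr_lang}
    if version_stamp >= TESSDATA_ARG_MIN_STAMP:
        kwargs["tessdata"] = get_tessdata_dir(on_qubes)
        return kwargs, False
    # Older signature: no tessdata argument, so the path is not passed here.
    return kwargs, True
```

With a stamp at or above the threshold, `plan_ocr_call(20230621000001, "eng", True)` includes `tessdata` and returns `False` for the flag. With an older stamp such as `20230101000000`, the flag is `True`: that is the branch where only the dev script supplied the path, so production installs had nothing setting it.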

@deeplow deeplow mentioned this issue Mar 4, 2024
60 tasks
apyrgio added a commit that referenced this issue Mar 4, 2024
Provide a fix for an OCR bug that affected Fedora 38 templates of Qubes
OS. In that specific configuration, the PyMuPDF version accepts the
Tesseract data directory only from the `TESSDATA_PREFIX` environment
variable. Our mistake was that we were setting this environment variable
in a dev script, instead of setting it for all configurations.

In this commit, we set an attribute in the fitz.fitz module, so that
both dev scripts and end-user installations can work. This is hacky, but
it targets an old PyMuPDF release after all, so we don't expect things
to break in the long run.

Fixes #737
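The commit message does not spell out why a module attribute is set rather than the environment variable. One plausible reason, assumed here and not confirmed in this thread, is that the library snapshots `TESSDATA_PREFIX` once when it is imported, so later changes to the environment are not seen. The sketch below demonstrates that general pattern with a stand-in module built from `types.ModuleType`; `make_fake_ocr_module`, the attribute name, and the error message are all hypothetical and do not reproduce PyMuPDF's internals.

```python
import types
from typing import Mapping


def make_fake_ocr_module(environ: Mapping[str, str]) -> types.ModuleType:
    """Build a stand-in module that reads TESSDATA_PREFIX once, at creation
    time, the way a library would if it read the variable at import."""
    mod = types.ModuleType("fake_ocr")
    # Snapshot taken now; later changes to `environ` are not observed.
    mod.TESSDATA_PREFIX = environ.get("TESSDATA_PREFIX")

    def ocr() -> str:
        # The lookup happens at call time, against the module attribute.
        if not mod.TESSDATA_PREFIX:
            raise RuntimeError("TESSDATA_PREFIX is not set")
        return mod.TESSDATA_PREFIX

    mod.ocr = ocr
    return mod


env = {}                                  # variable unset at "import" time
fake = make_fake_ocr_module(env)
env["TESSDATA_PREFIX"] = "/usr/share/tesseract/tessdata"  # too late

try:
    fake.ocr()
    env_change_seen = True
except RuntimeError:
    env_change_seen = False

# Patching the module attribute directly is what takes effect.
fake.TESSDATA_PREFIX = "/usr/share/tesseract/tessdata"
patched_value = fake.ocr()
```

Under that assumption, patching the attribute works for both dev scripts and end-user installs, because it does not depend on when the environment was populated.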
@apyrgio apyrgio closed this as completed in f75d471 Mar 5, 2024
@apyrgio
Contributor

apyrgio commented Mar 5, 2024

This issue has been fixed in our repo, but we also need to ship a new Dangerzone version for affected users. What we plan to do shortly is:

  1. Use the latest commit in main (f75d471 as of writing this).

  2. Bump the release number from 1 to 2 in install/linux/dangerzone.spec:

    Release: 1%{?dist}

  3. Build a dangerzone-qubes-0.6.0-2.fc38.x86_64.rpm package. That is, build an RPM only for Fedora 38 and Qubes.

  4. Publish this package in our yum-tools-prod repo, so that users who have installed 0.6.0-1 will get updated to 0.6.0-2.

  5. Create a v0.6.0-2 tag in the Dangerzone repo, as we did for the Fedora 37 hotfix in v0.4.0.
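Step 2 above is a one-line edit to the spec file. If it ever needs to be scripted, a rough sketch could look like the following; `bump_release` is a hypothetical helper (not something in the Dangerzone repo), and it assumes the `Release:` value begins with a plain integer, as in the line shown in step 2.

```python
import re

# Matches e.g. "Release: 1%{?dist}" and captures the leading integer.
_RELEASE_RE = re.compile(r"^(Release:[ \t]*)(\d+)(.*)$", re.MULTILINE)


def bump_release(spec_text: str) -> str:
    """Return `spec_text` with the integer in its first `Release:` line
    incremented by one. Raises ValueError if no such line exists."""

    def repl(m: "re.Match[str]") -> str:
        return f"{m.group(1)}{int(m.group(2)) + 1}{m.group(3)}"

    new_text, count = _RELEASE_RE.subn(repl, spec_text, count=1)
    if count != 1:
        raise ValueError("no 'Release: <number>' line found")
    return new_text
```

For the spec line quoted in step 2, `bump_release("Release: 1%{?dist}\n")` returns `"Release: 2%{?dist}\n"`, and the `%{?dist}` suffix is left untouched.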

apyrgio added a commit to apyrgio/yum-tools-prod that referenced this issue Mar 5, 2024
The Fedora 38 build for `dangerzone-qubes` had a bug in the OCR phase.
Publish a new dangerzone-qubes RPM that fixes it, with a bump in the
release number from 1 to 2, so that end-users can get upgraded.

Refs freedomofpress/dangerzone#737
@apyrgio
Contributor

apyrgio commented Mar 5, 2024

@GWeck we have a 0.6.0-2 release out for Fedora 38 that fixes this particular problem. If you still have any issues with the new update, please let us know. Again, thanks a lot for the bug report 🙂

apyrgio added a commit that referenced this issue Mar 5, 2024
Bump the release number from 1 to 2, so that we can build a
dangerzone-qubes 0.6.0-2 package for Fedora 38 that existing 0.6.0 users
can update to.

Refs #737