Skip to content
/ opticr Public

Library to expose a single interface and API to few OCR tools (google vision, tesseract)

License

Notifications You must be signed in to change notification settings

lzayep/opticr

Repository files navigation

opticr

Python library to expose a single interface and API to few OCR tools (google vision, Tesseract)

Install

Required binaries available in the $PATH

poppler-utils (pdf2image)

https://github.com/Belval/pdf2image#how-to-install

tesseract

https://tesseract-ocr.github.io

Install OpticR

With pip

pip install opticr

With poetry

poetry add opticr

or to get the latest 'dangerous' version

poetry add  git+https://github.com/lzayep/opticr@main

Usage

from opticr import OpticR

ocr = OpticR("tesseract")
pathtofile = "test/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile)

With google-vision:

from opticr import OpticR

ocr = OpticR("google-vision", options={"google-vision": {"auth": {"token": ""}}})

# file could come from an URL
pathtofile = "https://example.com/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile)

Cache the result, if the file as already been OCR return immediatly the previous result. Result are stored temporarly in the local storage or shared storage such as Redis.

from opticr import OpticR

ocr = OpticR("tesseract", options={"cache":
                         {"backend": "redis", redis: "redis://"}}

# file could come from an URL
pathtofile = "https://example.com/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile, cache=True)

About

Library to expose a single interface and API to few OCR tools (google vision, tesseract)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published