Skip to content

Latest commit

 

History

History
70 lines (46 loc) · 1.48 KB

README.md

File metadata and controls

70 lines (46 loc) · 1.48 KB

opticr

Python library to expose a single interface and API to few OCR tools (google vision, Tesseract)

Install

Required binaries available in the $PATH

poppler-utils (pdf2image)

https://github.com/Belval/pdf2image#how-to-install

tesseract

https://tesseract-ocr.github.io

Install OpticR

With pip

pip install opticr

With poetry

poetry add opticr

or to get the latest 'dangerous' version

poetry add  git+https://github.com/lzayep/opticr@main

Usage

from opticr import OpticR

ocr = OpticR("tesseract")
pathtofile = "test/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile)

With google-vision:

from opticr import OpticR

ocr = OpticR("google-vision", options={"google-vision": {"auth": {"token": ""}}})

# file could come from an URL
pathtofile = "https://example.com/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile)

Cache the result, if the file as already been OCR return immediatly the previous result. Result are stored temporarly in the local storage or shared storage such as Redis.

from opticr import OpticR

ocr = OpticR("tesseract", options={"cache":
                         {"backend": "redis", redis: "redis://"}}

# file could come from an URL
pathtofile = "https://example.com/contract.pdf
pages: list[str] = ocr.get_pages(pathtofile, cache=True)