About - OCR Toolkit

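Search scanned PDF files for predefined keywords. The PDFs are read with OCR (Tesseract), and keywords are matched approximately using the Levenshtein (edit distance) algorithm, so small OCR misreadings can still match.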
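To illustrate the matching idea, here is a minimal sketch of Levenshtein distance in plain Python. It is not taken from the ocrmatcher source: the function names `levenshtein` and `similarity`, and the 0-100 normalization, are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0-100 score (assumption): 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


# A typical OCR misreading: the letter "o" read as the digit "0".
print(levenshtein("invoice", "inv0ice"))           # 1
print(round(similarity("invoice", "inv0ice"), 1))  # 85.7
```

One wrong character in a seven-character word already drops the score to about 86, which is why the similarity threshold matters for noisy scans.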

Prerequisites


Python
Tesseract

Install dependencies for Linux


Requires libtesseract (>=3.04) and libleptonica (>=1.71).

On Debian/Ubuntu:

$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

On RedHat/Fedora:

$ sudo dnf install tesseract tesseract-devel leptonica-devel leptonica

Install dependencies for Windows


  1. Tesseract Docs
  2. Tesseract
  3. Leptonica

Setup Project


$ git clone <project_repo>
$ cd <project_directory>/

Install source dependencies from the requirements file


$ pip install -r requirements/dev.txt

Package Build and Install


$ python -m build

For Windows

$ pip install dist/ocrmatcher-<version>-py3-none-any.whl

For Linux

$ pip install dist/ocrmatcher-<version>.tar.gz

Usage


  1. Create a dataset folder in the current directory
  2. Add the scanned PDF files to the dataset directory
  3. Add a keywords.txt file to the dataset directory
  4. Add the search keywords to keywords.txt (one keyword per line, without numbering)
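The steps above can also be scripted. The following is a small sketch under the assumption that the tool reads `dataset/keywords.txt` relative to the working directory, as the steps describe. `prepare_dataset` is a hypothetical helper, not part of ocrmatcher, and the demo writes into a temporary directory rather than your project.

```python
import tempfile
from pathlib import Path


def prepare_dataset(root: Path, keywords: list[str]) -> Path:
    """Create <root>/dataset and write one keyword per line to keywords.txt."""
    dataset = root / "dataset"
    dataset.mkdir(parents=True, exist_ok=True)
    keywords_file = dataset / "keywords.txt"
    # Drop blank entries and surrounding whitespace; no numbering is added.
    lines = [k.strip() for k in keywords if k.strip()]
    keywords_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return keywords_file


# Demo in a throwaway directory; pass Path.cwd() to use the current directory.
with tempfile.TemporaryDirectory() as tmp:
    path = prepare_dataset(Path(tmp), ["invoice number", "total amount"])
    print(path.read_text(encoding="utf-8"))
```

The scanned PDF files still have to be copied into the same dataset directory by hand or by your own script.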

Commands


List of available commands

$ ocrmatcher --help

Or

$ python -m ocrmatcher --help

Add new keywords with the add-keywords command

$ ocrmatcher add-keywords --k my-search-keyword1 my-search-keyword2 etc.

Search Keywords

$ ocrmatcher search 

Run with a specific OCR language

Search Keywords

$ ocrmatcher search --lang Occupant-Pigs

Run with a specific similarity threshold for comparing two strings (default: 95)

Search Keywords

$ ocrmatcher search --threshold 75
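To show what lowering the threshold does, here is a rough sketch that compares one keyword against a list of OCR words. The scoring formula (edit distance normalized to 0-100) and the names `edit_distance`, `score` and `find_matches` are assumptions for illustration. The README does not say how ocrmatcher computes its score, so the real tool may behave differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution
            ))
        previous = current
    return previous[-1]


def score(a: str, b: str) -> float:
    """Assumed 0-100 similarity: 100 means the strings are identical."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - edit_distance(a, b) / longest)


def find_matches(keyword: str, ocr_words: list[str], threshold: float = 95) -> list[str]:
    """Return the OCR words whose case-insensitive score reaches the threshold."""
    keyword = keyword.lower()
    return [w for w in ocr_words if score(keyword, w.lower()) >= threshold]


ocr_words = ["Invoice", "lnvoice", "Inv0ice", "Total"]
print(find_matches("invoice", ocr_words, threshold=95))  # ['Invoice']
print(find_matches("invoice", ocr_words, threshold=75))  # ['Invoice', 'lnvoice', 'Inv0ice']
```

In this sketch the default of 95 accepts only near-exact matches, while 75 tolerates one misread character in a short word. A lower threshold finds more matches in poor scans but also raises the chance of false positives.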

Convert PDF files to images

$ ocrmatcher pdf2img