Search for predefined keywords in scanned PDF files using the Levenshtein distance algorithm.
Python
Tesseract
Requires libtesseract (>=3.04) and libleptonica (>=1.71).
On Debian/Ubuntu:
$ sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
On RedHat/Fedora:
$ sudo dnf install tesseract tesseract-devel leptonica-devel leptonica
$ git clone <project_repo>
$ cd <project_directory>/
$ pip install -r requirements/dev.txt
$ python -m build
For Windows
$ pip install dist/ocrmatcher-<version>-py3-none-any.whl
For Linux
$ pip install dist/ocrmatcher-<version>.tar.gz
- Add a dataset folder to the current directory
- Add scanned PDF files to the dataset directory
- Add a keywords.txt file to the dataset directory
- Add search keywords to the keywords.txt file (each keyword must be on its own line, without numbering)
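The setup steps above can also be scripted. The following is a minimal sketch, not part of ocrmatcher itself: the helper name `prepare_dataset` is hypothetical, and it only assumes the layout described above (a `dataset` folder containing `keywords.txt` with one keyword per line). Scanned PDF files still need to be copied into the folder separately.

```python
from pathlib import Path


def prepare_dataset(root: Path, keywords: list[str]) -> Path:
    """Create root/dataset/keywords.txt with one keyword per line.

    Hypothetical helper for illustration; ocrmatcher does not ship it.
    """
    dataset = root / "dataset"
    dataset.mkdir(parents=True, exist_ok=True)

    # Drop blank entries and surrounding whitespace; no numbering is added.
    cleaned = [k.strip() for k in keywords if k.strip()]

    keywords_file = dataset / "keywords.txt"
    keywords_file.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return keywords_file


if __name__ == "__main__":
    path = prepare_dataset(Path.cwd(), ["my-search-keyword1", "my-search-keyword2"])
    print(f"Wrote {path}")
```

Note that this sketch overwrites an existing keywords.txt rather than appending to it.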
List of available commands
$ ocrmatcher --help
Or
$ python -m ocrmatcher --help
Add new keywords with the add-keywords command
$ ocrmatcher add-keywords --k my-search-keyword1 my-search-keyword2 etc.
Search Keywords
$ ocrmatcher search
Search keywords with a specific language
$ ocrmatcher search --lang Occupant-Pigs
Search keywords with a specific similarity threshold for comparing two strings (default: 95)
$ ocrmatcher search --threshold 75
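To give a feel for what a similarity threshold means, here is a minimal sketch of Levenshtein-based matching. It is an illustration only: the function names are hypothetical, and the 0 to 100 normalization (edit distance divided by the longer string's length) is an assumption, since the exact scoring formula ocrmatcher uses is not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score in [0, 100]: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(keyword: str, candidate: str, threshold: float = 95.0) -> bool:
    """Case-insensitive comparison against the threshold."""
    return similarity(keyword.lower(), candidate.lower()) >= threshold


if __name__ == "__main__":
    # "lnvoice" is a typical OCR misread of "invoice" (one substitution).
    print(is_match("invoice", "lnvoice"))                # False at the default 95
    print(is_match("invoice", "lnvoice", threshold=75))  # True at 75
```

Under this assumed scoring, a single misread character in a seven-letter word scores about 85.7, so lowering the threshold makes the search more tolerant of OCR errors at the cost of more false positives.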
Convert PDF files to images
$ ocrmatcher pdf2img