Digitization for Documents in Indian Languages

A repository that combines OCR with custom Layout Detection for Sanskrit and English documents. Simply upload your images or pdfs for OCR in the 'test images' folder of this repository, choose a model for OCR and Layout Detection, and recieve the OCR output and infered image in your output folder!

To run lp_ocr.py, wrapper for Layout Parser OCR for the first time:

Create a venv and activate:

virtualenv lp_ocr
source lp_ocr/bin/activate

Install all packages in the environment:

pip3 install -r requirements.txt
apt install tesseract-ocr
apt install libtesseract-dev
apt-get install poppler-utils

For layout detection and OCR:
- Run lp_ocr.py and select 'yes' when asked if layout detection should be applied
- Choose a custom layout model. eg. Choose a Sanskrit model if your image/pdf is in an Indic language.
- Choose an OCR model.
- Define your output folder name and find the OCR'd text + a Label Studio formatted output retrieved from infered bounding boxes from the Layout Detection model.
For document layout analysis of an image:
- Run layout_inference.py. This will return an infered image with masks and a json file with layout data - bounding boxes.
For OCR of a directory of images:
- Run lp_ocr.py and select 'no' when asked if layout detection should be applied, and supply your input image directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Digitization for Documents in Indian Languages

Files

README.md

Latest commit

History

README.md

File metadata and controls

Digitization for Documents in Indian Languages