There is no single recipe for preprocessing the files, because every file has to be handled differently depending on its type and content. In the end you just need a text file with the verse number at the beginning of each verse; the numbers can be added easily.
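Adding the verse numbers could look roughly like the sketch below. It assumes the text already has exactly one verse per line and that numbering starts at 1; the function name `number_verses` is made up for illustration.

```python
def number_verses(lines, start=1):
    """Prefix each non-empty line with a running verse number."""
    numbered = []
    verse_no = start
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines are dropped and do not consume a number
        numbered.append(f"{verse_no} {text}")
        verse_no += 1
    return numbered


# Placeholder input: one verse per line, with a stray blank line.
sample = ["first verse text", "", "second verse text"]
print("\n".join(number_verses(sample)))
```

If the source restarts numbering per chapter, call the function once per chapter instead of once per file.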
Here are a few things that might help you in the process.
-
Poppler:
Poppler provides a few binaries that help with many PDF-related tasks, such as converting a text-based PDF to a text file, converting an image-based PDF to TIFF files, getting PDF metadata, extracting text by PDF coordinates, and more.
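As a rough sketch, the two most common conversions can be driven from Python. This assumes the Poppler binaries `pdftotext` and `pdftoppm` are on your PATH; the helper names and file names are hypothetical, and the flags used (`-layout`, `-f`, `-l`, `-tiff`, `-r`) are standard Poppler options.

```python
import subprocess


def pdftotext_cmd(pdf_path, txt_path, first_page=None, last_page=None, keep_layout=True):
    """Build a pdftotext command for a text-based PDF."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def pdftoppm_tiff_cmd(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each page to a TIFF file."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), pdf_path, out_prefix]


def run(cmd):
    """Run the command and raise CalledProcessError if the tool fails."""
    subprocess.run(cmd, check=True)
```

For example, `run(pdftotext_cmd("input.pdf", "output.txt"))` would write the text of a text-based PDF to `output.txt`, and `run(pdftoppm_tiff_cmd("scanned.pdf", "page"))` would render an image-based PDF to one TIFF per page, with file names starting with `page`.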
Windows users can install it by performing these steps -
qpdf:
Some PDFs are secured or encrypted. To remove the restrictions, they can be decrypted with:
qpdf --decrypt secured.pdf unsecured.pdf
-
OpenCV:
In case you want to process TIFF images, you will have to preprocess them (cropping etc.) before feeding them to OCR software. OpenCV can be installed for Python using:
pip3 install opencv-python
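A minimal sketch of such a preprocessing step is shown below: it crops fixed margins and binarizes the page with Otsu thresholding. The margins, file names, and function names are assumptions for illustration; the right preprocessing depends entirely on your scans.

```python
def crop_box(width, height, top=0, bottom=0, left=0, right=0):
    """Return (x0, y0, x1, y1) after trimming the given pixel margins."""
    x0, y0 = left, top
    x1, y1 = width - right, height - bottom
    if x0 >= x1 or y0 >= y1:
        raise ValueError("margins remove the whole image")
    return x0, y0, x1, y1


def preprocess_page(in_path, out_path, **margins):
    """Crop a scanned page and convert it to black and white for OCR."""
    import cv2  # imported here so crop_box can be used without OpenCV installed

    image = cv2.imread(in_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(in_path)
    height, width = image.shape
    x0, y0, x1, y1 = crop_box(width, height, **margins)
    cropped = image[y0:y1, x0:x1]
    # Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(cropped, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite(out_path, binary)
```

A call such as `preprocess_page("page-01.tif", "page-01-clean.tif", top=120, bottom=80)` would trim a header and footer strip before OCR.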
-
Tesseract:
Once you have good-quality TIFF images, they can be converted to text using Tesseract -
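One possible way to call it from Python is sketched below. It assumes the `tesseract` binary is on your PATH and that the language data for `lang` is installed; the helper names are hypothetical. The `--psm` spelling applies to Tesseract 4 and later (older versions used `-psm`).

```python
import subprocess


def tesseract_cmd(image_path, output_base, lang="eng", psm=None):
    """Build a tesseract command; the text goes to output_base + '.txt'."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]  # page segmentation mode
    return cmd


def ocr_image(image_path, output_base, lang="eng", psm=None):
    """Run OCR on one image and return the path of the text file written."""
    subprocess.run(tesseract_cmd(image_path, output_base, lang, psm), check=True)
    return output_base + ".txt"
```

For example, `ocr_image("page-01-clean.tif", "page-01")` would write the recognized text to `page-01.txt`.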
Regular Expressions:
You have to be good at using regular expressions; they will help in solving some of the toughest problems, which are otherwise hard to solve programmatically. -
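As one illustration, the sketch below uses two small patterns to clean up extracted text: it normalizes verse markers such as `1.` or `2)` to a plain number, and joins lines that were wrapped in the middle of a verse. The input format is an assumption; real files will need their own patterns.

```python
import re

# A verse number followed by ".", ":" or ")" at the start of a line.
NUMBER_MARK = re.compile(r"^(\d+)[.:)]\s*")
# A line that starts a new verse: digits followed by whitespace.
VERSE_START = re.compile(r"^\d+\s")


def join_wrapped_lines(lines):
    """Return one string per verse, each starting with its verse number."""
    verses = []
    for line in lines:
        text = NUMBER_MARK.sub(r"\1 ", line.strip())
        if not text:
            continue
        if VERSE_START.match(text) or not verses:
            verses.append(text)
        else:
            # No leading number: treat it as a continuation of the last verse.
            verses[-1] += " " + text
    return verses


raw = ["1. first verse starts", "and continues here", "", "2) second verse"]
print("\n".join(join_wrapped_lines(raw)))
```

Note that a continuation line that happens to begin with a number would be mistaken for a new verse, so the result should still be checked by eye.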
Github Desktop:
Create a new repo on GitHub to store the files to be preprocessed, and do the preprocessing inside the git repo (i.e. commit after every change to the files). This lets you go back to the initial file in case you make a mistake, and the GitHub Desktop GUI lets you see what has changed.
I have written a few notes for myself. They are pretty unstructured, but in case you want to have a look, you can see them here
You can always ask me in case you are stuck somewhere.