-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade document extraction #187
Conversation
Beacuse of changes to LXML
Add pdfplumber as main tool for extracting text from a PDF - and add a strip margin flag to enable cropping out text in the margins and removing skewed text
Added and fixed tests Modified one test pdf to better reflect the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?
I also worry about if we move to striping margins by default that that will cause trouble down the road when we remove more than we want, like, for example, on a scanned document where the scan is off center or something. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?
doctor/lib/utils.py
Outdated
return "\n".join(page_content) | ||
|
||
|
||
def ocr_needed(path: str, content: str, page_count: int) -> [bool, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the checks in this function in roughly performant order, such the fast checks come first, and difficult ones come later? Might be a good idea?
I have to go back thru the rest of your comments and I will provide some sample output but I wanted to address a few things.
|
Change extract from pdf to drop ocr available flag
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, Bill. I think it's pretty good. Nice and tidy on the whole.
I think the main thing I'd like to see are more comments practically everywhere. There are a lot of finicky things in here that will be really hard to work on in the future unless it's commented ad nasuem.
Other than that, the other missing piece is a few lines in the changelog. We should make sure to do that too.
I'm chatting with a customer now that values doctor for its high-speed text extraction. Could we keep pdftotext in this PR, and have a v2 text extractor that has all your improvements? |
I heavily simplified the code and created a NEW pr for it. or am - so im closing this PR |
@mlissner
This PR is meant to improve the extraction of text from PDFs by using a few
additional simple rules to decide if text is extracted appropriately.
Those rules include
Err strings were added for each of these reasons, which should be used in checking if OCR is needed on the CL side.
Previously we would/could identify documents as needing OCR - but also returning the text none the less - so that pages could be missed and CL wouldnt be aware that it might want to OCR the document.
Additionally, an optional flag has been added
skip-margins
as a boolean that can be used tocrop out the 1 inch margins that are required for court opinions as well as skewed stamp text we see in some
courts. This is meant to get the text to represent the text of the opinion.
Tests were updated for the PDF changes and different possible difficult Pdfs were included.
Finally, a change to LXML and html Cleaning was addressed by adding `lxml_html_clean'.