-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new recap PDF extraction endpoint #190
Conversation
Add new endpoint for recap documents add new file text extraction
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I went back and looked at the comments in #187, and tried to move them here. There's a handful that still aren't resolved.
I would have preferred to have that PR be continued instead of opening a new one because this way I have to do an entirely new review and move things over. It takes a lot more time.
Anyhow, that's done and we've got some comments in the attached. Mostly we still need explanations of heuristic things you're doing and some tests would be helpful too.
The only other thing is that with the new endpoint, we need a bit of documentation:
- Can you please add something to the readme?
And update its doc strings
…ect/doctor into add-recap-extraction
Add detailed information on config str used in OCRing the documents
Add explination for get word func and why we chose mysterious parameters
Simplify the adjust caption and add tests for it Also move it to the end of the function to avoid any whitespace fixes that might affect it
for more information, see https://pre-commit.ci
…ect/doctor into add-recap-extraction
@mlissner a second pass would be appreciated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, now we're on the homestretch. I found a few little tweaks, and we still need an update to the readme, but nothing substantive.
The tests are a huge help to both understanding the code and making sure it doesn't regress. Thanks!
Small tweaks to doc strings Update func names slight refactor of adjust caption lines
for more information, see https://pre-commit.ci
Thanks for the comments @mlissner - back to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the follow through as always!
Suspect IssuesThis pull request was deployed and Sentry observed the following issues:
Did you find this useful? React with a 👍 or 👎 |
@mlissner
This PR adds a new api endpoint specifically for RECAP Documents.
Instead of versioning the document extraction we streamlined a new API endpoint for RECAP documents.
They don't come as RTF or Txt or mp3 so it just handles PDFs.
This endpoint also drops
ocr_available
flag as it will no longer be useful