Add new recap PDF extraction endpoint #190

flooie · 2024-05-28T21:42:09Z

This PR adds a new API endpoint specifically for RECAP documents.

Instead of versioning the document extraction, we added a separate, streamlined API endpoint for RECAP documents.
They don't come as RTF, TXT, or MP3, so it just handles PDFs.
This endpoint also drops the ocr_available flag, as it will no longer be useful.
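
To illustrate the PDF-only assumption, a caller could cheaply check that a file looks like a PDF before sending it for extraction. This is a minimal sketch, not code from this PR: `looks_like_pdf` is a hypothetical helper, and the 1024-byte search window is an assumption based on how lenient PDF readers commonly locate the `%PDF-` header.

```python
def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Return True if the `%PDF-` marker appears near the start of the data.

    Hypothetical helper for illustration only. It does not validate the
    file; a truncated or corrupt PDF can still pass this check.
    """
    return b"%PDF-" in data[:window]
```

A check like this only filters out obvious non-PDF input such as RTF or plain text; the parser can still reject a file that passes it.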

mlissner

Cool. I went back and looked at the comments in #187, and tried to move them here. There's a handful that still aren't resolved.

I would have preferred to have that PR be continued instead of opening a new one because this way I have to do an entirely new review and move things over. It takes a lot more time.

Anyhow, that's done and we've got some comments in the attached. Mostly we still need explanations of heuristic things you're doing and some tests would be helpful too.

The only other thing is that with the new endpoint, we need a bit of documentation:

Can you please add something to the readme?

doctor/tasks.py

doctor/lib/text_extraction.py

flooie · 2024-05-29T19:55:00Z

@mlissner a second pass would be appreciated

mlissner

OK, now we're on the homestretch. I found a few little tweaks, and we still need an update to the readme, but nothing substantive.

The tests are a huge help to both understanding the code and making sure it doesn't regress. Thanks!

doctor/lib/text_extraction.py

flooie · 2024-05-30T15:52:21Z

Thanks for the comments @mlissner - back to you

mlissner

LGTM. Thanks for the follow through as always!

sentry-io · 2024-06-05T19:55:48Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ ValueError: Bounding box (0, 144.0, 1224, 1440.0) is not fully within parent page bounding box (0, 0, 1224, 792) (/extract/recap/text/)
‼️ PSEOF: Unexpected EOF (/extract/recap/text/)
‼️ PDFSyntaxError: No /Root object! - Is this really a PDF? (/extract/recap/text/)
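
The first issue reports a requested box that extends past the page. One possible mitigation is to clamp the requested box to the page box before cropping. The sketch below assumes boxes are `(x0, top, x1, bottom)` tuples, as the error message suggests. `clamp_bbox` is a hypothetical helper, not part of this PR or of any library.

```python
from typing import Tuple

# Assumed layout, inferred from the error message: (x0, top, x1, bottom).
BBox = Tuple[float, float, float, float]


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> BBox:
    """Shrink `bbox` so that it lies fully inside `page_bbox`.

    Raises ValueError if the two boxes do not overlap at all, since
    there would be nothing left to crop.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (
        max(x0, px0),
        max(top, ptop),
        min(x1, px1),
        min(bottom, pbottom),
    )
    # An empty or inverted box means the request missed the page entirely.
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError(f"Bounding box {bbox} does not overlap page {page_bbox}")
    return clamped


# Values taken from the Sentry report above.
print(clamp_bbox((0, 144.0, 1224, 1440.0), (0, 0, 1224, 792)))
```

Clamping hides the mismatch rather than explaining it, so it may be worth logging the original box as well. The other two issues look like truncated or non-PDF input, which clamping does not address.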

flooie and others added 4 commits May 28, 2024 17:37

feat(text_extraction): Add text extraction api

35c25d2

Add new endpoint for recap documents; add new file text extraction

tests(recap) Add recap tests for new endpoint

bd2c7cd

docs(DEVELOPMENT) Fix doc docker call

c4514eb

[pre-commit.ci] auto fixes from pre-commit.com hooks

5cf0869

for more information, see https://pre-commit.ci

flooie changed the title ~~Add recap extraction~~ Add new recap PDF extraction endpoint May 28, 2024

flooie requested a review from mlissner May 28, 2024 21:42

mlissner requested changes May 29, 2024

flooie and others added 13 commits May 29, 2024 09:53

fix(text): rename deskew to is_skewed

b02b7aa

And update its doc strings

Merge branch 'add-recap-extraction' of https://github.com/freelawproj…

0078817

…ect/doctor into add-recap-extraction

fix(text): Update docstrings ocr image to data

bd3aace

Add detailed information on config str used in OCRing the documents

fix(text_extraction): Explain get_word

c246bef

Add explanation for get word func and why we chose mysterious parameters

fix(text_extraction): Update formatting and docstrings

f9c0b3d

feat(tasks): Drop mojibake fix as unlikely to be needed

8d2dcbf

fix(adjust_caption): Update adjust caption

6d7fe01

Simplify the adjust caption and add tests for it. Also move it to the end of the function to avoid any whitespace fixes that might affect it.

tests(text_extraction): Add unit tests for new methods

0b30bb9

[pre-commit.ci] auto fixes from pre-commit.com hooks

1078bb9

for more information, see https://pre-commit.ci

test(caption Adjustment): Add new test class

0a2cd5e

Merge branch 'add-recap-extraction' of https://github.com/freelawproj…

098ef99

…ect/doctor into add-recap-extraction

test(workflows) Add v3.11 and v3.12 to tests

9c1c44c

test(adjustment) Add fix for test

7a29189

flooie requested a review from mlissner May 29, 2024 19:55

mlissner requested changes May 30, 2024

flooie and others added 3 commits May 30, 2024 11:03

docs(readme) Add endpoint updates

0d12f95

fix(text_extract) Updates from PR

5d19f57

Small tweaks to doc strings. Update func names. Slight refactor of adjust caption lines.

[pre-commit.ci] auto fixes from pre-commit.com hooks

5801d2d

for more information, see https://pre-commit.ci

flooie added 2 commits May 30, 2024 11:57

fix(text): Fix variable value

979adf3

fix(tests): Remove print in tests

47d4a04

mlissner approved these changes May 30, 2024

mlissner merged commit 5f30530 into main May 30, 2024
10 checks passed

mlissner deleted the add-recap-extraction branch May 30, 2024 16:20
