status command should show if OCR has completed #17

simonw · 2022-06-30T20:47:01Z

This is actually quite difficult.

It turns out the textract-output/JOB_ID folder is created, empty, early in the process. Files named 1, 2 and so on are then added to it, but not all at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.

simonw · 2022-06-30T20:48:32Z

I think the only reliable way of telling if OCR has completed is to call inspect-job:

s3-ocr inspect-job job_id command #15

But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.

I think the most efficient way to do this would be to call the expensive API once per job to check for completion, then update the .s3-ocr.json file for that key to cache the fact that OCR has completed.
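
A minimal sketch of that caching idea, not the actual s3-ocr implementation. It assumes the .s3-ocr.json file holds a job_id field, and the completed field it writes back is a hypothetical name. The s3 and textract arguments are expected to be boto3 clients; get_document_text_detection is used here on the assumption that the jobs were started as text detection jobs.

```python
import json


def ocr_completed(s3, textract, bucket, key):
    """Return True if OCR for `key` has finished, caching a positive answer.

    `s3` and `textract` are boto3 clients passed in by the caller.
    """
    cache_key = key + ".s3-ocr.json"
    body = s3.get_object(Bucket=bucket, Key=cache_key)["Body"].read()
    info = json.loads(body)

    # "completed" is a hypothetical field name used for this sketch.
    if info.get("completed"):
        return True

    # The expensive call: the response also carries the first page of results.
    response = textract.get_document_text_detection(JobId=info["job_id"])
    if response.get("JobStatus") != "SUCCEEDED":
        return False

    # Record completion so later status checks can skip the Textract call.
    info["completed"] = True
    s3.put_object(
        Bucket=bucket,
        Key=cache_key,
        Body=json.dumps(info).encode("utf-8"),
    )
    return True
```

Only a SUCCEEDED status is cached, so jobs that are still in progress get checked again on the next run.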

simonw · 2022-06-30T20:49:55Z

Another option: add a file called key.pdf.s3-ocr-complete.json indicating that OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.
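
A rough sketch of how the marker-file check could look, assuming the marker is named exactly key.pdf.s3-ocr-complete.json as proposed here. The function names iter_keys and ocr_status are made up for illustration, and s3 is expected to be a boto3 S3 client.

```python
COMPLETE_SUFFIX = ".s3-ocr-complete.json"  # marker name proposed in this comment


def iter_keys(s3, bucket, prefix=""):
    """Yield every key in the bucket using paginated list_objects_v2 calls."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that has no keys.
        for item in page.get("Contents", []):
            yield item["Key"]


def ocr_status(keys):
    """Map each PDF key to True if its completion marker is also present."""
    keys = set(keys)
    pdfs = sorted(k for k in keys if k.lower().endswith(".pdf"))
    return {pdf: (pdf + COMPLETE_SUFFIX) in keys for pdf in pdfs}
```

With a real client this would be called as ocr_status(iter_keys(s3, "my-bucket")), where my-bucket is a placeholder. No object bodies are fetched, only key listings.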

Even better: if we change the design so that those JSON files all live in the s3-ocr/ folder instead, we can do a status check with a single listing of every key starting with that prefix, see:

Consider using /s3-ocr/key instead of key.s3-ocr.json #14

simonw added the enhancement New feature or request label Jun 30, 2022
