It turns out the `textract-output/JOB_ID` folder is created, empty, early on in the process. Then files called `1` and `2` and so on are added to it, but they're not all added at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.
But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.
I think the most efficient way to do this would be to check the expensive API for completion of each job, but then to update the .s3-ocr.json file for that key to cache the fact that we know that OCR has completed.
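A minimal sketch of that caching idea, not the project's actual implementation. It assumes the `.s3-ocr.json` file for a key is stored at `key + ".s3-ocr.json"` and holds a `job_id` field, and it invents a `complete` field to record the cached result; all three are assumptions. The `s3` and `textract` arguments are expected to be boto3 clients, and `MaxResults=1` is passed to keep the response from the status call small.

```python
import json


def is_ocr_complete(s3, textract, bucket, key):
    """Return True if OCR has finished for `key`, caching a positive answer in S3."""
    marker_key = key + ".s3-ocr.json"  # assumed naming for the per-key JSON file
    body = s3.get_object(Bucket=bucket, Key=marker_key)["Body"].read()
    marker = json.loads(body)

    # Cheap path: an earlier run already recorded completion.
    if marker.get("complete"):  # "complete" is a hypothetical field
        return True

    # Expensive path: ask Textract, requesting as little data as possible.
    response = textract.get_document_text_detection(
        JobId=marker["job_id"],  # "job_id" is an assumed field name
        MaxResults=1,
    )
    if response["JobStatus"] != "SUCCEEDED":
        return False

    # Cache the result so later checks never call Textract for this key.
    marker["complete"] = True
    s3.put_object(
        Bucket=bucket,
        Key=marker_key,
        Body=json.dumps(marker).encode("utf-8"),
    )
    return True
```

Only a `SUCCEEDED` status is cached, so jobs that are still in progress are re-checked on the next run.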
Another option: add a file called key.pdf.s3-ocr-complete.json indicating the OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.
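The marker-file approach could look roughly like this. It is a sketch under stated assumptions: `s3` is a boto3 S3 client, the marker for `key.pdf` is named `key.pdf.s3-ocr-complete.json` as described above, and `completed_keys` is a hypothetical helper name. The optional `prefix` argument narrows the listing if the markers live under a common prefix.

```python
COMPLETE_SUFFIX = ".s3-ocr-complete.json"


def completed_keys(s3, bucket, prefix=""):
    """Return the set of keys that have a completion marker, using only list calls."""
    paginator = s3.get_paginator("list_objects_v2")
    done = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing carry no "Contents" entry.
        for obj in page.get("Contents", []):
            name = obj["Key"]
            if name.endswith(COMPLETE_SUFFIX):
                # Strip the suffix to recover the original document key.
                done.add(name[: -len(COMPLETE_SUFFIX)])
    return done
```

Because this only lists keys, the cost scales with the number of objects in the listing rather than with one GET per document.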
Even better: if we change the design so that those JSON files all live in the `s3-ocr/` folder instead, we can do a status check with a single listing of every key starting with that prefix, see:
This is actually quite difficult.