PDF recognition #38
So PDF must be recognized when:
For 1), we have to take over the URL if it turns out to be a PDF URL and, instead of processing it with translators, upload it to S3 and trigger further processing. Currently if Then we'll need to update Next, we should limit download size, but with Now for option 2), the client should first get a signed URL from t-s, then upload a file, and then query t-s again.
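A minimal sketch of the download size limit, assuming the response body arrives in chunks. A declared Content-Length can be missing or wrong, so this counts bytes as they arrive. `CappedBuffer` and `maxBytes` are hypothetical names for illustration, not part of translation-server.

```javascript
// Hypothetical helper: collects body chunks and refuses to grow past maxBytes.
class CappedBuffer {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.length = 0;
    this.chunks = [];
  }

  // Add one chunk (Buffer or Uint8Array); throws before exceeding the cap.
  push(chunk) {
    const next = this.length + chunk.length;
    if (next > this.maxBytes) {
      throw new Error(`Download exceeds ${this.maxBytes} bytes`);
    }
    this.chunks.push(chunk);
    this.length = next;
  }

  // Join everything received so far into a single Buffer.
  toBuffer() {
    return Buffer.concat(this.chunks, this.length);
  }
}
```

A caller would feed each chunk of the response into `push()` and abort the request when it throws.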
Hey, I wonder, is there any progress on this?
FYI, a notable software library for extracting metadata from PDFs is grobid: https://github.com/kermitt2/grobid
The https://github.com/zotero/recognizer-server repo is not publicly available, apparently because it isn't self-contained: https://forums.zotero.org/discussion/80101/zotero-service-for-metadata-extraction. What external APIs does the service rely on? Stuff like AWS/GCP/Azure OCR services? Then we could figure out how to make it modular so users could use open source alternatives locally.
Tentative plan:
When downloading a URL, either make a HEAD request first to see if the URL is a PDF or, if possible, gracefully handle PDF downloads in Zotero.HTTP.request() with a maximum download size.
Add another endpoint that accepts PDF data.
Once we have the PDF data, upload that to a new recognizer-server endpoint.
recognizer-server might send the PDF data to a Lambda for pdftotext processing, or it might run in Lambda itself if we move the DB from SQLite to MySQL
translation-server gets back identifiers from recognizer-server, runs translation on them, and returns metadata
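The PDF check in the first step of the plan could look roughly like the sketch below. It tests the Content-Type value returned by a HEAD request and, once some bytes have been downloaded, the `%PDF-` signature that PDF files start with. Both function names are hypothetical and do not come from translation-server.

```javascript
// Hypothetical helpers for illustration; not part of translation-server.

// True if a Content-Type header value declares a PDF,
// ignoring letter case and parameters such as "; charset=...".
function isPdfContentType(contentType) {
  if (typeof contentType !== 'string') return false;
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/pdf';
}

// True if the bytes (Buffer or Uint8Array) begin with "%PDF-".
function hasPdfSignature(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (!bytes || bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}
```

Servers sometimes send a generic or incorrect Content-Type, so the signature check is a useful second test after the header check.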