-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PDF handling #59
base: master
Are you sure you want to change the base?
Add PDF handling #59
Conversation
A small DEMO to upload PDF file and get recognized data.
|
src/webSession.js
Outdated
); | ||
if(req.type === 'document') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after keywords
"params": { | ||
"Bucket": "" | ||
}, | ||
"accessKeyId": "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zotero house style: accessKeyID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are just passing the whole s3Upload
object as a S3 config. It's useful because there can be other configuration options that are needed to pass i.e. region
. But the variable naming doesn't depend on us.
src/http.js
Outdated
}, function(error, response, body) { | ||
if (error) { | ||
return reject(error); | ||
} | ||
|
||
if(!response.headers['content-type']) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after keywords
src/http.js
Outdated
@@ -208,11 +223,23 @@ Zotero.HTTP = new function() { | |||
} | |||
} | |||
} | |||
else if( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after keywords
src/webSession.js
Outdated
let uploadID = await Recognizer.upload(req.response); | ||
let item = await Recognizer.recognize(uploadID); | ||
this.ctx.response.body = Zotero.Utilities.itemToAPIJSON(item); | ||
await Recognizer.remove(uploadID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like here and in handleProcess
we should do the remove
in a try
/catch
within a finally
so that the file is removed even if something fails. (We'll have a lifecycle policy on the bucket, but we should still try to clean up properly.)
src/server.js
Outdated
|
||
const app = module.exports = new Koa(); | ||
app.use(cors); | ||
app.use(bodyParser({ enableTypes: ['text', 'json']})); | ||
app.use(_.post('/web', WebEndpoint.handle.bind(WebEndpoint))); | ||
app.use(_.post('/search', SearchEndpoint.handle.bind(SearchEndpoint))); | ||
app.use(_.post('/export', ExportEndpoint.handle.bind(ExportEndpoint))); | ||
app.use(_.post('/recognize/getUploadParams', Recognizer.handleUpload.bind(Recognizer))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this isn't actually creating state or doing any work, this could be a GET instead of a POST. And then just GET /recognize/uploadParams
.
@mrtcode this sounds like an important feature. If I understand correctly, this would be able to extract citation metadata for PDF URLs such as https://openreview.net/pdf?id=BkeCW-q6aQ. I don't understand how this functionality relates to the |
@dhimmel this feature allows to have the same PDF recognition functionality as we have now on Zotero client. So if the client recognizes this PDF,
|
Thanks for the info.
We're setting up a translation-server for Manubot users as per manubot/manubot#82. We'd prefer to have PDF recognition occur on the same server to minimize the setup complexity. What's the reasoning behind having recognizer-server operate on AWS Lambda versus wherever is running the translation-server? |
We're trying to use Lambda for translation-server too. |
(And Lambda has an input size limit, so we can't pass the file straight to recognizer-server (or to translation-server for direct PDF uploading).) |
Ah I didn't realize that AWS Lambda / s3 was going to become a dependency for full functionality of the codebase. Would it be possible for there to be an option to run Currently, we're running translation-server on a Google Cloud instance, but would also like to get individual Manubot instances (i.e. end users) running their own translation-servers behind the scene. |
It isn't. translation-server itself still runs fine on Node, and will continue to do so. recognizer-server requires a database with ongoing maintenance, so it adds a lot more complexity than translation-server on its own (and certainly wouldn't work for an end-user install). PDF recognition has never been a part of translation-server up to now, and like text search (a.k.a. identifier-search), which also requires a separate database, this just isn't something that we're considering part of public functionality at this time. As in the Zotero client, the recommended way to save is to save from article pages, not from PDFs directly. For your example PDF, that would mean saving from https://openreview.net/forum?id=BkeCW-q6aQ instead of the PDF URL. |
Great. Didn't realize the PDF recognition had such complexity (i.e. the separate database requirement). Perhaps at some point in the future, we can take a second look to see if a simplified PDF recognition workflow is possible.
We will make this recommendation to Manubot users as well. Although there are some PDFs, for example https://bitcoin.org/bitcoin.pdf, that are highly cited with no HTML page substitute. |
This is merged now with the changes made by #69, also changed |
If the remote URL is text/plain or one of the known content types for the formats we support (e.g., application/x-bibtex), try to handle it via import translation. This allows, for example, a remote BibTeX file to be translated. This adapts some of the code from #59 in a more general way, with a responseTypeMap property that can be passed to HTTP.request().
PR dead or resuscitable? |
Contains revert of 971b815 which messed with jenkin's docker build. Update to cf8fb42 Squashed commit of the following: commit cf8fb42 Author: Dan Stillman <[email protected]> Date: Wed Jul 10 05:09:58 2019 -0400 Update version commit 0c0bc8e Author: Dan Stillman <[email protected]> Date: Mon Jul 8 22:34:48 2019 -0400 Make the fix in 6f3afe0 even simpler (zotero#105, zotero#106) We already converted the object to a string, so we don't need to do it again. commit 6f3afe0 Author: Dan Stillman <[email protected]> Date: Mon Jul 8 22:28:17 2019 -0400 Fix multiple translation for translators that use preselection Some translators, such as PubMed, return objects with 'title' and 'checked' (based on checkboxes on the page) instead of strings for multiple selection. We only want the titles for item comparison. Fixes zotero#105, closes zotero#106 commit ae47eb0 Author: Dan Stillman <[email protected]> Date: Wed Jun 26 02:52:50 2019 -0400 Update submodules commit a95cf74 Author: Dan Stillman <[email protected]> Date: Tue Jun 25 03:49:12 2019 -0400 Update README Including instructions for running from a Docker Hub image commit f1c0b6a Author: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Date: Mon Jun 10 15:37:52 2019 -0400 Bump js-yaml from 3.13.0 to 3.13.1 (zotero#104) commit 971b815 Author: Stéphane Gully <[email protected]> Date: Mon Jun 10 21:32:41 2019 +0200 Docker (zotero#24) commit 8a39f2b Author: Dan Stillman <[email protected]> Date: Mon Jun 10 15:30:42 2019 -0400 Update translators commit 4e2df9b Author: Dan Stillman <[email protected]> Date: Fri May 24 00:19:02 2019 -0400 Fix invalid "TZ" in BibTeX urldate when accessDate includes a timestamp BibTeX is still a legacy-mode translator that expects SQL dates instead of ISO 8601 dates. commit 1295985 Author: Dan Stillman <[email protected]> Date: Thu Apr 4 22:57:59 2019 -0400 Correct link to valid export formats in README commit 7fa75d4 Author: Dan Stillman <[email protected]> Date: Thu Apr 4 18:04:01 2019 -0400 HTTP → HTTPS in package-lock.json entry commit 2dfd9dc Author: Dan Stillman <[email protected]> Date: Thu Apr 4 12:53:06 2019 -0400 Translate import formats via /web If the remote URL is text/plain or one of the known content types for the formats we support (e.g., application/x-bibtex), try to handle it via import translation. This allows, for example, a remote BibTeX file to be translated. This adapts some of the code from zotero#59 in a more general way, with a responseTypeMap property that can be passed to HTTP.request(). commit c5dd0ed Author: Dan Stillman <[email protected]> Date: Thu Apr 4 12:50:16 2019 -0400 Inherit ESLint config and fix some lint errors commit 7c6a2df Author: Antonin Delpeuch <[email protected]> Date: Mon Apr 1 20:43:53 2019 +0100 Make the listening host configurable (zotero#98) commit 083f239 Author: Emiliano Heyns <[email protected]> Date: Mon Apr 1 21:41:57 2019 +0200 Import tests: normalize some fields, compare order (zotero#100) commit 6045c70 Author: Dan Stillman <[email protected]> Date: Tue Mar 12 17:20:12 2019 -0400 Set a timeout for text search after which results so far are returned commit d268bc7 Author: Dan Stillman <[email protected]> Date: Tue Mar 12 17:19:21 2019 -0400 Fix sending of Link header for text search results commit acff4e3 Author: Dan Stillman <[email protected]> Date: Mon Mar 11 06:00:17 2019 -0400 Use --data-binary in correct README example commit a03f317 Author: Dan Stillman <[email protected]> Date: Mon Mar 11 05:59:10 2019 -0400 Use --data-binary in /import example in README commit eea4a6b Author: Emiliano Heyns <[email protected]> Date: Mon Mar 11 10:02:20 2019 +0100 Verify translator used for import test (zotero#90) commit d7e575b Author: Dan Stillman <[email protected]> Date: Sun Mar 10 19:20:17 2019 -0400 Fix response format from /import The result should be API JSON, not translator JSON. If we want to use translator tests, we'll need to convert the test JSON to API JSON for the comparison. Follow-up to zotero#89 commit 72d2312 Author: Philipp Zumstein <[email protected]> Date: Sun Mar 10 23:41:02 2019 +0100 Add description about import endpoint to README.md (zotero#91) commit 807a821 Author: Emiliano Heyns <[email protected]> Date: Sun Mar 10 22:38:12 2019 +0100 Remove 'translatorID' parameter from /import (zotero#94) Fixes zotero#93 commit a10f068 Author: Dan Stillman <[email protected]> Date: Fri Mar 8 16:51:37 2019 -0500 Temporarily disable import tests Existing tests need to be updated for these to pass. commit 8ca1803 Author: Emiliano Heyns <[email protected]> Date: Fri Mar 8 22:43:50 2019 +0100 add import support (zotero#89) commit 3f0386d Author: Dan Stillman <[email protected]> Date: Fri Mar 1 03:18:02 2019 -0500 Update translators submodule commit 1b49afe Author: Dan Stillman <[email protected]> Date: Fri Mar 1 03:17:18 2019 -0500 Update Zotero submodule Fixes zotero#88 commit b5dc510 Author: Martynas Bagdonas <[email protected]> Date: Tue Feb 26 00:24:53 2019 +0100 Update packages (zotero#86) commit 7bce4a3 Author: Dan Stillman <[email protected]> Date: Mon Feb 25 06:57:41 2019 -0500 Add (currently failing) test for TEI export commit 4522b3e Author: Dan Stillman <[email protected]> Date: Mon Feb 25 06:53:41 2019 -0500 Set item key to 'uri' property for export Zotero.Utilities.Internal.itemToExportFormat() adds a URI to the export data passed to translators so that they can generate a consistent identifier if necessary. Most translators don't need this but at least some (e.g., TEI) do. Unfortunately we can't build a proper item URI because we don't have user/group info, so we just use the item key instead. commit 2e47a42 Author: Martynas Bagdonas <[email protected]> Date: Thu Feb 7 18:45:26 2019 +0100 Update wicked-good-xpath to fix incorrect attribute node order (zotero#84) commit 9b018f3 Author: Dan Stillman <[email protected]> Date: Wed Feb 6 23:00:19 2019 -0500 Replace lambda_deploy with lambda_package commit e0e31ec Author: Dan Stillman <[email protected]> Date: Wed Feb 6 22:57:30 2019 -0500 Update translators commit fd4d94c Author: Dan Stillman <[email protected]> Date: Sat Feb 2 13:05:05 2019 -0500 Test runner improvements - Grep on id in addition to label - Make -o (output file) optional - Return non-0 exit code if any tests failed Addresses zotero/translators#1310 commit 8b83a47 Author: Martynas Bagdonas <[email protected]> Date: Wed Jan 30 08:26:20 2019 +0100 A temporary workaround for JSDOM content decoding issues (zotero#81) commit 3d11fc9 Author: Dan Stillman <[email protected]> Date: Mon Jan 14 15:32:29 2019 -0500 More fixes for exporting minimal data without .creators or .tags Addresses zotero#75 commit 9339404 Author: Adomas Venčkauskas <[email protected]> Date: Mon Jan 14 17:18:36 2019 +0200 Fix a json legacy export bug (regression e02ee7a). Closes zotero#75 commit 106c721 Author: Dan Stillman <[email protected]> Date: Thu Jan 10 21:38:10 2019 -0500 Add test for Bibliontology RDF export Follow up to zotero#73 commit e02ee7a Author: Adomas Venčkauskas <[email protected]> Date: Thu Jan 10 14:40:57 2019 +0200 Fix pre 4.0.27 translator exports. Closes zotero#73 commit 158ebc6 Author: Martynas Bagdonas <[email protected]> Date: Wed Jan 9 22:11:35 2019 -0500 Restore DOI from URL extraction (zotero#72) commit 5848d9e Author: Dan Stillman <[email protected]> Date: Mon Jan 7 03:19:36 2019 -0500 Allow "pmid:" prefix in PMIDs Closes zotero#71 commit f8158ec Author: Dan Stillman <[email protected]> Date: Mon Jan 7 02:57:40 2019 -0500 Update translators submodule commit 33e971d Author: Dan Stillman <[email protected]> Date: Thu Jan 3 23:05:47 2019 -0500 Update config file pattern in .gitignore commit 8dc70dc Author: Dan Stillman <[email protected]> Date: Thu Jan 3 23:04:26 2019 -0500 Fix CORS via API Gateway commit ef64844 Author: Dan Stillman <[email protected]> Date: Sat Dec 29 19:43:55 2018 -0500 Add Travis config commit 5cf698e Author: Dan Stillman <[email protected]> Date: Sun Dec 23 02:37:48 2018 -0500 Return 400 for missing or unsupported upstream content type Follow-up to zotero#69 commit 9f2c366 Author: Martynas Bagdonas <[email protected]> Date: Sun Dec 23 01:02:07 2018 -0600 Limit response size (zotero#69) commit 757e21c Author: Dan Stillman <[email protected]> Date: Fri Dec 21 06:06:35 2018 -0500 Make translators directory customizable via environment variable Set TRANSLATORS_DIR to override the location in the config. (This is useful for provo, where translation-server is a submodule, so a local.json would be sort of awkward, and we want to share a translators directory at the provo root.) commit e1457a9 Author: Dan Stillman <[email protected]> Date: Fri Dec 21 02:41:55 2018 -0500 Fix error when exporting notes as CSL-JSON (But don't export notes as CSL-JSON) Reported in zotero#67 commit 3207162 Author: Dan Stillman <[email protected]> Date: Fri Dec 21 02:24:17 2018 -0500 Fix missing parentItem in child notes As shown in zotero#67 commit ff54ad6 Author: Martynas Bagdonas <[email protected]> Date: Mon Dec 17 09:37:05 2018 +0000 Use JSDOM from NPM (zotero#64) commit 9f70ccc Author: Dan Stillman <[email protected]> Date: Sun Dec 16 22:20:15 2018 -0500 Update translators commit 2b2647d Author: Dan Stillman <[email protected]> Date: Sun Dec 16 21:33:46 2018 -0500 Fix COinS export and add Zotero RDF test commit 0458529 Author: Dan Stillman <[email protected]> Date: Sun Dec 16 21:01:41 2018 -0500 Update translators submodule commit 57908ef Author: Dan Stillman <[email protected]> Date: Fri Dec 14 22:59:06 2018 -0500 Update package-lock.json (HTTP -> HTTPS) commit cd8b177 Author: Martynas Bagdonas <[email protected]> Date: Fri Dec 14 22:38:26 2018 +0000 Install precompiled JSDOM module to speed up installation from commit (zotero#62) commit 3c80836 Author: Dan Stillman <[email protected]> Date: Thu Dec 13 16:14:47 2018 -0500 Update package-lock.json commit 88228e5 Author: Martynas Bagdonas <[email protected]> Date: Thu Dec 13 19:04:43 2018 +0000 Remove the temporary JSDOM slowdown workaround and use the fixed JSDOM (zotero#61) commit 8b8355a Author: Dan Stillman <[email protected]> Date: Thu Dec 13 02:18:45 2018 -0500 Add local.json to .gitignore commit 517e264 Author: Dan Stillman <[email protected]> Date: Thu Dec 13 02:16:15 2018 -0500 Fix CSL JSON export when there's an Extra value Fixes zotero/zotero#1604 commit 6160d86 Author: Dan Stillman <[email protected]> Date: Wed Dec 12 23:38:25 2018 -0500 Fix text search test Fixes zotero#60 commit 802d3d8 Author: Dan Stillman <[email protected]> Date: Tue Dec 4 02:59:17 2018 -0700 Update translators submodule commit e763651 Author: Dan Stillman <[email protected]> Date: Tue Dec 4 02:55:32 2018 -0700 Forward client Accept-Language header during web translation Addresses zotero#16 commit 6464497 Author: Dan Stillman <[email protected]> Date: Sat Nov 17 02:39:08 2018 -0500 Fix CSL JSON export Not sure if this didn't work after d1bc532 or something changed since, but this seems to work now. Fixes zotero#56 commit e03e4f1 Author: Daniel Himmelstein <[email protected]> Date: Fri Nov 16 17:35:50 2018 -0500 README installation: touch-up commands (zotero#55) Use long arguments so new users know what they are doing. git clone: default behavior is to clone into `translation-server` directory. commit 653a767 Author: Dan Stillman <[email protected]> Date: Fri Nov 16 17:16:02 2018 -0500 Document proxy support Closes zotero#54 commit aed6c83 Author: Martynas Bagdonas <[email protected]> Date: Fri Nov 16 17:56:12 2018 +0800 Use version from install.rdf (zotero#53) commit f3436f2 Author: Martynas Bagdonas <[email protected]> Date: Fri Nov 16 17:21:12 2018 +0800 Fix gzip problem and missing package.json (zotero#52) commit 286f65c Author: Martynas Bagdonas <[email protected]> Date: Tue Nov 13 05:43:14 2018 +0800 Translate doi.org URL instead of extracting DOI (zotero#50) commit a185c5c Author: Adomas Venčkauskas <[email protected]> Date: Fri Oct 26 16:33:16 2018 +0300 Fix meta redirect bugs when meta tag is malformed commit 9c15114 Author: Adomas Venčkauskas <[email protected]> Date: Fri Oct 19 13:50:41 2018 +0300 Add translator tester commit edfe9e5 Author: Martynas Bagdonas <[email protected]> Date: Sat Nov 3 01:53:24 2018 +0800 A workaround for JSDOM slowdown for some websites (zotero#47) commit 637ce02 Author: Dan Stillman <[email protected]> Date: Mon Oct 29 06:01:07 2018 -0400 Disable text search with ?text=0 commit 29b515a Author: Dan Stillman <[email protected]> Date: Sat Oct 27 02:53:53 2018 -0400 Automatically prepend 'http://' for URL-ish strings for /web requests commit 1d15419 Author: Dan Stillman <[email protected]> Date: Sat Oct 27 01:42:00 2018 -0400 Return CORS headers on errors Koa normally strips headers on errors commit f952cc4 Author: Dan Stillman <[email protected]> Date: Thu Oct 25 00:59:55 2018 -0400 Remove debug lines commit 21332ad Author: Dan Stillman <[email protected]> Date: Tue Oct 23 02:50:11 2018 -0400 Fix webpage fallback commit 01aa1e4 Author: Dan Stillman <[email protected]> Date: Tue Oct 23 02:48:20 2018 -0400 Update submodules commit 1b5d429 Author: Martynas Bagdonas <[email protected]> Date: Tue Oct 23 06:59:17 2018 +0800 Re-translate multi item URL if cached session is not found (zotero#46) commit bc7a18e Author: Dan Stillman <[email protected]> Date: Mon Oct 22 03:59:24 2018 -0400 Support ?single=1 to force single-page saving Closes zotero#45 commit f170ad3 Author: Dan Stillman <[email protected]> Date: Sun Oct 21 03:19:44 2018 -0400 Fix WebSession object name commit cffd535 Author: Dan Stillman <[email protected]> Date: Mon Oct 8 01:09:37 2018 -0400 Update package-lock.json commit 15a2961 Author: Dan Stillman <[email protected]> Date: Mon Oct 8 01:09:00 2018 -0400 Support allowedOrigins config option for CORS commit ad454e0 Author: Dan Stillman <[email protected]> Date: Tue Oct 2 15:30:47 2018 -0400 Add AWS SDK This is available by default on Lambda but without it Node installs break (even though they don't use it). commit 19bd71a Author: Dan Stillman <[email protected]> Date: Tue Oct 2 15:18:43 2018 -0400 Add missing config variable commit 73b99ff Author: Dan Stillman <[email protected]> Date: Tue Oct 2 15:17:33 2018 -0400 Fix "TLDS is not defined" error commit 632ea8d Author: Martynas Bagdonas <[email protected]> Date: Tue Oct 2 17:07:36 2018 +0800 Invoke identifier-search Lambda function (zotero#42) Bug: T223115 Change-Id: I5ec66601c84b085f2a82e76a85223a45f52ed9b2
Solves #38. Recognizes metadata when a PDF file is uploaded or a PDF URL is passed to
/web
endpoint. For this to work recognizer-server lambda branch has to be deployed.