[PDF] quote in form editor run words together #227

Open
ningyifan opened this issue Oct 23, 2017 · 0 comments

ningyifan commented Oct 23, 2017

When I highlight text and bring it into the annotation, some of the words are bunched up, i.e., there are no spaces between them.

Example: in Wiley Kwan_1999_9987702, every quote saved in Elasticsearch has no spaces. By comparison, the PDF of the Aldridge 2001 article (also from Wiley) keeps spaces within a line but can't interpret the line-break character (issue #27).

From Amy:
For example, among the Wiley articles, Aldridge, Andrus, Knudsen, Odishaw, Robertson, and Simonson 2005 run words together a few times, but all annotations for Parra and Kwan have no spaces between any words. Dixon had no spaces either and highlighted very oddly (the blue highlights were very broken up, not a solid blue highlight line like the others), so I did not save it.
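
To find which saved quotes are affected, a rough check like the one below might help. It is only a sketch: the function name `looks_run_together` and the 25-character threshold are assumptions for illustration, not part of the project, and the check simply flags a quote when any whitespace-separated token is implausibly long.

```python
def looks_run_together(quote: str, max_token_len: int = 25) -> bool:
    """Heuristic: True if any whitespace-separated token is longer than max_token_len.

    max_token_len is an assumed threshold; tune it against real quotes.
    """
    tokens = quote.split()
    if not tokens:
        return False  # an empty quote has no words to run together
    return max(len(token) for token in tokens) > max_token_len


good = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
bad = good.replace(" ", "")  # simulate a quote saved without spaces
print(looks_run_together(good), looks_run_together(bad))
```

A quote where only a few words run together (as reported for Aldridge) can slip under the threshold, so this would only catch the fully space-free cases such as Kwan and Parra.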

Analysis:
Reason:
pdf.js can't handle white space in scanned PDFs, and it skips the line-break character when text is selected with the mouse.
Action:
(1) Running OCR on all scanned PDFs should work. The missing line-break character is fixed at the same time.
Kwan_1999 works well after OCR.
Awni_1995 is a scanned book; part of the article still cannot be annotated.

Issues
In some cases, OCR may misread content in documents that are visually hard to read.

ex. Awni_1995
Zileuton (Ahhotr-64077) is a potent inhibitor of leukotriene biosynthesis (original)
Zileuton (Ahhotr-64077) is cl potent inhibitor of leukotriene bio.,ynthesis (OCR)

ex. Kwan_1999
The concentration of the (R)-and (S)-enantiomers of warfarin in the serum (original)
The concentration of the {R)-and (S)-enantiomers of warfarin in the serum (OCR)
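
To make mismatches like the ones above easier to spot during manual correction, the two versions of a quote could be compared character by character. This is a minimal sketch using Python's standard `difflib`; the helper name `ocr_differences` is hypothetical, and it assumes the reviewer already has a trusted version of the text to compare against (for example, typed from the page).

```python
from difflib import SequenceMatcher


def ocr_differences(original: str, ocr: str) -> list[tuple[str, str]]:
    """Return (original_fragment, ocr_fragment) pairs where the strings differ."""
    matcher = SequenceMatcher(None, original, ocr, autojunk=False)
    return [
        (original[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]


original = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
ocr = "The concentration of the {R)-and (S)-enantiomers of warfarin in the serum"
print(ocr_differences(original, ocr))  # → [('(', '{')]
```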

Workflow:

  1. OCR scanned PDFs
  2. Annotator highlights claim, data, and material in the PDF reader
  3. A second person manually corrects OCR errors in the highlighted text
  4. Add processed PDFs to AP

(2) We need to manually scan through PDF documents before delivering them to users.
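
Part of the screening in (2) might be automated with a pre-check on the text extracted from each PDF. The sketch below assumes the per-page text has already been pulled out with some PDF library (that step is not shown), and the name `likely_needs_ocr` and both thresholds are assumptions for illustration. It flags a document when the text layer is nearly empty or contains almost no whitespace.

```python
def likely_needs_ocr(page_texts, min_chars=20, min_space_ratio=0.05):
    """Heuristic: True if the extracted text layer looks missing or space-free.

    page_texts: list of strings, one per page, from any PDF text extractor.
    min_chars and min_space_ratio are assumed thresholds, not tested values.
    """
    text = "".join(page_texts).strip()
    if len(text) < min_chars:
        return True  # little or no text layer: probably an image-only scan
    spaces = sum(ch.isspace() for ch in text)
    return spaces / len(text) < min_space_ratio  # text present but words run together


print(likely_needs_ocr(["", ""]))
print(likely_needs_ocr(["The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"]))
```

A check like this would not replace the manual pass, since OCR misreadings and oddly broken highlights (as with Dixon) still need a person to look at the document.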

Reference:
Detecting whether a PDF is scanned:
http://blogs.adobe.com/acrolaw/2010/06/how-can-i-detect-if-a-pdf-needs-to-be-ocrd/

Correcting OCR errors:
http://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/

@ningyifan ningyifan added the bug label Oct 23, 2017
@ningyifan ningyifan self-assigned this Oct 23, 2017