When I highlight text and bring it into the annotation, some of it is bunched up, i.e., there are no spaces between words.
Example: for Wiley Kwan_1999_9987702, every quote saved in Elasticsearch has no spaces. By comparison, the Aldridge 2001 PDF (also from Wiley) keeps spaces within a line but cannot interpret the line-break character (see issue #27).
From Amy:
For example, Wiley articles: Aldridge, Andrus, Knudsen, Odishaw, Robertson, and Simonson 2005 run words together a few times, but all annotations for Parra and Kwan have no spaces between any words. Dixon had no spaces either and highlighted very oddly (blue highlights very broken up, not a solid blue highlight line like the others), so I did not save it.
Analysis:
Reason:
This happens because pdf.js cannot handle white space in scanned PDFs and skips the line-break character when text is selected with the mouse.
Action:
(1) Running OCR on all scanned PDFs would work. The missing line-break characters would be fixed at the same time.
Kwan_1999 works well after OCR
Awni_1995 is a scanned book; part of the article cannot be annotated
Issues
In some cases, OCR may misread content in documents that are visually hard to read
ex. Awni_1995
Zileuton (Ahhotr-64077) is a potent inhibitor of leukotriene biosynthesis (original)
Zileuton (Ahhotr-64077) is cl potent inhibitor of leukotriene bio.,ynthesis (OCR)
ex. Kwan_1999
The concentration of the (R)-and (S)-enantiomers of warfarin in the serum (original)
The concentration of the {R)-and (S)-enantiomers of warfarin in the serum (OCR)
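A minimal sketch of how such OCR differences could be surfaced for the person correcting them, assuming a trusted copy of the original sentence is available for comparison. The function name `ocr_differences` is hypothetical and not part of the annotation tool; it only uses Python's standard `difflib` module.

```python
import difflib


def ocr_differences(original: str, ocr: str) -> list[tuple[str, str]]:
    """Return (original_fragment, ocr_fragment) pairs where the two texts differ."""
    # Character-level comparison; autojunk is disabled so no characters are
    # ignored as "popular" in longer passages.
    matcher = difflib.SequenceMatcher(None, original, ocr, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((original[i1:i2], ocr[j1:j2]))
    return diffs


# The Kwan_1999 example from above.
original = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
ocr = "The concentration of the {R)-and (S)-enantiomers of warfarin in the serum"

for before, after in ocr_differences(original, ocr):
    print(f"{before!r} -> {after!r}")
```

In practice the original text is usually not available in machine-readable form, so this is mainly useful for spot checks against a sentence typed in by hand.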
Workflow:
OCR scanned PDFs
Annotator highlights claim, data, and material in the PDF reader
A second person manually corrects OCR errors in the highlighted text
Add processed PDFs to AP
(2) We need to manually scan through the PDF documents before delivering them to users
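As a possible aid to this manual check, the sketch below flags text that looks "bunched up" so those documents can be looked at first. It assumes the page text has already been extracted by some other tool. The function name `needs_review` and both thresholds are illustrative assumptions, not values taken from this project.

```python
def needs_review(page_text: str,
                 min_space_ratio: float = 0.05,
                 max_word_len: int = 40) -> bool:
    """Heuristically flag extracted text that probably needs OCR or a manual check.

    The thresholds are guesses for English prose and would need tuning.
    """
    text = page_text.strip()
    if not text:
        # No extractable text at all: likely a scanned page without a text layer.
        return True
    spaces = sum(ch.isspace() for ch in text)
    if spaces / len(text) < min_space_ratio:
        # Almost no white space: words have probably been run together.
        return True
    # A single very long "word" also suggests missing spaces.
    longest = max(len(word) for word in text.split())
    return longest > max_word_len


bunched = "Theconcentrationofthe(R)-and(S)-enantiomersofwarfarinintheserum"
normal = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
print(needs_review(bunched), needs_review(normal))
```

A check like this cannot detect OCR misreadings such as `{R)` for `(R)`; it only helps decide which documents to open first.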
Reference:
Detect whether a PDF is scanned
http://blogs.adobe.com/acrolaw/2010/06/how-can-i-detect-if-a-pdf-needs-to-be-ocrd/
OCR correctness
http://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/