Christopher James edited this page Aug 12, 2014 · 17 revisions

The following is a collection of tools known to scrape text-selectable PDFs.

Listing of tools

  • Sunlight Foundation's tools
  • List here
  • Other here
  • cometdocs.com - ToS requires that you own what you upload, FWIW
  • pdftoexcelonline.com - ToS just requires that you respect IP rights
  • zamzar.com - ToS says to respect copyright
  • tabula.nerdpower.org - scrapes tabular data to CSV or Excel
  • scraperwiki.com - not free, scrapes tabular data to Excel
  • [http://virantha.com/2013/07/22/pyocr-a-python-script-for-running-free-ocr-on-your-pdfs/] - PyPDFOCR (a Python script that uses Google's Tesseract OCR engine). In its conversion process, it produces hOCR files, which should be machine-readable.
  • I will look for more here: PDF Scrapers on GitHub (nwhysel)
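
For anyone wondering what "machine-readable" means for the hOCR files mentioned above: hOCR is HTML in which each recognized word typically sits in a span with class ocrx_word, and its bounding box is stored in the title attribute. The following is a minimal sketch of pulling words and boxes out of such a file using only the Python standard library. The sample markup, the HocrWordParser class, and the parse_bbox and extract_words helpers are all made up for illustration; this is not actual PyPDFOCR output, and real files may carry more structure than this handles.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment for illustration only (not real PyPDFOCR output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 190 116; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    """Parse hOCR markup and return a list of (word, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE_HOCR):
        print(word, bbox)
```

To try it on a real file, read the hOCR file into a string and pass it to extract_words. The bounding boxes are pixel coordinates on the page image, which is what makes it possible to group words into rows and columns afterwards.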