Scraping tools
Christopher James edited this page Aug 12, 2014 · 17 revisions
Listing of tools
- Sunlight Foundation's tools
- List here
- Other here
- cometdocs.com - ToS requires that you own what you upload, FWIW
- pdftoexcelonline.com - ToS just requires that you respect IP rights
- zamzar.com - ToS says to respect copyright
- tabula.nerdpower.org - scrapes tabular data to CSV or Excel
- scraperwiki.com - not free, scrapes tabular data to Excel
- PyPDFOCR (http://virantha.com/2013/07/22/pyocr-a-python-script-for-running-free-ocr-on-your-pdfs/) - a Python script that uses Google's Tesseract OCR engine. In its conversion process, it produces hOCR files, which should be machine readable.
- I will look for more here: PDF Scrapers on GitHub (nwhysel)
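
Since the PyPDFOCR entry above notes that hOCR files should be machine readable, here is a minimal sketch of what reading one could look like. It assumes the usual hOCR conventions as emitted by Tesseract (word spans with class `ocrx_word` and a `bbox x0 y0 x1 y1` property in the `title` attribute); the exact output of PyPDFOCR has not been checked here. The names `HocrWordParser`, `parse_bbox`, `extract_words`, and `SAMPLE` are made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # hOCR properties are separated by ';', e.g. "bbox 10 10 60 30; x_wconf 95"
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Text may sit inside nested tags such as <strong>; collect all of it.
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    """Parse an hOCR document string and return a list of (word, bbox)."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample in hOCR style (not real PyPDFOCR output).
SAMPLE = """<div class='ocr_page' title='bbox 0 0 200 100'>
<span class='ocr_line' title='bbox 10 10 190 30'>
<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 70 10 120 30; x_wconf 91'><strong>world</strong></span>
</span></div>"""

if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE):
        print(word, bbox)
```

The word positions recovered this way could then be grouped into rows and columns when the scanned page holds a table, though that step depends heavily on the layout of the document.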