Scraping tools
joelbcastillo edited this page Sep 16, 2014 · 17 revisions
Listing of tools
- Sunlight Foundation Tools
- List here
- Other here
- cometdocs.com - ToS requires that you own what you upload, FWIW
- pdftoexcelonline.com - ToS just requires that you respect IP rights
- zamzar.com - ToS says to respect copyright
- tabula.nerdpower.org - scrapes tabular data to CSV or Excel
- scraperwiki.com - not free, scrapes tabular data to Excel
- PyPDFOCR - This is a Python script that uses Google's Tesseract OCR engine to make scanned PDFs searchable. In its conversion process, it produces hOCR files, which should be machine readable.
- PDFMiner - A tool written in Python to extract information from PDF files. DORIS used this to extract the text from the text-selectable City Records (2008-2014).
- I will look for more here: PDF Scrapers on GitHub (nwhysel)
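
Since the PyPDFOCR entry above notes that hOCR files should be machine readable, here is a minimal sketch of what reading one could look like. hOCR is HTML in which each recognized word is typically a `span` with class `ocrx_word` and a `title` attribute carrying its bounding box. The names `HocrWordParser`, `parse_bbox`, and `extract_words` are hypothetical, and `SAMPLE` is a made-up fragment for illustration, not actual PyPDFOCR output. It uses only the Python standard library.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # A title looks like "bbox 36 92 96 116; x_wconf 95".
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished (text, bbox) pairs
        self._in_word = False  # True while inside an ocrx_word span
        self._bbox = None      # bbox of the word span currently open
        self._chunks = []      # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._in_word:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Made-up hOCR fragment, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>City</span>
<span class='ocrx_word' title='bbox 104 92 190 116; x_wconf 91'>Record</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

To try it on a real file, read the hOCR file into a string and pass it to `extract_words`. The word positions can then be used to rebuild lines or table columns.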