
Using the document ingestor

Joe Cabrera edited this page Jan 13, 2015 · 6 revisions

Initialization

To start using the library, first import Ingestor and Edgar:

from ingestor import Ingestor, Edgar

Edgar (US) and Sedar (Canada) are currently supported. Note, however, that the two workflows differ slightly. See the Sedar notes at the bottom of this page.

Simple Download Workflow

First, specify which kind of files you want when creating the Edgar object:

ingestor = Ingestor()
edgar = Edgar("xbrl")

Currently, xbrl and html are supported.

Then call ingest_stock() with the stock ticker to ingest, and pass its result to file_downloader() together with the directory in which to store the downloaded docs:

ingestor.file_downloader(edgar.ingest_stock("AAPL"), downloaded_docs_directory)
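The steps above can be combined into one small helper. This is only a sketch, assuming Python 3. The names prepare_download_dir and download_filings are hypothetical and not part of the library; only Ingestor, Edgar, ingest_stock() and file_downloader() come from this page. Whether file_downloader() creates a missing directory itself is not stated here, so the sketch creates it up front to be safe.

```python
import os


def prepare_download_dir(path):
    """Create the target directory if needed and return its absolute path.

    Hypothetical helper, not part of the ingestor library.
    """
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)
    return path


def download_filings(ticker, doc_type="xbrl", directory="filings"):
    """Download Edgar filings for one ticker (hypothetical wrapper)."""
    # This page lists only these two document types as supported.
    if doc_type not in ("xbrl", "html"):
        raise ValueError("doc_type must be 'xbrl' or 'html'")

    # Imported here so the helpers above can be used without the library.
    from ingestor import Ingestor, Edgar

    target = prepare_download_dir(directory)
    ingestor = Ingestor()
    edgar = Edgar(doc_type)
    # Same call as shown above, with a concrete directory filled in.
    ingestor.file_downloader(edgar.ingest_stock(ticker), target)
    return target
```

For example, download_filings("AAPL", "html", "~/filings/aapl") would request the HTML filings for AAPL and store them under that directory.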

Sedar Download Workflow Note

The Sedar workflow is very similar to the Edgar workflow, except that you will see a browser window launched. This is to capture cookies. Once the browser is launched, you will need to click on a document link on the page. This will open up a CAPTCHA window. Solve the CAPTCHA and then close all the browser windows. The downloader should then proceed normally, assuming you solved the CAPTCHA correctly.

Sedar PDF Conversion Note

Sedar has chosen to make the majority of their filing documents available as PDF. PyLucene does not support indexing PDF documents, so you will need to convert the PDF documents to text and then index this text. A conversion script has been provided for your convenience at https://github.com/greedo/DIY-FilingsResearch/blob/master/scripts/convert.sh. You will need to install PDFMiner first. This can be done with a simple:

pip install pdfminer
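If you would rather drive the conversion from Python, the following is a minimal sketch of one way to do it. It assumes Python 3 and that the pdf2txt.py command-line tool installed with PDFMiner is on your PATH. It is not a port of convert.sh, which may work differently, and all function names here are hypothetical.

```python
import os
import subprocess


def txt_path_for(pdf_path):
    """Return the output path: same location and name, .txt extension."""
    root, _ext = os.path.splitext(pdf_path)
    return root + ".txt"


def build_command(pdf_path):
    """Build the pdf2txt.py invocation for one PDF file."""
    # Assumes pdf2txt.py (shipped with PDFMiner) is on the PATH.
    return ["pdf2txt.py", "-o", txt_path_for(pdf_path), pdf_path]


def find_pdfs(directory):
    """Collect every .pdf file below the given directory."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(dirpath, name))
    return found


def convert_directory(directory):
    """Convert each PDF under directory to a text file next to it."""
    for pdf_path in find_pdfs(directory):
        # check_call raises CalledProcessError if the conversion fails.
        subprocess.check_call(build_command(pdf_path))
```

Calling convert_directory() on the directory you passed to file_downloader() would leave a .txt file beside each downloaded PDF, ready to be indexed.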