PDF-mining

I was given about 600 PDF files in various formats and from various sources, and a CSV file, bse_companies.csv, with the names of about 7000 Indian companies.

Goal:

1. Go through each PDF file and extract the names of the authors of the document.
2. Extract the institution that authored the document, e.g., Kotak Equities Research, ICICI Securities. If no institution is associated with the document, mark it as "Others".
3. Extract the names of the companies mentioned in each PDF.
4. If it is a company report, get the broker recommendation (BUY/SELL etc.).
5. If it is a company report, get the Target Price of the stock.

Approach:

1. Extracted page-wise text from the PDF files with pdfminer.
2. Built a dataset of Indian company and person names, and fine-tuned dslim/bert-base-NER on it.
3. Trained the NER model to recognise ORG and PER entities, and used it for Goals 1, 2, and 3.
4. Used fastText with weighted words from the text and the title as separate features, combined with unsupervised KNN clustering, to identify reports.
5. Applied regexes to the documents classified as reports to get Goals 4 and 5.
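
The regex step for Goals 4 and 5 could look roughly like the sketch below. This is a minimal illustration, not the code from the notebooks: the rating vocabulary, the pattern names, and the `extract_rating_and_target` helper are all assumptions, and real broker reports phrase ratings and target prices in many more ways than these patterns cover.

```python
import re

# Assumed rating vocabulary; brokers use many other labels as well.
RATINGS = "BUY|SELL|HOLD|ADD|REDUCE|ACCUMULATE|NEUTRAL"

# A rating that follows an explicit label, e.g. "Rating: Buy" (any case).
LABELLED_RATING_RE = re.compile(
    rf"\b(?:rating|recommendation|reco)\b\s*[:\-]?\s*({RATINGS})\b",
    re.IGNORECASE,
)
# Fallback: a bare, upper-case rating word. Kept case-sensitive so that
# ordinary words such as "add" or "hold" in running text do not match.
BARE_RATING_RE = re.compile(rf"\b({RATINGS})\b")

# "Target Price: Rs 1,250", "TP of Rs. 980.50", "target price 450", ...
TARGET_PRICE_RE = re.compile(
    r"\b(?:target\s+price|TP)\b\s*(?:of\s+)?[:=\-]?\s*"
    r"(?:Rs\.?|INR|₹)?\s*(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_rating_and_target(text):
    """Return the first rating and target price found in text, or None for each."""
    rating = LABELLED_RATING_RE.search(text) or BARE_RATING_RE.search(text)
    target = TARGET_PRICE_RE.search(text)
    return {
        "recommendation": rating.group(1).upper() if rating else None,
        "target_price": float(target.group(1).replace(",", "")) if target else None,
    }


if __name__ == "__main__":
    print(extract_rating_and_target("ABC Ltd | Rating: Buy | Target Price: Rs 1,250"))
    print(extract_rating_and_target("We maintain BUY with a TP of Rs. 980.50"))
```

Running the patterns only on documents already classified as reports, as in the approach above, limits false positives from words like "buy" or "target" appearing in unrelated documents.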

Files:

pagewise_extracted_pdfs_sample
NER_on_pdfs.ipynb
README.md
bse_companies.csv
data_collector.py
doc_classification.ipynb
indian_orgs_data_scraped.csv
pdf_extractor.py
pdf_info_extracted.json
report_extracted.json
