Ayush Garg edited this page Jan 14, 2021 · 17 revisions

getpapers (https://github.com/petermr/openVirus/wiki/getpapers), the primary scraper that we've been using so far, is written in JavaScript and requires Node.js to run. Because the Node-based getpapers has been hard to maintain and extend, we've decided to rewrite the whole thing in Python and call it pygetpapers.

People

  1. PMR
  2. Ayush
  3. Dheeraj
  4. Shweata

Our Initial Plans

PMR: This project is well suited to a modular approach, in both content and functionality. For example, each target repo is a subproject, and as long as the framework is well designed it should be possible to add repos independently. One important piece that is missing at the moment is documentation on "how to add a new repo".
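One way to picture the modular approach PMR describes is a small registry where each target repo plugs in as its own class. The sketch below is only an illustration under assumptions: the names `Repository`, `register`, `REPOSITORIES`, `ExampleRepository` and `search` are hypothetical and are not part of getpapers or pygetpapers, and the example repo returns placeholder records instead of calling a real API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

# Hypothetical registry: maps a repo name to the class that handles it.
REPOSITORIES: Dict[str, Type["Repository"]] = {}


def register(name: str) -> Callable[[Type["Repository"]], Type["Repository"]]:
    """Class decorator that adds a repo class to the registry under `name`."""
    def decorator(cls: Type["Repository"]) -> Type["Repository"]:
        REPOSITORIES[name] = cls
        return cls
    return decorator


class Repository(ABC):
    """Interface every target repo would implement (an assumed design)."""

    @abstractmethod
    def search(self, query: str, limit: int) -> List[dict]:
        """Return up to `limit` metadata records matching `query`."""


@register("example")
class ExampleRepository(Repository):
    """Placeholder repo: fabricates records locally, no network access."""

    def search(self, query: str, limit: int) -> List[dict]:
        return [{"id": f"example-{i}", "query": query} for i in range(limit)]


def search(repo_name: str, query: str, limit: int = 10) -> List[dict]:
    """Look up the repo by name and delegate the search to it."""
    if repo_name not in REPOSITORIES:
        raise ValueError(f"Unknown repo: {repo_name!r}")
    return REPOSITORIES[repo_name]().search(query, limit)
```

With a layout like this, "how to add a new repo" would come down to writing one new class with a `search` method and decorating it with `@register("name")`; the framework code would not need to change.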

Requirements and Bugs to Fix

  • General API
  • Sort results by date
  • Download only specific article types (Review, Research, etc.)
  • Add attributes for repository-specific functions
  • Add an option to get raw files as well as formatted files such as XML and PDF
  • Convert XML papers into a user-readable format.
  • Specify a wordlist and then get the count of those words for each paper.
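The last two requirements (readable text from XML, and word counts per paper) could be prototyped with the Python standard library alone. This is a minimal sketch under assumptions, not the planned implementation: the function names `xml_to_text` and `count_words` are hypothetical, the sample XML is made up for illustration, and matching is a simple case-insensitive whole-word comparison.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List


def xml_to_text(xml_string: str) -> str:
    """Strip the markup from an XML paper and return whitespace-normalised text."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order.
    return " ".join(" ".join(root.itertext()).split())


def count_words(text: str, wordlist: List[str]) -> Dict[str, int]:
    """Count case-insensitive whole-word occurrences of each wordlist entry."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in wordlist}


# Made-up sample paper, for illustration only.
sample = (
    "<article><title>Virus spread</title>"
    "<body><p>The virus and the vaccine.</p></body></article>"
)
text = xml_to_text(sample)
word_counts = count_words(text, ["virus", "vaccine", "mask"])
```

Running `count_words` once per downloaded paper would give the per-paper counts; multi-word terms in the wordlist would need a different matching strategy than the single-token lookup used here.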