pygetpapers
getpapers (https://github.com/petermr/openVirus/wiki/getpapers), the primary scraper that we've been using so far, is written in JavaScript and requires Node.js to run. Driven by the problems of maintaining and extending the Node-based getpapers, we've decided to re-write the whole thing in Python and call it pygetpapers.
- PMR
- Ayush
- Dheeraj
- Shweata
PMR: This project is well suited to a modular approach, both in content and functionality. For example, each target repo is a subproject, and as long as the framework is well designed it should be possible to add repos independently. An important aspect (missing at the moment) is a "how to add a new repo" guide, for example. This is essential.
The EPMC API is fairly typical. Does it correspond to a known standard?
Identify, for each piece of API functionality, whether it:
- MUST be included
- MAY be useful
- should NOT be included (there are many bibliographic fields we don't need)
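For reference, the core EPMC search is a single REST endpoint. A minimal sketch of building the search URL (parameter names follow the public Europe PMC REST documentation; the helper name `epmc_search_url` is illustrative, not part of any existing code):

```python
import urllib.parse

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def epmc_search_url(query, page_size=25, cursor="*", result_type="lite"):
    """Build a Europe PMC search URL returning JSON.

    cursor="*" asks for the first page; result_type controls how many
    bibliographic fields come back (relevant to the MUST/MAY/NOT triage above).
    """
    params = {
        "query": query,
        "format": "json",
        "pageSize": page_size,
        "cursorMark": cursor,
        "resultType": result_type,
    }
    return EPMC_SEARCH + "?" + urllib.parse.urlencode(params)
```

The actual request is then one `urllib.request.urlopen` call on that URL; keeping URL construction separate makes it easy to unit-test without the network.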
This is metadata from publishers. It's very variable. It may include abstracts but often does not.
What fields do we wish to retrieve?
Analyze the query syntax. How much (a) semantic and (b) syntactic overlap is there with EPMC?
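One way to probe the overlap is to run the same query against Crossref's works endpoint and compare what comes back. A hedged sketch (the `filter` handling mirrors the `--filter` option that getpapers passes straight to Crossref; the helper name is illustrative):

```python
import urllib.parse

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(query, rows=20, flt=None):
    """Build a Crossref works-search URL.

    flt is an optional dict of Crossref filter key/value pairs,
    e.g. {"type": "journal-article"}.
    """
    params = {"query": query, "rows": rows}
    if flt:
        params["filter"] = ",".join(f"{k}:{v}" for k, v in flt.items())
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)
```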
Physics, maths and compsci preprints. Non-semantic fulltext as PDF, Word or TeX; no XML. Low priority for the biosciences.
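The arXiv export API returns Atom XML rather than JSON, which already illustrates the non-semantic point. A minimal sketch of building a query URL (parameter names per the public arXiv API; the helper name is an assumption):

```python
import urllib.parse

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(search_query, start=0, max_results=10):
    """Build an arXiv export-API query URL.

    search_query uses arXiv's fielded syntax, e.g. "all:electron".
    The response is an Atom feed, parseable with xml.etree or feedparser.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)
```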
Current options

```
-h, --help                  output usage information
-V, --version               output the version number
-q, --query <query>         search query (required)
-o, --outdir <path>         output directory (required - will be created if not found)
--api <name>                API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml                   download fulltext XMLs if available
-p, --pdf                   download fulltext PDFs if available
-s, --supp                  download supplementary files if available
-t, --minedterms            download text-mined terms if available
-l, --loglevel <level>      amount of information to log (silent, verbose, info*, data, warn, error, or debug)
-a, --all                   search all papers, not just open access
-n, --noexecute             report how many results match the query, but don't actually download anything
-f, --logfile <filename>    save log to specified file in output directory as well as printing to terminal
-k, --limit <int>           limit the number of hits and downloads
--filter <filter object>    filter by key value pair, passed straight to the crossref api only
-r, --restart               restart file downloads after failure
```
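A Python re-implementation could mirror these options almost one-for-one with argparse. A sketch covering a subset (the default of 100 for `-k` is a proposal addressing the no-default problem noted below, not current getpapers behaviour):

```python
import argparse

def make_parser():
    """Argument parser mirroring a subset of the getpapers options above."""
    p = argparse.ArgumentParser(prog="pygetpapers")
    p.add_argument("-q", "--query", required=True, help="search query")
    p.add_argument("-o", "--outdir", required=True,
                   help="output directory (created if not found)")
    p.add_argument("--api", default="eupmc",
                   choices=["eupmc", "crossref", "ieee", "arxiv"],
                   help="API to search")
    p.add_argument("-x", "--xml", action="store_true",
                   help="download fulltext XMLs if available")
    p.add_argument("-p", "--pdf", action="store_true",
                   help="download fulltext PDFs if available")
    p.add_argument("-n", "--noexecute", action="store_true",
                   help="report hit count only, download nothing")
    p.add_argument("-k", "--limit", type=int, default=100,
                   help="limit the number of hits and downloads (proposed default)")
    return p
```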
Do bioRxiv or medRxiv have an API?
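Both servers appear to share a REST API at api.biorxiv.org; this is worth verifying against its current documentation, but a sketch of the details endpoint as documented would be (endpoint layout is an assumption to be checked):

```python
BIORXIV_API = "https://api.biorxiv.org"

def biorxiv_details_url(server, interval, cursor=0):
    """Build a bioRxiv/medRxiv details-endpoint URL.

    server is 'biorxiv' or 'medrxiv'; interval is either a DOI or a date
    range 'YYYY-MM-DD/YYYY-MM-DD'; cursor pages through results.
    """
    assert server in ("biorxiv", "medrxiv")
    return f"{BIORXIV_API}/details/{server}/{interval}/{cursor}"
```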
Useful if the logfile can default to a child of the CProject.
Does this work for EPMC? Is it documented?
Does this work for EPMC? Is it useful? Is it documented?
Has anyone used this? What does it do? Is it documented?
=====
These are much too general. Who contributed them? Please expand:
What does this mean?
Which date?
This is an EPMC option (I think). What is the current query format? We will need to customise this for the user.
This is too general
Which raw files? Does EPMC have an interface? Do we want these files? Why? What are they used for?
Why? This is not part of getpapers; it is already done by ami.
Out of scope. This is ami-search.
getpapers had no default for the number of hits (the -k option). This often resulted in downloading the whole database. High priority.
The user should be able to set the number of hits per page. This wasn't explicit in getpapers. It may also be possible to restart failed searches. Low priority.
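The cap-and-page behaviour can be separated from the HTTP layer so it is easy to test and to restart. A sketch, where `fetch_page` is whatever wraps the repo's paging API (for EPMC, the cursorMark/nextCursorMark pair in the search JSON); the function and parameter names are illustrative:

```python
def paged_download(fetch_page, limit=100):
    """Fetch result pages until `limit` hits or the cursor stops advancing.

    fetch_page(cursor) -> (hits_list, next_cursor). The default limit plays
    the role of the proposed -k default, so a broad query can never silently
    download the whole database.
    """
    hits, cursor = [], "*"
    while len(hits) < limit:
        page, next_cursor = fetch_page(cursor)
        if not page:
            break
        hits.extend(page)
        # An unchanged or missing cursor means the repo has no more pages.
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
    return hits[:limit]
```

Because `fetch_page` is injected, a restart after failure just means calling `paged_download` again with the last saved cursor baked into the fetcher.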
The use of brackets and quotes can be confusing and lead to errors. It will also be useful when querying using a list of terms. Medium priority
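A small query builder could take the bracketing and quoting out of the user's hands. A sketch (the optional EPMC-style field prefix is an assumption; the helper name is illustrative):

```python
def or_query(terms, field=None):
    """Build a bracketed, quoted OR query from a list of terms.

    Multi-word terms are wrapped in double quotes; field is an optional
    fielded-search prefix such as "TITLE" (EPMC-style syntax assumed).
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if field:
        quoted = [f"{field}:{q}" for q in quoted]
    return "(" + " OR ".join(quoted) + ")"
```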
Many of our tools require fulltext and it may be useful to exclude others.
This might be done by simply sorting the papers by size (there may be a better way). This would ensure the user knows which folder to open and what to expect.
It may be possible to exclude non-fulltext in the search.
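Europe PMC supports fielded search terms that do this at query time. A sketch assuming the HAS_FT and OPEN_ACCESS query fields (worth confirming against the current EPMC query documentation):

```python
def restrict_to_fulltext(query, open_access_only=False):
    """Append EPMC fielded-search terms that exclude non-fulltext records.

    HAS_FT:Y keeps only records with fulltext; OPEN_ACCESS:Y further
    restricts to the open-access subset.
    """
    q = f"({query}) AND HAS_FT:Y"
    if open_access_only:
        q += " AND OPEN_ACCESS:Y"
    return q
```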
Some papers have data mounted on the publisher's server ("supplemental data", "supporting information").
getpapers has a --supp option. Does this do what we want?
Many papers reference data through links in the fulltext. This would require HTTP-request to download. They could vary a lot in size or number.
Should this be automatic, or an interactive facility after the text downloads (e.g. in a dashboard)?
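Harvesting those links could start from the fulltext XML itself. A heuristic sketch for JATS-style markup (the element names follow JATS, but real articles vary, so treat this as a starting point rather than a complete extractor):

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def data_links(fulltext_xml):
    """Collect candidate data links from a JATS fulltext XML string.

    Looks for xlink:href attributes on <supplementary-material>, <media>
    and <ext-link> elements; duplicates and non-data links are not filtered.
    """
    root = ET.fromstring(fulltext_xml)
    links = []
    for tag in ("supplementary-material", "media", "ext-link"):
        for el in root.iter(tag):
            href = el.get(XLINK_HREF)
            if href:
                links.append(href)
    return links
```

Whichever interaction model is chosen, the extraction step is the same; only the decision of when to fire the HTTP requests differs.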