pygetpapers
getpapers (https://github.com/petermr/openVirus/wiki/getpapers), the primary scraper that we've been using so far, is written in JavaScript and requires Node.js to run. Driven by the problems of maintaining and extending the Node-based getpapers, we've decided to re-write the whole thing in Python and call it pygetpapers.
- PMR
- Ayush
- Dheeraj
- Shweata
PMR: This project is well suited to a modular approach, both in content and functionality. For example, each target repo is a subproject, and as long as the framework is well designed it should be possible to add repos independently. An important aspect (missing at the moment) is a "how to add a new repo" guide, for example. This is essential.
The EPMC API is fairly typical. Does it correspond to a known standard?
Identify, for each piece of API functionality, whether it:
- MUST be included
- MAY be useful
- should NOT be included (there are many bibliographic fields we don't need)
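For reference, the core EPMC search is a single REST endpoint. A minimal sketch of building the search URL (parameter names follow the public Europe PMC REST documentation; the helper name `epmc_search_url` is illustrative, not part of any existing code):

```python
import urllib.parse

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def epmc_search_url(query, page_size=25, cursor="*", result_type="lite"):
    """Build a Europe PMC search URL returning JSON.

    cursor="*" asks for the first page; result_type controls how many
    bibliographic fields come back (relevant to the MUST/MAY/NOT triage above).
    """
    params = {
        "query": query,
        "format": "json",
        "pageSize": page_size,
        "cursorMark": cursor,
        "resultType": result_type,
    }
    return EPMC_SEARCH + "?" + urllib.parse.urlencode(params)
```

The actual request is then one `urllib.request.urlopen` call on that URL; keeping URL construction separate makes it easy to unit-test without the network.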
This is metadata from publishers. It's very variable. It may include abstracts but often does not.
What fields do we wish to retrieve?
Analyze the query syntax. How much (a) semantic and (b) syntactic overlap is there with EPMC?
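One way to probe the overlap is to run the same query against Crossref's works endpoint and compare what comes back. A hedged sketch (the `filter` handling mirrors the `--filter` option that getpapers passes straight to Crossref; the helper name is illustrative):

```python
import urllib.parse

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(query, rows=20, flt=None):
    """Build a Crossref works-search URL.

    flt is an optional dict of Crossref filter key/value pairs,
    e.g. {"type": "journal-article"}.
    """
    params = {"query": query, "rows": rows}
    if flt:
        params["filter"] = ",".join(f"{k}:{v}" for k, v in flt.items())
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)
```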
Physics, maths and compsci preprints. Non-semantic fulltext as PDF, Word or TeX; no XML. Low priority for the biosciences.
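The arXiv export API returns Atom XML rather than JSON, which already illustrates the non-semantic point. A minimal sketch of building a query URL (parameter names per the public arXiv API; the helper name is an assumption):

```python
import urllib.parse

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(search_query, start=0, max_results=10):
    """Build an arXiv export-API query URL.

    search_query uses arXiv's fielded syntax, e.g. "all:electron".
    The response is an Atom feed, parseable with xml.etree or feedparser.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)
```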
Current options

```
-h, --help                  output usage information
-V, --version               output the version number
-q, --query <query>         search query (required)
-o, --outdir <path>         output directory (required - will be created if not found)
--api <name>                API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml                   download fulltext XMLs if available
-p, --pdf                   download fulltext PDFs if available
-s, --supp                  download supplementary files if available
-t, --minedterms            download text-mined terms if available
-l, --loglevel <level>      amount of information to log (silent, verbose, info*, data, warn, error, or debug)
-a, --all                   search all papers, not just open access
-n, --noexecute             report how many results match the query, but don't actually download anything
-f, --logfile <filename>    save log to specified file in output directory as well as printing to terminal
-k, --limit <int>           limit the number of hits and downloads
--filter <filter object>    filter by key value pair, passed straight to the crossref api only
-r, --restart               restart file downloads after failure
```
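A Python re-implementation could mirror these options almost one-for-one with argparse. A sketch covering a subset (the default of 100 for `-k` is a proposal addressing the no-default problem noted below, not current getpapers behaviour):

```python
import argparse

def make_parser():
    """Argument parser mirroring a subset of the getpapers options above."""
    p = argparse.ArgumentParser(prog="pygetpapers")
    p.add_argument("-q", "--query", required=True, help="search query")
    p.add_argument("-o", "--outdir", required=True,
                   help="output directory (created if not found)")
    p.add_argument("--api", default="eupmc",
                   choices=["eupmc", "crossref", "ieee", "arxiv"],
                   help="API to search")
    p.add_argument("-x", "--xml", action="store_true",
                   help="download fulltext XMLs if available")
    p.add_argument("-p", "--pdf", action="store_true",
                   help="download fulltext PDFs if available")
    p.add_argument("-n", "--noexecute", action="store_true",
                   help="report hit count only, download nothing")
    p.add_argument("-k", "--limit", type=int, default=100,
                   help="limit the number of hits and downloads (proposed default)")
    return p
```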
Do bioRxiv or medRxiv have an API?
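Both servers appear to share a REST API at api.biorxiv.org; this is worth verifying against its current documentation, but a sketch of the details endpoint as documented would be (endpoint layout is an assumption to be checked):

```python
BIORXIV_API = "https://api.biorxiv.org"

def biorxiv_details_url(server, interval, cursor=0):
    """Build a bioRxiv/medRxiv details-endpoint URL.

    server is 'biorxiv' or 'medrxiv'; interval is either a DOI or a date
    range 'YYYY-MM-DD/YYYY-MM-DD'; cursor pages through results.
    """
    assert server in ("biorxiv", "medrxiv")
    return f"{BIORXIV_API}/details/{server}/{interval}/{cursor}"
```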
Useful if the logfile can default to a child of the CProject.
Does this work for EPMC? Is it documented?
Does this work for EPMC? Is it useful? Is it documented?
Has anyone used this? What does it do? Is it documented?
=====
These are much too general. Who contributed them? Please expand:
What does this mean?
Which date?
This is an EPMC option (I think). What is the current query format? We will need to customise this for the user.
This is too general
Which raw files? Does EPMC have an interface? Do we want these files? Why? What are they used for?
Why? This is not part of getpapers; it is already done by ami.
Out of scope. This is ami-search.
getpapers had no default for the number of hits (the -k option). This often resulted in downloading the whole database. High priority.
The user should be able to set the number of hits per page. This wasn't explicit in getpapers. It may also be possible to restart failed searches. Low priority.
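The cap-and-page behaviour can be separated from the HTTP layer so it is easy to test and to restart. A sketch, where `fetch_page` is whatever wraps the repo's paging API (for EPMC, the cursorMark/nextCursorMark pair in the search JSON); the function and parameter names are illustrative:

```python
def paged_download(fetch_page, limit=100):
    """Fetch result pages until `limit` hits or the cursor stops advancing.

    fetch_page(cursor) -> (hits_list, next_cursor). The default limit plays
    the role of the proposed -k default, so a broad query can never silently
    download the whole database.
    """
    hits, cursor = [], "*"
    while len(hits) < limit:
        page, next_cursor = fetch_page(cursor)
        if not page:
            break
        hits.extend(page)
        # An unchanged or missing cursor means the repo has no more pages.
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
    return hits[:limit]
```

Because `fetch_page` is injected, a restart after failure just means calling `paged_download` again with the last saved cursor baked into the fetcher.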
The use of brackets and quotes can be confusing and lead to errors. It will also be useful when querying using a list of terms. Medium priority
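A small query builder could take the bracketing and quoting out of the user's hands. A sketch (the optional EPMC-style field prefix is an assumption; the helper name is illustrative):

```python
def or_query(terms, field=None):
    """Build a bracketed, quoted OR query from a list of terms.

    Multi-word terms are wrapped in double quotes; field is an optional
    fielded-search prefix such as "TITLE" (EPMC-style syntax assumed).
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if field:
        quoted = [f"{field}:{q}" for q in quoted]
    return "(" + " OR ".join(quoted) + ")"
```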
Many of our tools require fulltext and it may be useful to exclude others.
This might be done by simply sorting the papers by size (there may be a better way). This would ensure the user knows which folder to open and what to expect.
It may be possible to exclude non-fulltext in the search.
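Europe PMC supports fielded search terms that do this at query time. A sketch assuming the HAS_FT and OPEN_ACCESS query fields (worth confirming against the current EPMC query documentation):

```python
def restrict_to_fulltext(query, open_access_only=False):
    """Append EPMC fielded-search terms that exclude non-fulltext records.

    HAS_FT:Y keeps only records with fulltext; OPEN_ACCESS:Y further
    restricts to the open-access subset.
    """
    q = f"({query}) AND HAS_FT:Y"
    if open_access_only:
        q += " AND OPEN_ACCESS:Y"
    return q
```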
Some papers have data mounted on the publisher's server ("supplemental data", "supporting information").
getpapers has a --supp option. Does this do what we want?
Many papers reference data through links in the fulltext. This would require HTTP-request to download. They could vary a lot in size or number.
Should this be automatic, or an interactive facility after the text downloads (e.g. in a dashboard)?
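Harvesting those links could start from the fulltext XML itself. A heuristic sketch for JATS-style markup (the element names follow JATS, but real articles vary, so treat this as a starting point rather than a complete extractor):

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def data_links(fulltext_xml):
    """Collect candidate data links from a JATS fulltext XML string.

    Looks for xlink:href attributes on <supplementary-material>, <media>
    and <ext-link> elements; duplicates and non-data links are not filtered.
    """
    root = ET.fromstring(fulltext_xml)
    links = []
    for tag in ("supplementary-material", "media", "ext-link"):
        for el in root.iter(tag):
            href = el.get(XLINK_HREF)
            if href:
                links.append(href)
    return links
```

Whichever interaction model is chosen, the extraction step is the same; only the decision of when to fire the HTTP requests differs.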