This project contains a utility for extracting academic citation identifiers.
NOTE: As one of its dependencies (Mediawiki-Utilities) requires Python 3 so does mwcites.
pip install mwcites
There's really only one utility in this package called mwcitations
.
$ mwcitations extract enwiki-20150112-pages-meta-history*.xml*.bz2 > citations.tsv
Documentation is provided $ mwcitations extract -h
.
Extracts academic citations from articles from the history of Wikipedia articles by processing a pages-meta-history XML dump and matching regular expressions to revision content. Currently supported identifiers include: * PubMed * DOI * ISBN * arXiv * ISSN Outputs a TSV file with the following fields: * page_id: The identifier of the Wikipedia article (int), e.g. 1325125 * page_title: The title of the Wikipedia article (utf-8), e.g. Club cell * rev_id: The Wikipedia revision where the citation was first added (int), e.g. 282470030 * timestamp: The timestamp of the revision where the citation was first added. (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z * type: The type of identifier, e.g. pmid, pmcid, doi, isbn or arxiv * id: The id of the cited scholarly article (utf-8), e.g 10.1183/09031936.00213411 Usage: mwcites extract -h | --help mwcites extract <dump_file>... Options: -h --help Shows this documentation