# Scripts for processing the Coauth dataset
Using these scripts requires setting up the original database in a Neo4j instance, running specific Cypher queries to extract the data into a workable format, and then running the Python scripts in this repo to convert it into node and edge CSV files suitable for Multinet Girder.
- Download the Neo4j data dump:

  ```sh
  curl -O -J https://data.kitware.com/api/v1/file/5cc709cf8d777f072b766fbf/download
  ```

- Unpack the data:

  ```sh
  tar xzvf data_dblp.tar.gz
  ```

- Load the data into Neo4j:

  ```sh
  sudo neo4j-admin load --from=data_dblp/databases/ --force=true
  ```
- Extract the node data:

  ```sh
  CYPHERSHELL=cypher-shell USERNAME=neo4j PASSWORD=neo4j sh all-nodes.sh
  ```

  (substituting your local values for `CYPHERSHELL`, `USERNAME`, and `PASSWORD`). Note that the final line of output from this script will be the name of a temp file containing the extracted node data, referred to as `DUMPFILE` in the following step.
- Run the Python script to convert the dumped data to CSV files:

  ```sh
  python cypher2cnode.py author.csv journal.csv conference.csv <${DUMPFILE}
  ```

  This will create the files `author.csv`, `journal.csv`, and `conference.csv` containing node data of those types (see the illustrative sketch after this list).

- Extract the edge data and pipe it through the Python script to convert it to a CSV:

  ```sh
  CYPHERSHELL=cypher-shell USERNAME=neo4j PASSWORD=neo4j sh all-links.sh | python cypher2edge.py author.csv journal.csv conference.csv >authorship.csv
  ```

  (again substituting your local values for `CYPHERSHELL`, `USERNAME`, and `PASSWORD`). This will create the file `authorship.csv` containing edge data between the records in the three node data files.
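For orientation, here is a minimal sketch of the kind of work `cypher2cnode.py` performs: read the dump on stdin and route each node record into one of three per-type CSV files. It assumes each dump line carries one node record whose label (Author, Journal, or Conference) can be recognized in the raw text; the label test and row format are placeholders, not the script's actual logic, so consult `cypher2cnode.py` in this repo for the real implementation.

```python
# Illustrative sketch only -- see cypher2cnode.py for the real logic.
# Assumes each dump line carries one node record whose label (Author,
# Journal, or Conference) appears in the raw text; the label test and
# single-column row format below are placeholders.
import csv
import sys

def route_nodes(dump, author_path, journal_path, conference_path):
    """Split node records from a cypher-shell dump into per-type CSV files."""
    with open(author_path, "w", newline="") as a, \
         open(journal_path, "w", newline="") as j, \
         open(conference_path, "w", newline="") as c:
        writers = {
            "Author": csv.writer(a),
            "Journal": csv.writer(j),
            "Conference": csv.writer(c),
        }
        for line in dump:
            record = line.strip()
            for label, writer in writers.items():
                if label in record:  # placeholder label test
                    writer.writerow([record])
                    break

if __name__ == "__main__":
    # Mirrors the invocation above: three output paths, dump on stdin.
    route_nodes(sys.stdin, sys.argv[1], sys.argv[2], sys.argv[3])
```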
After running these steps you will have four files: `author.csv`, containing author node data; `journal.csv` and `conference.csv`, containing publication node data; and `authorship.csv`, containing edge data linking publications to authors via an "authored by" relation.
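As a quick sanity check that `authorship.csv` really links into the three node files, something like the following can be run over the outputs. The header names `_id`, `_from`, and `_to` are assumptions about the CSV columns, not confirmed here; substitute whatever headers the conversion scripts actually emit.

```python
# Sanity-check sketch: verify that every edge endpoint in authorship.csv
# matches a node in one of the three node files. The header names "_id",
# "_from", and "_to" are assumptions -- adjust to the actual CSV headers.
import csv

def load_ids(path, id_col="_id"):
    with open(path, newline="") as f:
        return {row[id_col] for row in csv.DictReader(f)}

node_ids = (load_ids("author.csv")
            | load_ids("journal.csv")
            | load_ids("conference.csv"))

with open("authorship.csv", newline="") as f:
    dangling = [row for row in csv.DictReader(f)
                if row["_from"] not in node_ids or row["_to"] not in node_ids]

print(f"{len(dangling)} edges reference unknown nodes")
```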