Scripts for processing the Coauth dataset
Running these scripts requires three stages: setting up the original database in a Neo4j instance, running specific Cypher queries to extract the data into a workable format, and then running the Python scripts in this repo to convert that data into node and edge CSV files suitable for Multinet Girder.
- Download the Neo4j data dump: `curl -O -J`.
- Unpack the data: `tar xzvf data_dblp.tar.gz`.
- Load the data into Neo4j: `sudo neo4j-admin load --from=data_dblp/databases/ --force=true`.
- Extract the node data: `CYPHERSHELL=cypher-shell USERNAME=neo4j PASSWORD=neo4j sh` (substituting your local equivalents for the values of `CYPHERSHELL`, `USERNAME`, and `PASSWORD`). Note that the final line of output from this script will be the name of a temp file containing the extracted node data, called `DUMPFILE` in the following step.
- Run the Python script to convert the dumped data to CSV files: `python author.csv journal.csv conference.csv <${DUMPFILE}`. This will create the files `author.csv`, `journal.csv`, and `conference.csv`, each containing node data of the corresponding type.
- Extract the edge data and pipe it through the Python script to convert it to a CSV: `CYPHERSHELL=cypher-shell USERNAME=neo4j PASSWORD=neo4j sh | python author.csv journal.csv conference.csv >authorship.csv` (again substituting proper values for `CYPHERSHELL`, `USERNAME`, and `PASSWORD`). This will create the file `authorship.csv`, containing edge data between the records in the three node data files.
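
The node-extraction step reports its dump file as the final line of its output. If you drive the pipeline from Python instead of the shell, that value can be captured along these lines. This is only a sketch: `capture_dumpfile` is a hypothetical helper that is not part of this repo, and the command it runs is supplied by the caller.

```python
import subprocess
import sys


def capture_dumpfile(command, env=None):
    """Run a command and return the last non-empty line of its stdout.

    Hypothetical helper: `command` is whatever extraction command you use,
    given as a list of arguments.
    """
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("extraction command produced no output")
    return lines[-1]


if __name__ == "__main__":
    # Stand-in command that mimics a script printing progress, then a path.
    demo = [sys.executable, "-c", "print('extracting...'); print('/tmp/nodes.dump')"]
    print(capture_dumpfile(demo))
```

To pass `CYPHERSHELL`, `USERNAME`, and `PASSWORD` through, build `env` from a copy of `os.environ` with those keys added, since `subprocess.run` replaces the whole environment when `env` is given.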
After running these steps you will have four files: `author.csv`, containing author node data; `journal.csv` and `conference.csv`, containing publication node data; and `authorship.csv`, containing edge data linking publications to authors via an "authored by" relation.
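
Before uploading, it can be worth checking that every edge in the edge file points at a node that exists in one of the node files. The sketch below shows one way to do that. It assumes, without confirmation from this repo, that node rows carry a `_key` column and edge rows carry `_from` and `_to` columns in `table/key` form; adjust the column names to match the actual CSV headers.

```python
import csv
import io


def node_ids(stream, table, key_column="_key"):
    """Return the set of 'table/key' identifiers found in a node CSV stream.

    The `_key` column name and the 'table/key' form are assumptions.
    """
    return {f"{table}/{row[key_column]}" for row in csv.DictReader(stream)}


def dangling_edges(stream, known_ids, from_column="_from", to_column="_to"):
    """Return edge rows with at least one endpoint missing from known_ids."""
    return [
        row
        for row in csv.DictReader(stream)
        if row[from_column] not in known_ids or row[to_column] not in known_ids
    ]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files; the contents are made up.
    authors = io.StringIO("_key,name\na1,Ada\n")
    journals = io.StringIO("_key,title\nj1,Some Paper\n")
    edges = io.StringIO("_from,_to\njournal/j1,author/a1\njournal/j2,author/a1\n")

    known = node_ids(authors, "author") | node_ids(journals, "journal")
    for row in dangling_edges(edges, known):
        print("dangling edge:", row["_from"], "->", row["_to"])
```

For the real files, pass handles opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` objects, and take the union of the identifiers from all three node files.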