Link Guess Workflow and Project
The Link Guesser is a parallel data link recommender that runs on high-performance computers such as the TWC Hercules machine and the CCNI. It reads in semantic datasets and searches for predicates whose values could be linked to Instance Hub. It currently looks for US States and for wgs:lat/wgs:long information.
This page describes the current and planned workflow of the Link Guesser and how it will fit into the overall TWC LOGD data conversion workflow.
The first step is choosing the datasets to analyze with the Link Guesser. These should be listed in the link-guesses retrieve.sh script, which contains the list of dataset URIs and is version controlled in the escience svn.
We will use the following URI as an example throughout this workflow discussion. This would be listed in the retrieve.sh script.
http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30
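As a sketch, the dataset list in retrieve.sh might look like the following (the two URIs shown are the datasets used as examples on this page; the authoritative list lives in the escience svn):

#!/bin/bash
#
# Datasets to analyze with the Link Guesser, one versioned dataset URI per entry.
datasets='
http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30
http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16
'
for dataset in $datasets; do
   echo "retrieving $dataset"
   # Dereference $dataset, follow its void:dataDump, and save the (potentially
   # compressed) Turtle dump into source/ -- see the second step below.
done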
The second step is to obtain the Turtle dump files. When invoked, the retrieve.sh script will find the (potentially compressed) Turtle dump files by dereferencing the dataset URIs and following the void:dataDump predicate. These files go into the source/ directory of a new version directory, according to the csv2rdf4lod-automation directory conventions.
The following directory and file result when this step is done:
version/2012-Feb-20/source/data-gov-1000-2010-Aug-30.ttl.gz
The Link Guesser requires N-Triples format, but the data dumps are not hosted as N-Triples (it is too verbose). So, the third step is to uncompress any compressed data dumps and convert them to N-Triples, storing the results in manual/. The Link Guesser also needs the dataset URI, so we include a sibling file with the extension .sd_name whose contents are just the string of the URI.
version/2012-Feb-20/manual/data-gov-1000-2010-Aug-30.ttl.nt
version/2012-Feb-20/manual/data-gov-1000-2010-Aug-30.ttl.nt.sd_name
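For example, assuming Raptor's rapper is installed, the uncompress-and-convert step might be scripted like this (file names follow the example above):

cd version/2012-Feb-20
mkdir -p manual
# Uncompress the Turtle dump and convert it to N-Triples with rapper.
gunzip -c source/data-gov-1000-2010-Aug-30.ttl.gz \
  | rapper -i turtle -o ntriples - 'http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30' \
  > manual/data-gov-1000-2010-Aug-30.ttl.nt
# Record the dataset URI in a sibling .sd_name file.
echo 'http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30' \
  > manual/data-gov-1000-2010-Aug-30.ttl.nt.sd_name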
The fourth step is having the Link Guesser analyze the datasets. This is done by feeding the Guesser the following three inputs: an N-Triples LOGD dataset, the named graph of that dataset, and the void.ttl file of that dataset (which comes from the CSV2RDF4LOD converter). Inputs:
- The data to analyze (as N-Triples), e.g. http://logd.tw.rpi.edu/source/data-gov/file/1000/version/2010-Aug-30/conversion/data-gov-1000-2010-Aug-30.ttl.gz
- The named graph of that dataset, e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30
- The void.ttl file of that dataset (from the CSV2RDF4LOD converter)
The Guesser writes its output to:
automatic/data-gov-1000-2010-Aug-30.ttl.nt.void.ttl
The Guesser reports a list of predicates that match as either a US State or a lat/long, and provides a score indicating how confident it is that each predicate can be linked to Instance Hub. The Guesser expresses its guesses in RDF similar to the following:
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30>
# This is the dataset that we are analyzing for links.
a void:Dataset;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses>;
# Our analysis will become a subset of the collection of link-guesses about data-gov/dataset/1000/version/2010-Aug-30
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses>
a void:Dataset;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
# This ------/\ is the dataset of link guesses that we just created for data-gov/dataset/1000/version/2010-Aug-30
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>
a void:Dataset, conversion:LinkGuessDataset, conversion:MetaDataset;
dcterms:modified "2012-02-20T20:40:26-05:00"^^xsd:dateTime;
void:dataDump <http://logd.tw.rpi.edu/source/twc-rpi-edu/provenance_file/link-guesses/version/2012-Feb-20/automatic/data-gov-1000-2010-Aug-30.ttl.nt.void.ttl>
# This ---------/\ is the data file created by dom's super computer link guesser 2000.
.
<http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/link-guesses/version/2012-Feb-20>
# This is the dataset of link guesses that we performed for _all_ datasets on 2012-Feb-20.
a void:Dataset;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
# Our collection of _all_ link guesses on 2012-Feb-20 includes the same dataset that we put under
# _each_ of the datasets that we analyzed for links.
.
<http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state>
# We are adding a description directly to the predicate used in the dataset, so that it is easy to find guesses from it.
:hasLinkGuess <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/1>;
.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/1>
# We are naming our guesses within the scope of the original datasets (2012-Feb-20 is the version of our link guesses)
a :LinkGuess;
void:inDataset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20>;
# This --------/\ is the void:Dataset of guesses that is a void:subset of the original and our guess collections.
dcterms:created "2012-02-04T23:00:00Z"^^xsd:dateTime;
:dataset <http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16>;
:link_concept <http://dbpedia.org/class/yago/StatesOfTheUnitedStates>;
:confidence 85;
# ^--- these three properties are in the vocabulary of the link guesser.
prov:wasAttributedTo :link_guesser_2000;
.
:link_guesser_2000
a doap:Software;
dcterms:creator "Jesse";
dcterms:contributor "Greg";
dcterms:contributor "Dominic";
.
In this example, for the EPA dataset "Toxin Release Into The Atmosphere", the Guesser has identified the predicate http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state as a candidate for linking to the Instance Hub category of US States, with a confidence of 85 out of 100.
Dominic checks out a working copy (from the escience svn) onto hercules, runs retrieve.sh, then performs the link guessing algorithm, which puts files into the automatic/ directory of a new version. Dominic then svn commits version/2012-Feb-20/automatic/*.
This void.ttl can now be loaded into the void graph of the http://logd.tw.rpi.edu/sparql endpoint.
We use the normal csv2rdf4lod-automation process to publish the guesses.
To publish, someone on gemini runs svn update to get the new guesses, then runs:
root@gemini:/mnt/raid/srv/logd/data/source/twc-rpi-edu/link-guesses/version/2012-Feb-20# cr-publish-cockpit.sh -w
which creates publish/bin/virtuoso-load-twc-rpi-edu-link-guesses-2012-Feb-20.sh and hosts http://logd.tw.rpi.edu/source/twc-rpi-edu/file/link-guesses/version/2012-Feb-20/conversion/twc-rpi-edu-link-guesses-2012-Feb-20.void.ttl. The metadata is then loaded with:
root@gemini:/mnt/raid/srv/logd/data/source/twc-rpi-edu/link-guesses/version/2012-Feb-20# publish/bin/virtuoso-load-twc-rpi-edu-link-guesses-2012-Feb-20.sh --meta
We can verify that the link guess metadata made it into the graph <http://logd.tw.rpi.edu/vocab/Dataset> by grabbing a guess URI from the file and querying for its descriptions:
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?g ?p ?o
WHERE {
  GRAPH ?g {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesses/2012-Feb-20/guess/10>
      ?p ?o
  }
}
Now that this information is loaded into the endpoint, we can query for all datasets that have been analyzed by the Link Guesser, check if they should be linked, and then actually make the link in the dataset.
The second part will be done by a simple PHP script that queries the endpoint for all datasets that have been analyzed but not yet linked (what enhancements have already been applied to a dataset is also recorded in the same void graph).
TODO: make query
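One candidate shape for that query, assuming the link-guess vocabulary sketched above (the ':' prefix and the enhancement-marker pattern are assumptions until the vocabulary is settled):

PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
# ':' stands in for the not-yet-defined link-guesser vocabulary.
PREFIX : <http://logd.tw.rpi.edu/vocab/linkguess/>
SELECT DISTINCT ?dataset ?predicate ?concept ?confidence
WHERE {
  ?predicate :hasLinkGuess ?guess .
  ?guess :dataset ?dataset ;
         :link_concept ?concept ;
         :confidence ?confidence .
  # Hypothetical pattern: skip datasets that already have a links-via
  # enhancement; the real pattern depends on how enhancements are
  # recorded in the void graph.
  FILTER NOT EXISTS { ?dataset conversion:enhanced_by ?enhancement }
}
ORDER BY DESC(?confidence)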
The script will display to the user all of the datasets that fit this description. The user can then choose a dataset, view the predicate(s) that have link potential and see a sample of the values for that predicate (if sample data for this dataset is loaded into the endpoint).
After a review of the information, the user can decide if this link should be made. The script will then modify the enhancement parameters for that dataset by adding a LinksVia using Instance Hub Category US States. The dataset can be converted again (by pulling the conversion trigger) and the links will now be explicit in the dataset.
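Illustratively, the added enhancement might look like the following params.ttl fragment; every name here is an assumption, and the real syntax should be copied from an existing enhancement params.ttl:

# Hypothetical params.ttl fragment -- property names are illustrative,
# not confirmed csv2rdf4lod syntax.
:enhancement_3
   a conversion:Enhancement;
   conversion:enhance [
      ov:csvCol 4;   # the column that produced .../vocab/raw/state
      # Link the column's values via the Instance Hub category of US States
      # (URI assumed for illustration).
      conversion:links_via <http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-us-states>;
   ];
.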
When the user picks a dataset to enhance by linking to Instance Hub, the PHP script must know the latest enhancement number for this versioned dataset. For example, for http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30 we need a way to know whether the current version is raw, e1, e2, etc. We need to know this so we can find the params.ttl for this dataset and create a new enhancement layer, and we need to do this automatically. A sketch of that lookup follows.
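Assuming each conversion layer in the void graph records its enhancement identifier with conversion:enhancement_identifier (the exact predicate should be confirmed against a published void.ttl), the lookup might be:

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?layer ?e
WHERE {
  GRAPH <http://logd.tw.rpi.edu/vocab/Dataset> {
    <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30>
      void:subset ?layer .
    # Assumed: "raw", "1", "2", ... per conversion layer.
    ?layer conversion:enhancement_identifier ?e .
  }
}
ORDER BY DESC(?e)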
What currently needs to be decided/built:
- The vocabulary for expressing link potential from the Link Guesser in the void.ttl needs to be defined and agreed upon.
- The PHP script that displays and modifies the dataset needs to be written.
We're setting these aside for now:
- predicates that collide across multiple tables
- global enhancement parameters are used instead of local ones (but we'd want local, because we just specified e3 with LinksVia); implement CSV2RDF4LOD_PUBLISH_GLOBAL_ENHANCEMENTS_STANCE="top-down" or "bottom-up" (we are currently top-down)