
Link Guess Workflow and Project

timrdf edited this page Feb 21, 2012 · 33 revisions

The Link Guesser is a parallel data link recommender that runs on high-performance computers such as the TWC Hercules machine and the CCNI. It reads semantic datasets and searches for predicates that could be linked to Instance Hub. It currently looks for US States and for wgs:lat and wgs:long information.

This page describes the current workflow of the Link Guesser and how it will fit into the greater TWC LOGD data conversion workflow.

The first step is to have the Link Guesser analyze a dataset. This is done by feeding the Guesser three inputs:

an N-Triples LOGD dataset, the named graph of that dataset, and the void.ttl file of that dataset (which comes from the CSV2RDF4LOD converter). Inputs:

  • The data to analyze (as N-Triples) (e.g. http://logd.tw.rpi.edu/source/data-gov/file/1000/version/2010-Aug-30/conversion/data-gov-1000-2010-Aug-30.ttl.gz)
  • The named graph of that dataset (e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30)

Write output to automatic/data-gov-1000-2010-Aug-30.ttl.nt.void.ttl
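The output name above appears to be derived from the input file name. The following is a minimal sketch of that naming, inferred from the single example on this page; the function name `guess_output_path` and the rule itself (drop `.gz`, append `.nt.void.ttl`, place under `automatic/`) are assumptions, not the Guesser's documented behavior.

```python
import posixpath
from urllib.parse import urlparse


def guess_output_path(data_url, out_dir="automatic"):
    """Derive the void.ttl output path from a dataset dump URL.

    Assumption: the rule is inferred from one example on this page.
    """
    # Take the last path segment of the URL, e.g. "data-gov-1000-2010-Aug-30.ttl.gz".
    name = posixpath.basename(urlparse(data_url).path)
    # Drop the compression suffix if present.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    # Append the suffix seen in the example output name.
    return posixpath.join(out_dir, name + ".nt.void.ttl")
```

Called with the example data URL above, this yields the example output path.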

The Guesser reports a list of predicates that match either a US State or a Lat/Long, and provides a score of how well it believes each predicate can be linked to Instance Hub. The Guesser expresses its guesses in RDF similar to the following:

<http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30>
 void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1000/version/2010-Aug-30/meta/link-guesser>;
 void:dataDump <http://logd.tw.rpi.edu/source/twc-rpi-edu/file/1000/version/link-guesses/conversion/data-gov-1000-2010-Aug-30.ttl.gz> 
.
<http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state> 
  :hasLinkGuess [ a :LinkGuess;
     prov:wasAttributedTo   :link_guesser_2000;
     dcterms:dateTime "2012-02-04T23:00:00Z";
     :dataset <http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16>;
     :link_concept <http://dbpedia.org/class/yago/StatesOfTheUnitedStates>;
     :confidence 85;
  ];
.

:link_guesser_2000
   a doap:Software;
   dcterms:creator "Jesse";
   dcterms:contributor "Greg";
   dcterms:contributor "Dominic" .

In this example, for the EPA dataset "Toxin Release Into The Atmosphere", the Guesser has identified the predicate http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state as possibly being able to link to the Instance Hub category of US States with a confidence of 85 out of 100.
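As an illustration of how one guess maps onto the RDF shape above, here is a minimal Python sketch that renders a single guess as Turtle text. The function name `format_link_guess` is hypothetical, and the `:`-prefixed terms are copied from the example as-is; that vocabulary has not been agreed upon yet, so treat the property names as placeholders.

```python
from datetime import datetime, timezone


def format_link_guess(predicate, dataset, concept, confidence, when,
                      agent=":link_guesser_2000"):
    """Render one link guess in the shape of the example above.

    `when` must be a timezone-aware datetime; it is written out in UTC.
    Prefix declarations are omitted, as in the example.
    """
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join([
        f"<{predicate}>",
        "  :hasLinkGuess [ a :LinkGuess;",
        f"     prov:wasAttributedTo   {agent};",
        f'     dcterms:dateTime "{stamp}";',
        f"     :dataset <{dataset}>;",
        f"     :link_concept <{concept}>;",
        f"     :confidence {confidence};",
        "  ];",
        ".",
    ])


if __name__ == "__main__":
    print(format_link_guess(
        "http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/vocab/raw/state",
        "http://logd.tw.rpi.edu/source/epa-gov/dataset/toxin_release_into_the_atmosphere/version/2011-Dec-16",
        "http://dbpedia.org/class/yago/StatesOfTheUnitedStates",
        85,
        datetime(2012, 2, 4, 23, 0, 0, tzinfo=timezone.utc),
    ))
```

Keeping the confidence as a plain integer in the 0 to 100 range matches the "85 out of 100" reading of the example.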

This void.ttl can now be loaded into the void graph in the http://logd.tw.rpi.edu/sparql endpoint.

Now that this information is loaded into the endpoint, we can query for all datasets that have been analyzed by the Link Guesser, check if they should be linked, and then actually make the link in the dataset.

The second part will be done by a simple PHP script that queries the endpoint for all datasets that have been analyzed but not yet linked (the record of which enhancements have been made to a dataset is also found in the same void graph). The script will display to the user all of the datasets that fit this description. The user can then choose a dataset, view the predicate(s) that have link potential, and see a sample of the values for each predicate (if sample data for this dataset is loaded into the endpoint).
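The query step described above could look roughly like the following. This is a sketch in Python rather than the planned PHP, used only to illustrate the query. The namespace `GUESS_NS` is a made-up placeholder because the Link Guesser vocabulary is not yet defined, and the "not yet linked" filter is left out because this page does not say how applied enhancements are recorded in the void graph.

```python
from urllib.parse import urlencode

# Endpoint named on this page.
ENDPOINT = "http://logd.tw.rpi.edu/sparql"

# Placeholder namespace (assumption): the real vocabulary is still undecided.
GUESS_NS = "http://example.org/link-guesser#"


def pending_guesses_query(min_confidence=0):
    """Build a SPARQL query listing guessed predicates per dataset."""
    return f"""PREFIX lg: <{GUESS_NS}>
SELECT DISTINCT ?dataset ?predicate ?concept ?confidence
WHERE {{
  ?predicate lg:hasLinkGuess ?guess .
  ?guess lg:dataset ?dataset ;
         lg:link_concept ?concept ;
         lg:confidence ?confidence .
  FILTER (?confidence >= {int(min_confidence)})
}}
ORDER BY DESC(?confidence)"""


def query_url(query, endpoint=ENDPOINT):
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})
```

A script would fetch `query_url(pending_guesses_query(50))` and list the returned datasets for the user to review; the threshold of 50 is arbitrary.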

After a review of the information, the user can decide if this link should be made. The script will then modify the enhancement parameters for that dataset by adding a LinksVia using Instance Hub Category US States. The dataset can be converted again (by pulling the conversion trigger) and the links will now be explicit in the dataset.

What currently needs to be decided/built:

  • The vocabulary for expressing this link potential from the Link Guesser into the void.ttl needs to be defined and agreed upon.
  • The PHP script that displays and modifies the dataset needs to be written.

Outstanding issues

We're setting these aside:

  • predicates collide over multiple tables
  • global is used instead of local (but we'd want it to be local b/c we just specified e3 with LinksVia) - implement CSV2RDF4LOD_PUBLISH_GLOBAL_ENHANCEMENTS_STANCE="top-down" or "bottom-up" (we are currently top-down)