Tim L edited this page Mar 11, 2015 · 43 revisions

In our COLD 2013 paper, we identified better techniques for crawling Linked Data as future work:

In the example and demonstration presented in this paper, the property paths that we used in the deref function were determined manually using a relatively comprehensive understanding of the data that would be collected. To fully leverage Linked Data foraging in less controlled environments, more powerful techniques are required to 1) provide context-free overview+detail of arbitrary RDF data that has been accumulated and 2) empower the analyst to steer and throttle the automated foraging.

Although we gave an example related to limiting which domains to crawl...

For example, one forage execution from our demonstration retrieved Linked Data from the dx.doi.org domain. Although this provided valuable bibliographic information that could be used in related analyses, it was ancillary for the current task, invoked their servers unnecessarily, and cluttered the accumulated data. It will therefore be important in the future to control not only what kind of data should be accumulated, but also from where (or, correspondingly, from whom).

... we go in a slightly different direction with the Linked Data Auger. The Linked Data Auger steers and constrains how Linked Data is gathered based on what we need (and only what we need) for a specific view. It also handles restricting which domains to crawl as a simple case!
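To make the simple case concrete, here is a minimal sketch of a crawl that only dereferences URIs from an allowed set of domains. It is not the Linked Data Auger's implementation: the `forage` function, its parameters, the injected `fetch` callable, and the toy `example.org` data are all assumptions made for illustration. In practice `fetch` would dereference the URI and parse the returned RDF; here it reads from an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def forage(seed, fetch, allowed_domains, max_uris=100):
    """Breadth-first crawl from `seed` (hypothetical helper, not the Auger's API).

    `fetch(uri)` must return an iterable of (subject, predicate, object) tuples.
    Only URIs whose host is in `allowed_domains` are dereferenced, and at most
    `max_uris` URIs are visited, which throttles the crawl.
    """
    triples = set()
    visited = set()
    queue = deque([seed])
    while queue and len(visited) < max_uris:
        uri = queue.popleft()
        if uri in visited:
            continue
        # Steering: skip any URI that lives outside the allowed domains.
        if urlparse(uri).hostname not in allowed_domains:
            continue
        visited.add(uri)
        for s, p, o in fetch(uri):
            triples.add((s, p, o))
            # Simplification: treat any object that looks like an HTTP URI as
            # something to follow; a real RDF library would distinguish URI
            # references from literals by type.
            if isinstance(o, str) and o.startswith(("http://", "https://")):
                queue.append(o)
    return triples, visited


# Toy, made-up data standing in for dereferenced Linked Data.
TOY_DATA = {
    "http://example.org/bin": [
        ("http://example.org/bin", "ex:owner", "http://example.org/owner"),
        ("http://example.org/bin", "ex:cites", "http://dx.doi.org/10.0000/placeholder"),
    ],
    "http://example.org/owner": [
        ("http://example.org/owner", "ex:label", "An owner"),
    ],
    "http://dx.doi.org/10.0000/placeholder": [
        ("http://dx.doi.org/10.0000/placeholder", "ex:title", "A paper"),
    ],
}


def toy_fetch(uri):
    """Stand-in for dereferencing: look the URI up in TOY_DATA."""
    return TOY_DATA.get(uri, [])


triples, visited = forage("http://example.org/bin", toy_fetch, {"example.org"})
```

With `allowed_domains` set to `{"example.org"}`, the triple that points at dx.doi.org is still recorded, but that URI is never dereferenced, so its server is not invoked and its data does not clutter the result.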

An example

The following histogram shows the number of satellites that each country owns. Clicking the PNG image below links to a TIFF version that is [content preserved](Content Preserving Graphics), i.e., it has 320 RDF triples embedded as an alternate representation of the satellite grouping. Thus, the file can be considered both 1-star and 5-star data, which we denote as 15-star.

The tallest bar, the count of 5,793 satellites owned by the Commonwealth of Independent States, has a (rather long) URI, labeled here as "bin 52". We can start crawling Linked Data from that single URI using this SPARQL query to gather the 3,281 triples necessary for the following view:

Original data

The histograms were created from agi-com-satellite-database-2014-May-03.ttl.gz.

Related work