
Characterizing a list of RDF node URIs

Tim L edited this page May 20, 2014 · 38 revisions

What is first

What we will cover

How informative can we make a very large list of RDF node URIs (really, rdfs:Resources and owl:Things)? Consider what we can do WITHOUT any edges between the nodes, having instead only the hierarchically-named nodes. URI design encourages semantics "between the edges", so what techniques can we use to benefit from well-designed URIs?

For example, how would you tackle a list of 1,061,789 RDF nodes? What about them would you want to know, to start to make sense of the ones you care about? What could you do without the triples that relate them? This implements a FAqT Service to determine some of the characteristics below:

  • Sort them lexicographically.

Context free characteristics

Contextual characteristics

  • Does it [not] contain the string "thing_"? If so, it is likely to be a raw conversion from csv2rdf4lod. (Similarly for other tools)
  • Does it [not] contain the string "hospital-compare"? If so, it is likely part of a particular dataset.
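
The substring checks above can be sketched in a few lines of Python. This is only an illustration, not the FAqT Service itself: the `MARKERS` table, the `contextual_hints` function, and the sample URI are hypothetical names invented for this example; only the two marker strings come from the list above.

```python
# Hypothetical marker table: substring -> what its presence suggests.
# Only the two marker strings come from the page; the wording of the
# hints and the table itself are assumptions for illustration.
MARKERS = {
    "thing_": "likely a raw csv2rdf4lod conversion",
    "hospital-compare": "likely part of the hospital-compare dataset",
}


def contextual_hints(uri):
    """Return the hints whose marker substring occurs in the URI."""
    return [hint for marker, hint in MARKERS.items() if marker in uri]


# Invented sample URI, used only to exercise the function.
sample = "http://example.org/source/hospital-compare/dataset/x/thing_42"
hints = contextual_hints(sample)
```

A table of markers keeps the "similarly for other tools" case cheap: supporting another tool or dataset is one more entry rather than another branch.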

Relative characteristics

  • How many times does it occur in the list?
  • Treating each URI as a path in a tree, how many total occurrences does each step have, spread across how many distinct values?
  • Sort them lexicographically and plot their lengths from small to large. Darken repeated URIs according to their frequency of occurrence. Highlight the points where the length derivative is high. Determine clusters of lengths and notice where they break.
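
As a rough sketch of the counting and length steps above (plotting is left out), the following Python computes, for each distinct URI in lexicographic order, its length and its frequency, and then flags positions where the length changes sharply between neighbours. The function names, the `threshold` parameter, and the sample URIs are assumptions for illustration, and "length derivative" is interpreted here as the difference in length between lexicographically adjacent URIs.

```python
from collections import Counter


def length_profile(uris):
    """Return (uri, length, frequency) rows for the distinct URIs,
    sorted lexicographically."""
    counts = Counter(uris)
    return [(uri, len(uri), counts[uri]) for uri in sorted(counts)]


def length_jumps(profile, threshold):
    """Return the row indices whose length differs from the previous
    row's length by at least `threshold` (a crude discrete derivative)."""
    return [
        i
        for i in range(1, len(profile))
        if abs(profile[i][1] - profile[i - 1][1]) >= threshold
    ]


# Invented sample list with one repeated URI.
uris = [
    "http://example.org/a/1",
    "http://example.org/a/2",
    "http://example.org/a/1",
    "http://example.org/b/long-segment-name",
]
profile = length_profile(uris)
jumps = length_jumps(profile, threshold=10)
```

The frequency column is what would drive the darkening of repeated URIs, and the indices from `length_jumps` mark candidate boundaries between clusters of similar length.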

The following URIs would look wavy if drawn as a stream graph, because the 'provider' step has only one value while the steps a couple of levels below it have more than one (e.g. {231313, 231314} and {'owner', 'telephone'}):

<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/owner>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/telephone>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/address>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/owner>
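
One way to make the per-step view concrete is to split each URI's path into segments and count, at every depth, the total occurrences and the number of distinct values. The sketch below does this for the four URIs above; `step_profile` is a hypothetical helper name, and ignoring the scheme and host (counting only path segments) is an assumption of this example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def step_profile(uris):
    """For each path depth, return (total occurrences, distinct values)."""
    steps = defaultdict(Counter)
    for uri in uris:
        # Drop the N-Triples style angle brackets before parsing.
        path = urlsplit(uri.strip("<>")).path
        segments = [s for s in path.split("/") if s]
        for depth, segment in enumerate(segments):
            steps[depth][segment] += 1
    return [(sum(c.values()), len(c)) for _, c in sorted(steps.items())]


uris = [
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/owner>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/telephone>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/address>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/owner>",
]
profile = step_profile(uris)
```

For this input, every step up to and including 'provider' has four occurrences of a single value, the next step has two distinct values, and the last has three. That widening is what the stream graph would show.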

BTE vocabulary

Between The Edges RDF vocabulary prefix bte: http://purl.org/twc/vocab/between-the-edges/ (OWL):

  • bte:RDFNode rdfs:subClassOf rdfs:Resource .
  • bte:BTENode rdfs:subClassOf bte:RDFNode .
  • bte:broader rdfs:subPropertyOf skos:broader .
  • bte:HashURI contains a # character.
    • bte:HashEndURI rdfs:subClassOf bte:HashURI . (have # at the end)
  • bte:SlashURI owl:disjointWith bte:HashURI .
    • bte:SlashEndURI rdfs:subClassOf bte:SlashURI . (have slash at the end)
  • bte:Domain rdfs:subClassOf bte:RDFNode .
    • bte:DotGov rdfs:subClassOf bte:Domain .
    • bte:Purl
  • bte:HTTPURI, bte:MailToURI, bte:TagURI, bte:URN, bte:DiURI
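
A minimal sketch of how a URI string might be assigned some of these classes follows. Only the "contains #", "# at the end", and "slash at the end" rules come from the list above. Everything else is an assumption made for illustration: the mapping from URI scheme to class is inferred from the class names, a SlashURI is taken to be any URI that contains a slash and no #, and `bte_classes` is a hypothetical function name. bte:DiURI and the bte:Domain subclasses are left out because the page does not say how they are determined.

```python
from urllib.parse import urlsplit

# Assumed mapping from URI scheme to class, inferred from the class names.
SCHEME_CLASSES = {
    "http": "bte:HTTPURI",
    "https": "bte:HTTPURI",
    "mailto": "bte:MailToURI",
    "tag": "bte:TagURI",
    "urn": "bte:URN",
}


def bte_classes(uri):
    """Return the set of bte: class names suggested by the URI's shape."""
    classes = set()
    scheme = urlsplit(uri).scheme
    if scheme in SCHEME_CLASSES:
        classes.add(SCHEME_CLASSES[scheme])
    if "#" in uri:
        classes.add("bte:HashURI")
        if uri.endswith("#"):
            classes.add("bte:HashEndURI")
    elif "/" in uri:
        # Assumption: no '#' plus at least one slash means SlashURI,
        # which keeps it disjoint with HashURI.
        classes.add("bte:SlashURI")
        if uri.endswith("/"):
            classes.add("bte:SlashEndURI")
    return classes


vocab = bte_classes("http://purl.org/twc/vocab/between-the-edges/")
rdfs = bte_classes("http://www.w3.org/2000/01/rdf-schema#")
```

Returning a set of class names keeps the sketch independent of any RDF library; emitting actual rdf:type triples would be a separate step.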

For the PrefixTree FileFormat, see https://github.com/timrdf/DataFAQs/wiki/BTE-Between-The-Edges#prefixtree

What is next
