Characterizing a list of RDF node URIs
How informative can we make a very large list of RDF node URIs (really, rdfs:Resources and owl:Things)? Consider what we can do WITHOUT any edges between the nodes, having instead only the hierarchically-named nodes. URI design encourages semantics "between the edges", so what techniques can we use to benefit from well-designed URIs?
For example, how would you tackle a list of 1,061,789 RDF nodes? What about them would you want to know, to start to make sense of the ones you care about? What could you do without the triples that relate them? This implements a FAqT Service to determine some of the characteristics below:
- Sort them lexicographically.
- What protocol do they use? e.g. http, di, file, ftp, tag, urn
- What domain are they in? e.g. http://aspe.hhs.gov, http://blast.ncbi.nlm.nih.gov, change.CSV2RDF4LOD_BASE_URI.in-source-me.sh.localhost, http://code.google.com, http://dbpedia.org, http://logd.tw.rpi.edu, http://purl.bioontology.org
- Does it have a file extension? Does it end in a hash? Does it end in a slash?
- How long is the URI? How many slash steps does it have?
- Does it [not] contain the string "thing_"? If so, it is likely to be a raw conversion from csv2rdf4lod. (Similarly for other tools)
- Does it [not] contain the string "hospital-compare"? If so, it is likely part of a particular dataset.
- How many times does it occur in the list?
- Look at the URIs as a tree: at each step, how many total occurrences are there, and of how many distinct values?
- Sort them lexicographically and plot their lengths from small to large. Darken repeated URIs according to their frequency of occurrence. Highlight the points of high length derivative. Determine clusters of lengths and notice when they break.
This would be wavy if it were a streamgraph, because there is only one value at one step ('provider') and more than one value a couple of steps below it (e.g. {231313, 231314} and {'owner', 'telephone'}):
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/owner>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/telephone>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/address>
<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/owner>
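The per-step counting can be sketched as follows. This is a minimal illustration, not the FAqT Service's implementation: the function name `step_counts` is hypothetical, and it groups steps by depth only (not by full prefix), which is a simplification of a true prefix tree.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def step_counts(uris):
    """For each depth, return (depth, total occurrences, distinct values).

    Depth 0 is the scheme plus domain; each later depth is one slash step.
    Assumption: steps are grouped by depth alone, not by their full prefix.
    """
    totals = Counter()
    distinct = defaultdict(set)
    for uri in uris:
        parts = urlsplit(uri.strip("<>"))
        steps = [parts.scheme + "://" + parts.netloc]
        steps += [s for s in parts.path.split("/") if s]
        for depth, step in enumerate(steps):
            totals[depth] += 1
            distinct[depth].add(step)
    return [(d, totals[d], len(distinct[d])) for d in sorted(totals)]


uris = [
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/owner>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231313/telephone>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/address>",
    "<http://purl.org/twc/health/source/hub-healthdata-gov/provider/231314/owner>",
]

for depth, total, n_distinct in step_counts(uris):
    print(depth, total, n_distinct)
```

For the four URIs above, the 'provider' depth has a single distinct value while the two depths below it have more, which is the widening the paragraph describes.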
- Does it have an "extension[less]" pair? e.g. http://purl.org/twc/health/void and http://purl.org/twc/health/void.ttl
- Does it group fragment identifiers? e.g. http://purl.org/twc/id/machine/lebot/MacBookPro6_2 and http://purl.org/twc/id/machine/lebot/MacBookPro6_2#lebot
- What is its HTTP response [for Accept = rdf]?
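Several of the questions above can be answered from the URI string alone. The sketch below is one possible approach using only the Python standard library; the function names and dictionary keys are hypothetical, the extension test is a heuristic (a last step such as `v1.2` would be reported as having the extension `.2`), and the HTTP response check is left out because it requires dereferencing.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit
import posixpath


def characterize(uri):
    """Return string-level features of one URI (no dereferencing)."""
    parts = urlsplit(uri)
    last_step = parts.path.rsplit("/", 1)[-1]
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "extension": posixpath.splitext(last_step)[1],  # '' when none
        "ends_in_hash": uri.endswith("#"),
        "ends_in_slash": uri.endswith("/"),
        "length": len(uri),
        "slash_steps": len([s for s in parts.path.split("/") if s]),
        "contains_thing_": "thing_" in uri,  # hint of a raw csv2rdf4lod conversion
    }


def extensionless_pairs(uris):
    """Find (base, base + extension) pairs that both occur in the list."""
    present = set(uris)
    pairs = []
    for uri in sorted(present):
        if "#" in uri:
            continue  # assumption: fragments are handled separately
        base, ext = posixpath.splitext(uri)
        if ext and base in present:
            pairs.append((base, uri))
    return pairs


def fragment_groups(uris):
    """Group URIs that share the same part before '#'; keep groups of 2+."""
    groups = defaultdict(set)
    for uri in uris:
        groups[uri.partition("#")[0]].add(uri)
    return {b: sorted(m) for b, m in groups.items() if len(m) > 1}


uris = [
    "http://purl.org/twc/health/void",
    "http://purl.org/twc/health/void.ttl",
    "http://purl.org/twc/id/machine/lebot/MacBookPro6_2",
    "http://purl.org/twc/id/machine/lebot/MacBookPro6_2#lebot",
    "http://purl.org/twc/health/void",
]

occurrences = Counter(uris)  # how many times each URI occurs in the list
for uri in sorted(occurrences):
    print(occurrences[uri], characterize(uri))
print(extensionless_pairs(uris))
print(fragment_groups(uris))
```

Everything here runs in a single pass over the list plus a sort, which matters for a list of about a million URIs.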
Between The Edges RDF vocabulary prefix bte: http://purl.org/twc/vocab/between-the-edges/ (OWL):
- bte:RDFNode rdfs:subClassOf rdfs:Resource .
- bte:BTENode rdfs:subClassOf bte:RDFNode .
- bte:broader rdfs:subPropertyOf skos:broader .
- bte:HashURI contains a # character.
- bte:HashEndURI rdfs:subClassOf bte:HashURI . (have # at the end)
- bte:SlashURI owl:disjointWith bte:HashURI .
- bte:SlashEndURI rdfs:subClassOf bte:SlashURI . (have slash at the end)
- bte:Domain rdfs:subClassOf bte:RDFNode .
- bte:DotGov rdfs:subClassOf bte:Domain .
- bte:Purl
- bte:HTTPURI, bte:MailToURI, bte:TagURI, bte:URN, bte:DiURI
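As an illustration of how these classes might be assigned from the URI string, here is a minimal sketch. It is not the vocabulary's normative definition: the function name `bte_classes` is hypothetical, the scheme-to-class mapping is inferred from the class names, and treating a SlashURI as "contains a slash and no #" is an assumption drawn only from the disjointness axiom above.

```python
from urllib.parse import urlsplit

# Assumption: each scheme class corresponds to the URI scheme its name suggests.
SCHEME_CLASSES = {
    "http": "bte:HTTPURI",
    "mailto": "bte:MailToURI",
    "tag": "bte:TagURI",
    "urn": "bte:URN",
    "di": "bte:DiURI",
}


def bte_classes(uri):
    """Return the set of bte: class names suggested by the URI string."""
    classes = {"bte:RDFNode"}
    scheme = urlsplit(uri).scheme
    if scheme in SCHEME_CLASSES:
        classes.add(SCHEME_CLASSES[scheme])
    if "#" in uri:
        classes.add("bte:HashURI")
        if uri.endswith("#"):
            classes.add("bte:HashEndURI")
    elif "/" in uri:
        # Assumption: a SlashURI has at least one slash and no '#'.
        classes.add("bte:SlashURI")
        if uri.endswith("/"):
            classes.add("bte:SlashEndURI")
    return classes


for uri in [
    "http://purl.org/twc/id/machine/lebot/MacBookPro6_2#lebot",
    "http://purl.org/twc/vocab/between-the-edges/",
]:
    print(uri, sorted(bte_classes(uri)))
```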
For the PrefixTree FileFormat, see https://github.com/timrdf/DataFAQs/wiki/BTE-Between-The-Edges#prefixtree
- BTE Between The Edges at DataFAQs wiki
- Our notes on summarizing the Billion Triples Challenge use some techniques to summarize lists of URIs.