Tobias Kuhn, 23 February 2015, http://www.trustyuri.net
This document contains version 1 of the specification of the trusty URI approach.
The previous version of this document was version 0, which was preliminary and not stable: http://trustyuri.net/spec/v0.FA4BwXfTl2X-ABWKUF2k0T044yS2-KmO_R0zBftSsc96k.md
Version 1 described here is not backward compatible with version 0 for module `RA`
(the handling of plain literals according to RDF 1.1 breaks compatibility).
Future versions, however, are supposed to be 100% backward compatible with this
version. This means that when the extension of modules cannot be done in a
backward compatible manner in the future, new modules will have to be defined
and old ones will have to be deprecated.
Hash values of trusty URIs are encoded in Base64 notation with some common modifications for making it safe to use them in URIs and file names:
Definition 1. Every character that is a standard ASCII letter (`A-Z` or `a-z`), a digit (`0-9`), a hyphen (`-`), or an underscore (`_`) is called a Base64 character, representing in this order the numbers from 0 to 63. There are no other Base64 characters.
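As an illustration of Definition 1, the following minimal Python sketch spells out the alphabet and the number each character represents. The names `BASE64_CHARS`, `is_base64_char`, and `base64_value` are chosen here for illustration and are not part of the specification.

```python
import string

# The 64 Base64 characters of Definition 1; a character's index is its value.
BASE64_CHARS = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
)


def is_base64_char(ch: str) -> bool:
    """True if `ch` is exactly one Base64 character in the sense of Definition 1."""
    return len(ch) == 1 and ch in BASE64_CHARS


def base64_value(ch: str) -> int:
    """Return the number from 0 to 63 that the Base64 character `ch` represents."""
    if not is_base64_char(ch):
        raise ValueError(f"not a Base64 character: {ch!r}")
    return BASE64_CHARS.index(ch)
```

This is the same alphabet and order as the URL-safe variant of Base64, so `+` and `/` are not Base64 characters here.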
Trusty URIs have the following structure:
Definition 2. Every trusty URI ends with at least 25 Base64 characters. The sequence of characters following the last non-Base64 character is called the artifact code. The first two characters of the artifact code are called the module identifier. The sequence of characters following the module identifier is called the data part, which is identical to or contains a hash part.
The current modules only generate URIs with exactly 45 trailing Base64 characters, but the definition is kept open for future modules.
The first character of the module identifier specifies the type of the content and therefore the type of the module; the second character is a version identifier of the module. The main content of the data part is the hash value, but it can also contain other information such as parameters and sub-types. Its concrete structure depends on the module.
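The structure of Definition 2 can be sketched in Python as follows. The function names `artifact_code` and `split_trusty_uri` are hypothetical; the sketch only checks the structural rule (at least 25 trailing Base64 characters) and says nothing about whether the module identifier is defined.

```python
from typing import Optional, Tuple

# Base64 characters according to Definition 1.
BASE64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)


def artifact_code(uri: str) -> str:
    """Return the characters following the last non-Base64 character of `uri`."""
    start = len(uri)
    while start > 0 and uri[start - 1] in BASE64_CHARS:
        start -= 1
    return uri[start:]


def split_trusty_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (module identifier, data part), or None if Definition 2 is not met."""
    code = artifact_code(uri)
    if len(code) < 25:
        return None
    return code[:2], code[2:]
```

For a URI ending in `.RA` followed by 43 further Base64 characters, `split_trusty_uri` returns `"RA"` and those 43 characters; for a URI ending in `#Part1` it returns `None`, because only five Base64 characters follow the hash sign.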
As everybody who has access to the respective domain is free to define and use URIs at will, we can only be sure that a certain URI is a trusty URI once we have found and verified a content that matches the hash. For that reason, the concept of a potential trusty URI needs to be introduced:
Definition 3. Every URI that could be a trusty URI according to the restrictions of Definition 2 with a module identifier matching a defined module and with a data part that is consistent with the structural restrictions of the given module (in particular with respect to its length) is called a potential trusty URI.
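A check for Definition 3 could look like the sketch below. It assumes that the only structural restriction of the three modules described in this document is a data part of exactly 43 characters (45 trailing Base64 characters in total); the table `DATA_PART_LENGTH` and the function name are illustrative, not part of the specification.

```python
import re

# Assumed structural restrictions: module identifier -> length of the data part.
DATA_PART_LENGTH = {"FA": 43, "RA": 43, "RB": 43}

# Optional prefix ending in a non-Base64 character, then the artifact code
# (at least 25 Base64 characters) up to the end of the string.
_TRUSTY_SHAPE = re.compile(
    r"(?:.*[^A-Za-z0-9_-])?([A-Za-z0-9_-]{25,})", re.DOTALL
)


def is_potential_trusty_uri(uri: str) -> bool:
    """True if `uri` has the shape required by Definitions 2 and 3."""
    match = _TRUSTY_SHAPE.fullmatch(uri)
    if match is None:
        return False
    code = match.group(1)
    module_identifier, data_part = code[:2], code[2:]
    return DATA_PART_LENGTH.get(module_identifier) == len(data_part)
```

A positive result only means that the URI could be a trusty URI; whether it is one can only be decided against actual content.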
With these ingredients, trusty URIs can be verified:
Definition 4. Given a potential trusty URI and a digital artifact, if the module identifier refers to a module that returns a hash value for the digital artifact that is identical to the one encoded in the hash part, then the potential trusty URI is a verified trusty URI and the digital artifact is its verified content.
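The verification of Definition 4 can be sketched as a lookup of the module followed by a comparison of hash values. The sketch below registers only module `FA` (described later in this document), treats the whole data part as the hash part, and takes the artifact as a byte string; the registry `MODULES` and the function names are assumptions made for illustration.

```python
import base64
import hashlib
import re
from typing import Callable, Dict


def sha256_base64(content: bytes) -> str:
    """SHA-256 of `content` as 43 Base64 characters (unpadded, URL-safe alphabet)."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical module registry: module identifier -> hash function.
MODULES: Dict[str, Callable[[bytes], str]] = {"FA": sha256_base64}


def verify(uri: str, artifact: bytes) -> bool:
    """True if `artifact` is verified content of the potential trusty URI `uri`."""
    match = re.search(r"[A-Za-z0-9_-]{25,}\Z", uri)  # the artifact code
    if match is None:
        return False
    code = match.group(0)
    module = MODULES.get(code[:2])
    return module is not None and module(artifact) == code[2:]
```

With this sketch, a URI ending in the `FA` suffix for empty files verifies against `b""` and fails against any other content.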
For convenience reasons, we can append a file extension like `.txt` or `.nq` to
trusty URIs. The resulting URIs are technically no trusty URIs anymore, but it
is easy to strip the extension and get the respective trusty URIs.
As the hash values are located in the final part of the URIs, it is straightforward to also use them in file names and to deal with them in a local file system without worrying about the first part of the URI. For example, the name of such a file could therefore be:
r1.RAcbjcRIQozo2wBMq4WcCYkFAjRz0AX-Ux3PquZZrC68s.nq
Such files are called trusty files.
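Recovering the artifact code from the name of a trusty file amounts to stripping an optional extension and reading the trailing Base64 characters. The following sketch assumes that an extension is whatever follows the last dot; the function names are hypothetical.

```python
import re
from typing import Optional


def _trailing_base64(text: str) -> str:
    """Return the run of Base64 characters at the very end of `text`."""
    return re.search(r"[A-Za-z0-9_-]*\Z", text).group(0)


def artifact_code_of_file_name(name: str) -> Optional[str]:
    """Return the artifact code in a trusty file name, or None if there is none."""
    # First try the name as it is, then the name without its last extension.
    for candidate in (name, name.rpartition(".")[0]):
        code = _trailing_base64(candidate)
        if len(code) >= 25:
            return code
    return None
```

Applied to the file name shown above, the sketch drops `.nq` and returns the 45 characters starting with `RA`.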
There are currently three modules available: `FA`, `RA`, and `RB`.
Version A of module type F, i.e. module `FA`, works on the byte content of
files. A hash value is calculated using SHA-256 [SHA-256] on the content of the
file in byte representation. The file name and other metadata are not
considered. Two zero-bits are appended to the resulting hash value, and the
resulting 258 bits are then transformed to Base64 notation as defined above.
The resulting 43 characters make up the data part of the trusty URI.
Empty files, for example, get the following URI suffix:
FA47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU
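The computation for module `FA` can be sketched bit by bit in Python. The function name `fa_artifact_code` is chosen here for illustration; the sketch follows the description literally (256 hash bits, two appended zero-bits, 43 groups of six bits).

```python
import hashlib

# Base64 characters of Definition 1; a character's index is its value.
BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"


def fa_artifact_code(content: bytes) -> str:
    """Return the artifact code of module FA for the given byte content."""
    digest = hashlib.sha256(content).digest()  # 32 bytes = 256 bits
    # Appending two zero-bits gives 258 bits, i.e. exactly 43 six-bit groups.
    bits = int.from_bytes(digest, "big") << 2
    # Read the groups from the most significant to the least significant.
    data_part = "".join(
        BASE64_CHARS[(bits >> (6 * i)) & 63] for i in range(42, -1, -1)
    )
    return "FA" + data_part
```

For empty content, `fa_artifact_code(b"")` gives the suffix shown above. The same 43 characters result from unpadded URL-safe Base64 encoding of the digest, since that encoding also fills the last group with zero-bits.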
When adding such a suffix to a URI, it has to be made sure that it is preceded
by a non-Base64 character, such as a dot (`.`), a slash (`/`), or a hash sign
(`#`). This applies to all modules.
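One way to respect this rule when building URIs is sketched below. The helper `append_artifact_code` and its default separator are assumptions made for illustration: the sketch inserts a dot only if the base URI does not already end with a non-Base64 character.

```python
# Base64 characters according to Definition 1.
BASE64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)


def append_artifact_code(base_uri: str, code: str, separator: str = ".") -> str:
    """Append `code` to `base_uri` so that a non-Base64 character precedes it."""
    if not separator or separator[-1] in BASE64_CHARS:
        raise ValueError("separator must end with a non-Base64 character")
    if base_uri and base_uri[-1] not in BASE64_CHARS:
        # The base already ends with a character such as "/" or "#".
        return base_uri + code
    return base_uri + separator + code
```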
Version A of module type R, i.e. module `RA`, works on RDF content, possibly
covering multiple named graphs, relying on RDF version 1.1 [RDF].
This module allows for self-references, i.e. the trusty URI itself may appear in the RDF data it represents. URIs consisting of the given trusty URI and a suffix are also supported, such as:
http://example.org/r2.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70#Part1
http://example.org/r2.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70#Part2
Blank nodes are not supported and have to be skolemized in the same way when a trusty URI is produced. It is furthermore assumed that the data is a set of named RDF graphs. RDF triples without a named graph are considered to belong to a special named graph represented with the empty string.
To check whether a given artifact code correctly represents a given set of named graphs, the triples and graphs have to be sorted first. Because the trusty URI can appear in the RDF data it represents, all occurrences of the given artifact code in the URIs have to be replaced by a blank character in a preprocessing step. To determine the order of any two triples, the first applicable rule of the following list is applied:
- If their graph URIs differ, the triple with the lexicographically smaller preprocessed graph URI is first.
- If their subject URIs differ, the triple with the lexicographically smaller preprocessed subject URI is first.
- If their predicate URIs differ, the triple with the lexicographically smaller preprocessed predicate URI is first.
- If one has a literal as object and the other has a non-literal, the triple with the non-literal as object is first.
- If both have a URI as object, the triple with the lexicographically smaller preprocessed object URI is first.
- If the literal labels of the objects differ, the triple with the lexicographically smaller literal label is first.
- If one of the object literals has a datatype identifier and the other does not, the triple without a datatype identifier is first.
- If one of the object literals has a language identifier and the other does not, the triple without a language identifier is first.
- The triple with the lexicographically smaller datatype or language identifier is first.
The lexicographic order is defined on strings of Unicode characters. If two strings differ in at least one position, the string with the smaller integer value at the first differing position is first. Otherwise, one string is a prefix of the other, and the shorter string is first.
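The ordering rules can be sketched as a comparison function in Python. The data model is an assumption made for illustration: a triple with its graph is a tuple `(graph, subject, predicate, object)` of URI strings, where the object may instead be a `Literal` carrying either a datatype or a language identifier. Python compares strings code point by code point and puts a proper prefix first, which is taken here to match the lexicographic order defined above.

```python
from functools import cmp_to_key
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical model of an RDF literal."""
    label: str
    datatype: Optional[str] = None  # datatype URI, if any
    language: Optional[str] = None  # language identifier, if any


def _cmp(a: str, b: str) -> int:
    return (a > b) - (a < b)


def compare_quads(q1, q2, code: str) -> int:
    """Negative if q1 is first, positive if q2 is first, 0 if no rule decides."""
    def pre(uri: str) -> str:
        # Preprocessing: replace the artifact code by a blank character.
        return uri.replace(code, " ")

    for a, b in zip(q1[:3], q2[:3]):  # graph, subject, predicate
        result = _cmp(pre(a), pre(b))
        if result:
            return result
    o1, o2 = q1[3], q2[3]
    lit1, lit2 = isinstance(o1, Literal), isinstance(o2, Literal)
    if lit1 != lit2:
        return 1 if lit1 else -1  # the non-literal object is first
    if not lit1:
        return _cmp(pre(o1), pre(o2))
    result = _cmp(o1.label, o2.label)
    if result:
        return result
    if (o1.datatype is None) != (o2.datatype is None):
        return -1 if o1.datatype is None else 1
    if (o1.language is None) != (o2.language is None):
        return -1 if o1.language is None else 1
    return _cmp(o1.datatype or o1.language or "", o2.datatype or o2.language or "")


def sort_quads(quads, code: str):
    """Return the quads in the order defined by the rules above."""
    return sorted(quads, key=cmp_to_key(lambda a, b: compare_quads(a, b, code)))
```

Because of the preprocessing step, a subject URI that contains the artifact code is compared as if the code were a single blank character, so it can sort before URIs that would otherwise precede it.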
After the triples have been sorted, a sequence of Unicode characters s is
built. For each triple, the serializations of its graph, its subject, its
predicate, and its object are added to the end of s, in this order and with a
newline character at the end of each of the four. The serialization of graph,
subject, and predicate identifiers is simply their preprocessed URI string.
Objects that consist of a URI are treated the same way. Literals without a
language tag are serialized as a circumflex character (`^`) followed by the
datatype URI (which, according to RDF 1.1, equals
`http://www.w3.org/2001/XMLSchema#string` if not explicitly specified), a blank
space, and the escaped literal string. Literal strings are escaped by replacing
`\` by `\\` and newline characters by `\n`. Literals with a language tag are
serialized as an at-sign (`@`) followed by the canonicalized language string, a
blank space, and the escaped literal string.
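The serialization can be sketched as follows, using an assumed data model in which a triple with its graph is a tuple `(graph, subject, predicate, object)` of URI strings and a literal object is a `Literal` value. The text above does not define how language strings are canonicalized; lower-casing is assumed here purely for illustration.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Hypothetical model of an RDF literal."""
    label: str
    datatype: Optional[str] = None  # datatype URI, if explicitly specified
    language: Optional[str] = None  # language tag, if any


def escape_literal(label: str) -> str:
    # Backslashes are doubled first, so that the backslash introduced for
    # a newline is not doubled again.
    return label.replace("\\", "\\\\").replace("\n", "\\n")


def serialize_object(obj, code: str) -> str:
    if isinstance(obj, Literal):
        if obj.language is not None:
            # Assumption: canonicalization is modeled as lower-casing.
            return "@" + obj.language.lower() + " " + escape_literal(obj.label)
        return "^" + (obj.datatype or XSD_STRING) + " " + escape_literal(obj.label)
    return obj.replace(code, " ")  # URI object: preprocessed URI string


def serialize_quads(sorted_quads, code: str) -> str:
    """Build the character sequence s from already sorted quads."""
    parts = []
    for graph, subject, predicate, obj in sorted_quads:
        for uri in (graph, subject, predicate):
            parts.append(uri.replace(code, " ") + "\n")
        parts.append(serialize_object(obj, code) + "\n")
    return "".join(parts)
```

A triple without a named graph uses the empty string as its graph, so its serialization starts with an empty line.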
The actual computation of the hash data is identical to that of module `FA`: a SHA-256 hash is generated for s in UTF-8 encoding, two zero-bits are appended, and the result is transformed to Base64 notation.
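This last step could be sketched in Python as below. The function name and its `module_identifier` parameter are assumptions for illustration. The sketch relies on the fact that unpadded URL-safe Base64 encoding of the 32-byte digest fills the final six-bit group with two zero-bits and uses the alphabet of Definition 1.

```python
import base64
import hashlib


def artifact_code_for(s: str, module_identifier: str = "RA") -> str:
    """Return the module identifier followed by the 43-character hash of s."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    data_part = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return module_identifier + data_part
```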
Version B of module type R, i.e. module `RB`, is a slight variation of module
`RA`. While module `RA` can represent any number of RDF graphs, a trusty URI
using module `RB` always represents just one graph. The calculation of the hash
is identical to the procedure described above, with the only restriction being
that all triples have the trusty URI of the given resource as their graph URI.
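The additional restriction of module `RB` can be checked with a one-line test. The sketch assumes a hypothetical data model in which each triple with its graph is a tuple `(graph, subject, predicate, object)`; the function name is illustrative.

```python
def satisfies_rb_restriction(quads, trusty_uri: str) -> bool:
    """True if every triple has `trusty_uri` as its graph URI."""
    return all(graph == trusty_uri for graph, _s, _p, _o in quads)
```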
- [SHA-256] National Institute of Standards and Technology (NIST). "Secure Hash Standard (SHS)," FIPS PUB 180-4, March 2012. http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf
- [RDF] Richard Cyganiak, David Wood, and Markus Lanthaler. "RDF 1.1 Concepts and Abstract Syntax," W3C Recommendation, 25 February 2014. http://www.w3.org/TR/rdf11-concepts/