Skip to content

Shape Validation Algorithm

Eric Prud'hommeaux edited this page May 22, 2021 · 9 revisions

Mapping Triples to TripleConstraints

The specification defines satisfies but does not provide an efficient way to test that. The algorithm here relies on regular bag expressions (explained below). It is necessarily complex as it accepts arbitrarily complex expressions such as (:foo . | :bar .)+ or foaf:name . | foaf:givenName . + ; foaf:familyName.

Given an intentionally pathological shape

<S1> {
  (
    <p1> . ; # TC1
    <p2> .   # TC2
  | <p3> . ; # TC3
    <p2> .+  # TC4
  ) ;
  <p2> [1]   # TC5
}

and an input neighborhood

<n1> <p2> 1,  #T1
          2,  #T2
          3 ; #T3
     <p3> 4 . #T4

, we can ignore the value restrictions and create a mapping from TC -> T based on predicate alone, a la:

TC [T]
1
2 T1, T2, T3
3 T4
4 T1, T2, T3
5 T1, T2, T3

It's actually more efficient to record the mapping from T -> TC:

T [TC]
1 TC2, TC4, TC5
2 TC2, TC4, TC5
3 TC2, TC4, TC5
4 TC3

and pick exactly one assignment of T -> TC from the above list.

Possible assignments of T to TC can be restricted to those which satisfy the single-occurrence regular bag expression (SORBE) (12|34+)5. A regular bag expression can be satisfied by taking scanning the input for any matching symbol (vs. "regular" regular expression where the state machine accepts only the next token). Theorem 8.3 in http://researchers.lille.inria.fr/~staworko/papers/staworko-arxiv-shex.pdf shows a PTIME algorithm to evaluate SORBEs.

Now we "count" our way through the assignments and try to satisfy the SORBE:

T TC
1 TC2
2 TC2
3 TC2
4 TC3

produces the ordered) series: 2223, which doesn't match the SORBE.

T TC
1 TC4
2 TC2
3 TC2
4 TC3

produces the ordered) series: 4223, which doesn't match the SORBE.

...

T TC
1 TC4
2 TC4
3 TC5
4 TC3

produces the ordered) series: 3445, which matches the SORBE. A candidate solution is

<S1> {
  (
    <p1> . ; # TC1 -> ∅
    <p2> .   # TC2 -> ∅
  | <p3> . ; # TC3 -> T4
    <p2> .+  # TC4 -> T1, T2
  ) ;
  <p2> [1]   # TC5 -> T3
}

But wait, the values don't match. Reject that and keep going until we get to:

T TC
1 TC5
2 TC4
3 TC4
4 TC3

which produces also produces the (ordered) series: 3445, which matches the SORBE. The solution:

<S1> {
  (
    <p1> . ; # TC1 -> ∅
    <p2> .   # TC2 -> ∅
  | <p3> . ; # TC3 -> T4
    <p2> .+  # TC4 -> T2, T3
  ) ;
  <p2> [1]   # TC5 -> T1
}

satisfies all value expressions.

Clone this wiki locally