Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve validation rules for XSD integer datatypes #142

Open
tpluscode opened this issue Dec 1, 2021 · 8 comments
Open

Improve validation rules for XSD integer datatypes #142

tpluscode opened this issue Dec 1, 2021 · 8 comments
Labels
SHACL 1.1 Marks items that are reasonably significant changes for a potential SHACL 1.1

Comments

@tpluscode
Copy link

What

With SHACL 1.0 the sh:datatype constraint mandates exact match. That is, given a property shape with sh:datatype xsd:int, it will fail the check against the a value ""10"^^xsd:integer even though the type xsd:int is subtype of xsd:integer

As a reminder, here's the XSD types hierarchy from https://www.w3.org/TR/xmlschema-2/

type-hierarchy

Why

This is problematic because it turns out that multiple triplestore implementations will proactively "normalise" integer types on insert. As above, inserting "10"^^xsd:int will in fact store "10"^^xsd:integer in the database. I tried Fuseki (TDB1), Stardog and Allegrograph. It seems that all integer and derived datatypes will treated this way as long as their lexical form matches the given datatype. "-10"^^xsd:negativeInteger becomes "-10"^^xsd:integer etc.

This makes it impossible to roundtrip values created by a shape as shown below because when the resource comes back from the store, the type of schema:age will no longer match that of the property shape

<>
  a sh:property [
    sh:path schema:age ;
    sh:datatype xsd:nonNegativeInteger ;
  ] .

Proposal

I propose to prescribe to shacl processors that they should also "normalise" the property shapes but deconstructing xsd integer datatypes to xsd:integer and appropriate min/max as defined by XML Schema (unless explicitly provided by the shape itself)

sh:datatype default sh:minInclusive default sh:maxInclusive
xsd:long -9223372036854775808 9223372036854775807
xsd:int -2147483648 2147483647
xsd:short -32768 32767
xsd:byte -128 127
xsd:nonPositiveInteger none 0
xsd:negativeInteger none -1
xsd:nonNegativeInteger 0 none
xsd:positiveInteger 1 none
xsd:unsignedLong 0 18446744073709551615
xsd:unsignedInt 0 4294967295
xsd:unsignedShort 0 65535
xsd:unsignedInt 0 255
@HolgerKnublauch
Copy link
Contributor

Thinking out loud, I expect it would be hard for SHACL 1.1 to break SHACL 1.0 semantics. So for this to work without breaking the existing logic I guess SHACL 1.1 could introduce a second, optional flag for sh:DatatypeConstraintComponent to also accept, e.g. xsd:int when xsd:integer was requested.

So maybe sh:datatype xsd:integer ; sh:datatypeNormalization true ;

Furthermore, triple stores could indicate whether they can round-trip datatypes correctly and if not the SHACL engine can set the above flag true as the default.

@tpluscode
Copy link
Author

I don't necessarily agree that it's breaking 1.0 semantics for you cannot break what's already broken.

I think what stores do make sense from the perspective of XSD and robustness principle. All types derived from xsd:integer are essentially that plus range constraints. That is pretty much XSD's definition. Being "liberal in what we accept" means a validator should treat "10"^^xsd:integer as a valid xsd:byte

Thus, I conclude that the current behaviour of SHACL is flawed and deserves as minor version fix.

Furthermore, triple stores could indicate whether they can round-trip datatypes correctly and if not the SHACL engine can set the above flag true as the default.

I don't think this is practical. A web app which dereferenced the resource and shape has no idea whatsoever about backing triplestores

@afs
Copy link

afs commented Dec 2, 2021

Useful table.

A second flag to modify the constraint is better IMO. sh:datatypeByValue ("normalization" is usually more like a change to canonical form "0001" and "1" but same datatype).

As this does work today in many places, if not all, and some systems may currently wish to validate the exact datatype used (xsd:nonNegativeInteger is meaningful as it isn't due to the arbitrary 32 bit/64 bit issues), I don't think it can be classified as "broken".

The majority user feedback on TDB1 was to keep the datatype. Normalizing the lexicial part was just accepted.

@tpluscode
Copy link
Author

I think I will stand by "broken" since the base semantics of datatypes is defined by XSD spec. The stores will turn derived literal with XSD types to xsd:integer because that is what they are (the literals, not store). Of course, not everyone necessarily uses triple stores at all but it will be a major use case. And TDB is not the only implementation which does that.

Like I said, if I create a graph and it does not validate after round tripping from my database then this is a bug, not a feature. But I hear you, if the concern is breaking users depending on the current spec, then I'm probably proposing a change for SHACL 2.0 :)

Ah, and you make a good point about comparing the lexical form too. We could consider more holistic improvements, for example also to sh:hasValue. Consider https://s.zazuko.com/2boDM, where sh:hasValue 1 fails to validate "0001"^^xsd:integer

@HolgerKnublauch
Copy link
Contributor

HolgerKnublauch commented Dec 2, 2021

@tpluscode well, if the round-tripping does not work then it's a problem that the graph stores need to solve, not languages like SHACL downstream. I guess other systems like OWL (with its xsd:nonNegativeIntegers everywhere) would have similar issues. Fix it at the lowest possible level.

I like the suggested sh:datatypeByValue property.

@TallTed
Copy link
Member

TallTed commented Dec 3, 2021

[@tpluscode] given a property shape with sh:datatype xsd:int, it will fail the check against the a value ""10"^^xsd:integer even though the type xsd:int is subtype of xsd:integer

I'm surprised no-one else flagged this, but I think your example is backwards.

That is, even were subtype inferences active, it would be correct for ""10"^^xsd:integer to fail against sh:datatype xsd:int, because while all xsd:int are xsd:integer, some xsd:integer are not xsd:int, as that is the definition of a subtype.

On the other hand, subtype inferencing might enable ""10"^^xsd:int to succeed against sh:datatype xsd:integer.

@afs
Copy link

afs commented Dec 4, 2021

@TallTed I see your point but it can also be seen as a use case for validation by value.

F&O does not treat derived types of integers specially and any operation returns xsd:integers, "0"^^xsd:int+"10"^^xsd:int is "10"^^xsd:integer.

@HolgerKnublauch HolgerKnublauch added the SHACL 1.1 Marks items that are reasonably significant changes for a potential SHACL 1.1 label Feb 12, 2024
@VladimirAlexiev
Copy link

VladimirAlexiev commented Dec 3, 2024

I think stores that do not return exactly the datatypes written to them are "broken". Surely there are some scenarios (IoT, constrained sensors) where infinite length integers are not implemented, and you want to talk to them in terms of bytes or ints.
If all whole numbers are "upgraded" to integer by the store, then you need some extra layer to downgrade those to bytes or ints before sending to the constrained device.

@tpluscode argues about byte and int being sub-datatypes of integer. But that itself is a sub-datatype of decimal. So why the same stores don't further "normalize" all integers to decimals?

XSD had good reasons to define all these varieties of numbers. For a much more unpleasant alternative, look at JSON and schema.org that have just one badly underspecified "number".

So I think that stores that collapse part of the XSD hierarchy on save, are doing the wrong thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SHACL 1.1 Marks items that are reasonably significant changes for a potential SHACL 1.1
Projects
None yet
Development

No branches or pull requests

5 participants