Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SHACL regex patterns follow JSON Schema ones #517

Merged
merged 1 commit into from
Aug 29, 2024

Conversation

mristin
Copy link
Contributor

@mristin mristin commented Aug 22, 2024

We included the regex pattern as-is from the input which caused problems with the regex engines as the patterns in the meta-model are written in a Python dialect (and assuming that the regex engine works on UTF-32 characters). However, most regex engines in the wild operating on SHACL (e.g., Java SHACL validators) use UTF-16 to represent the text and do not support some parts of the Python regex dialect. For example, in the input meta-model specification, we omit the minimum bound 0 (e.g., {,4}), which breaks with the Java regex engine beneath the SHACL validator.

Instead, with this patch, we parse the pattern from the specification and re-render it into the form that we also use in JSON Schema. We pick JSON Schema regex dialect as most SHACL validators in the wild can deal with it, in particular those based on Java as a platform. Hence, we decide to serve this user base with priority.

Discovered in aas-core-meta issue 342.

@@ -259,8 +259,8 @@ aas:BlobShape a sh:NodeShape ;
sh:maxCount 1 ;
sh:minLength 1 ;
sh:maxLength 100 ;
sh:pattern "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$" ;
sh:pattern "^([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+([ ]*;[ ]*([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+|\"(([ !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([ !-~]|[\\x80-\\xff]))*\"))*$" ;
sh:pattern "^[\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd\\U00010000-\\U0010ffff]*$" ;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use what we use in jsonschema? Java can't handle this pattern

"pattern": "^([\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll use the exactly same pattern as in Java.

@mristin mristin force-pushed the mristin/Include-zeros-in-SHACL-regex-bounds branch from c8bf07d to f3bb96d Compare August 26, 2024 04:26
@mristin
Copy link
Contributor Author

mristin commented Aug 26, 2024

@mhrimaz can you please double-check again? Now I made the SHACL schema exactly follow the Java regex dialect.

@mristin mristin changed the title Inlcude zeros in SHACL regex bounds Make SHACL regex patterns follow Java Aug 26, 2024
@mhrimaz
Copy link

mhrimaz commented Aug 26, 2024

@mristin Thanks for working on this.
I can't spot any changes. I think the shacl-schema.ttl is not updated.

So just to recap, instead of "^[\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd\\U00010000-\\U0010ffff]*$"
we should have "^([\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

@mristin
Copy link
Contributor Author

mristin commented Aug 26, 2024

@mristin Thanks for working on this. I can't spot any changes. I think the shacl-schema.ttl is not updated.

So just to recap, instead of "^[\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd\\U00010000-\\U0010ffff]*$" we should have "^([\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

I think I forgot to push the latest changes from my local repository. Fixing it tomorrow.

@mristin mristin force-pushed the mristin/Include-zeros-in-SHACL-regex-bounds branch from f3bb96d to a7025ba Compare August 27, 2024 06:55
@mristin
Copy link
Contributor Author

mristin commented Aug 27, 2024

@mhrimaz can you please test now? I forgot to add files to the commit. The patterns are now exactly the same as the ones we used in Java.

@mhrimaz
Copy link

mhrimaz commented Aug 27, 2024

@mristin in java it worked fine. However, it seems that now the pattern is not python friendly.
This is what we have in JSON Schema:
"^([\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
This is what is generated now:
"^[\\x{09}\\x{0a}\\x{0d}\\x{20}-\\x{d7ff}\\x{e000}-\\x{fffd}\\x{10000}-\\x{10ffff}]*$"
I am not sure if these two patterns are identical or not, but it doesn't work in python:
image

@mristin
Copy link
Contributor Author

mristin commented Aug 27, 2024

@mhrimaz you can not have a pattern that works both in Python and Java. Java uses \x{AB}, \x{ABCD} and \x{ABCD0123} syntax, Python uses \xAB, \uABCD and \UABCD0123.

If we use the same patterns as in JSON Schema, then they will not work in Python as Python uses UTF-32 to encode strings in memory, but we assumed UTF-16 encoding in JSON schema (\uABCD\u0123 in JSON Schema equals \UABCD0123 in Python).

It probably makes most sense to pick Java as SHACL is mostly used by Java folks?

@mhrimaz
Copy link

mhrimaz commented Aug 27, 2024

We have SHACL validator in various platforms like python (pySHACL) and C# (dotNetRDF), ... I can't really say SHACL is only for java, but certainly more popular.

I am not a regex expert, so if we should choose between java and python in this case, I would go with java. But it would be good to maybe have 2 different files, or atleast have a documentation about this somewhere.

FYI, i tested it with Apache Jena validator and pySHACL.

@mristin
Copy link
Contributor Author

mristin commented Aug 27, 2024

@mhrimaz can you please ask around a bit before we settle for Java? Btw., can Java handle patterns we used in JSON Schema?

@sebbader-sap @sebbader what is your take on this?

@mristin
Copy link
Contributor Author

mristin commented Aug 27, 2024

@mhrimaz I generated the patterns now the same as in JSON Schema -- can you please test with the SHACL schema from 92bcf9e?

@mhrimaz
Copy link

mhrimaz commented Aug 27, 2024

The schema that is in 92bcf9e works in python. However, I am not sure if the regex works as intended and covers every corner cases.

You can take any valid/invalid example, concatenate it with the ontology, and then pass it to a validator.

If you want to test in python

pip3 install pyshacl
pyshacl -s .\shacl-schema.ttl -a -f turtle .\invalid.ttl

image

if you want to test in java, first download the Apache Jena Commands. You can invoke shacl which is within bin/bat folder:

<path to jena shacl script in bin or bat folder/shacl> validate -s shacl-schema-1.ttl -d invalid.txt

image

For online validation you can use https://shacl-play.sparna.fr/play/validate or https://www.itb.ec.europa.eu/shacl/any/upload

Github doesn't allow .ttl file, change the file format to ttl.
invalid.txt
maximal.txt

Note that the SHACL has also another critical issue related to DataSpecificationIec61360.
SHACL states that a rdf:type aas:LangStringDefinitionTypeIec61360 ; should have the following attribute:
<https://admin-shell.io/aas/3/0/LangStringDefinitionTypeIec61360/text>
However in samples it have <https://admin-shell.io/aas/3/0/AbstractLangString/text>. But this is another issue anyways.

@mristin
Copy link
Contributor Author

mristin commented Aug 28, 2024

@mhrimaz wrote:

The schema that is in 92bcf9e works in python. However, I am not sure if the regex works as intended and covers every corner cases.

No, the schema does not cover the edge cases in Python. The regex pattern operates on UTF-16 strings, but Python encodes characters as UTF-32. Hence, if a string contains no character above 55,639, it should work (no smileys, no Korean characters etc.).

For more info, see: https://en.wikipedia.org/wiki/Plane_(Unicode)

Other than this, @mhrimaz, should I merge then?

@mristin
Copy link
Contributor Author

mristin commented Aug 28, 2024

(
@mhrimaz can you please create another issue on aas-core-codegen for:

Note that the SHACL has also another critical issue related to DataSpecificationIec61360.
SHACL states that a rdf:type aas:LangStringDefinitionTypeIec61360 ; should have the following attribute:
https://admin-shell.io/aas/3/0/LangStringDefinitionTypeIec61360/text
However in samples it have https://admin-shell.io/aas/3/0/AbstractLangString/text. But this is another issue anyways.

?)

@mhrimaz
Copy link

mhrimaz commented Aug 28, 2024

Other than this, @mhrimaz, should I merge then?

@mristin Yes, now it is better than before, thanks a lot.

@mhrimaz can you please create another issue on aas-core-codegen for

I will.

We included the regex pattern as-is from the input which caused problems
with the regex engines as the patterns in the meta-model are written in
a Python dialect (and assuming that the regex engine works on UTF-32
characters). However, most regex engines in the wild operating on SHACL
(*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do
not support some parts of the Python regex dialect. For example, in
the input meta-model specification, we omit the minimum bound 0
(*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator.

Instead, with this patch, we parse the pattern from the specification
and re-render it into the form that we also use in JSON Schema. We pick
JSON Schema regex dialect as most SHACL validators in the wild can deal
with it, in particular those based on Java as a platform. Hence, we
decide to serve this user base with priority.

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
@mristin mristin force-pushed the mristin/Include-zeros-in-SHACL-regex-bounds branch from 92bcf9e to 3d99500 Compare August 29, 2024 06:23
@mristin mristin changed the title Make SHACL regex patterns follow Java Make SHACL regex patterns follow JSON Schema ones Aug 29, 2024
@mristin mristin merged commit e22ccae into main Aug 29, 2024
5 checks passed
@mristin mristin deleted the mristin/Include-zeros-in-SHACL-regex-bounds branch August 29, 2024 06:24
@mristin
Copy link
Contributor Author

mristin commented Aug 29, 2024

@mhrimaz once we fix #519 , we can update the SHACL and RDF schemas in https://github.com/admin-shell-io/aas-specs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants