-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SHACL regex patterns follow JSON Schema ones #517
Conversation
@@ -259,8 +259,8 @@ aas:BlobShape a sh:NodeShape ; | |||
sh:maxCount 1 ; | |||
sh:minLength 1 ; | |||
sh:maxLength 100 ; | |||
sh:pattern "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$" ; | |||
sh:pattern "^([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+([ ]*;[ ]*([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&'*+\\-.^_`|~0-9a-zA-Z])+|\"(([ !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([ !-~]|[\\x80-\\xff]))*\"))*$" ; | |||
sh:pattern "^[\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd\\U00010000-\\U0010ffff]*$" ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to use what we use in jsonschema? Java can't handle this pattern
aas-core-codegen/test_data/jsonschema/test_main/aas_core_meta.v3/expected_output/schema.json
Line 265 in c8bf07d
"pattern": "^([\\x09\\x0a\\x0d\\x20-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll use the exactly same pattern as in Java.
c8bf07d
to
f3bb96d
Compare
@mhrimaz can you please double-check again? Now I made the SHACL schema exactly follow the Java regex dialect. |
@mristin Thanks for working on this. So just to recap, instead of |
I think I forgot to push the latest changes from my local repository. Fixing it tomorrow. |
f3bb96d
to
a7025ba
Compare
@mhrimaz can you please test now? I forgot to add files to the commit. The patterns are now exactly the same as the ones we used in Java. |
@mristin in java it worked fine. However, it seems that now the pattern is not python friendly. |
@mhrimaz you can not have a pattern that works both in Python and Java. Java uses If we use the same patterns as in JSON Schema, then they will not work in Python as Python uses UTF-32 to encode strings in memory, but we assumed UTF-16 encoding in JSON schema ( It probably makes most sense to pick Java as SHACL is mostly used by Java folks? |
We have SHACL validator in various platforms like python (pySHACL) and C# (dotNetRDF), ... I can't really say SHACL is only for java, but certainly more popular. I am not a regex expert, so if we should choose between java and python in this case, I would go with java. But it would be good to maybe have 2 different files, or atleast have a documentation about this somewhere. FYI, i tested it with Apache Jena validator and pySHACL. |
@mhrimaz can you please ask around a bit before we settle for Java? Btw., can Java handle patterns we used in JSON Schema? @sebbader-sap @sebbader what is your take on this? |
The schema that is in 92bcf9e works in python. However, I am not sure if the regex works as intended and covers every corner cases. You can take any valid/invalid example, concatenate it with the ontology, and then pass it to a validator. If you want to test in python
if you want to test in java, first download the Apache Jena Commands. You can invoke shacl which is within bin/bat folder:
For online validation you can use https://shacl-play.sparna.fr/play/validate or https://www.itb.ec.europa.eu/shacl/any/upload Github doesn't allow .ttl file, change the file format to ttl. Note that the SHACL has also another critical issue related to |
@mhrimaz wrote:
No, the schema does not cover the edge cases in Python. The regex pattern operates on UTF-16 strings, but Python encodes characters as UTF-32. Hence, if a string contains no character above 55,639, it should work (no smileys, no Korean characters etc.). For more info, see: https://en.wikipedia.org/wiki/Plane_(Unicode) Other than this, @mhrimaz, should I merge then? |
(
?) |
We included the regex pattern as-is from the input which caused problems with the regex engines as the patterns in the meta-model are written in a Python dialect (and assuming that the regex engine works on UTF-32 characters). However, most regex engines in the wild operating on SHACL (*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do not support some parts of the Python regex dialect. For example, in the input meta-model specification, we omit the minimum bound 0 (*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath the SHACL validator. Instead, with this patch, we parse the pattern from the specification and re-render it into the form that we also use in JSON Schema. We pick JSON Schema regex dialect as most SHACL validators in the wild can deal with it, in particular those based on Java as a platform. Hence, we decide to serve this user base with priority. Discovered in [aas-core-meta issue 342]. [aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
92bcf9e
to
3d99500
Compare
@mhrimaz once we fix #519 , we can update the SHACL and RDF schemas in https://github.com/admin-shell-io/aas-specs. |
We included the regex pattern as-is from the input which caused problems with the regex engines as the patterns in the meta-model are written in a Python dialect (and assuming that the regex engine works on UTF-32 characters). However, most regex engines in the wild operating on SHACL (e.g., Java SHACL validators) use UTF-16 to represent the text and do not support some parts of the Python regex dialect. For example, in the input meta-model specification, we omit the minimum bound 0 (e.g.,
{,4}
), which breaks with the Java regex engine beneath the SHACL validator.Instead, with this patch, we parse the pattern from the specification and re-render it into the form that we also use in JSON Schema. We pick JSON Schema regex dialect as most SHACL validators in the wild can deal with it, in particular those based on Java as a platform. Hence, we decide to serve this user base with priority.
Discovered in aas-core-meta issue 342.