datatype inference #5

Open
VladimirAlexiev opened this issue Mar 26, 2024 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed), working-group (Issues to address in the WG)

Comments

@VladimirAlexiev

A recent paper:

R2RML and the original RML specification defined that RML processors can perform data type inference from the SQL databases. Thus, mappings did not have to specify rr:datatype for RDF Literals to have the correct data type as the processor would retrieve this automatically from the SQL database.
However, RML did not expand this to other heterogeneous datasources such as XML or JSON which both provide data types in different ways: XML schemas, native JSON types, etc. Data type inference is still under discussion but might be moved to RML-IO because this RML module focuses on accessing and iterating over the data source.

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

Here are a couple of considerations:

  • For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.
    • XML attributes and text content are always strings, so there's no place for implicit types, right?
    • One can specialize XSD types using restrictions and extension, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond scope of RML
    • XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (eg Person) to elements. However, I don't think we should go there.
  • For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.
@DylanVanAssche

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

That issue discusses where the test cases should live: data type extraction from data sources such as SQL is mentioned in the Core spec, but it might fit better in the IO spec.
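For reference, the SQL case can be sketched as a simple lookup table. This is only an illustration: the table follows the natural mapping of SQL datatypes in the R2RML Recommendation as I read it, while the names `NATURAL_MAPPING` and `natural_datatype` and the simplified type-name normalization are my own assumptions, not part of any RML spec.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# SQL type name -> XSD datatype IRI (after the R2RML natural mapping).
# Types that are absent here (e.g. character strings) get no datatype.
NATURAL_MAPPING = {
    "BINARY": XSD + "hexBinary",
    "BINARY VARYING": XSD + "hexBinary",
    "BINARY LARGE OBJECT": XSD + "hexBinary",
    "NUMERIC": XSD + "decimal",
    "DECIMAL": XSD + "decimal",
    "SMALLINT": XSD + "integer",
    "INTEGER": XSD + "integer",
    "BIGINT": XSD + "integer",
    "FLOAT": XSD + "double",
    "REAL": XSD + "double",
    "DOUBLE PRECISION": XSD + "double",
    "BOOLEAN": XSD + "boolean",
    "DATE": XSD + "date",
    "TIME": XSD + "time",
    "TIMESTAMP": XSD + "dateTime",
}


def natural_datatype(sql_type):
    """Return the XSD datatype IRI for a SQL column type, or None for a plain literal.

    Hypothetical helper: it only uppercases the name and drops
    parameters such as the "(10,2)" in "DECIMAL(10,2)".
    """
    name = sql_type.split("(")[0].strip().upper()
    return NATURAL_MAPPING.get(name)


print(natural_datatype("decimal(10,2)"))
print(natural_datatype("VARCHAR(20)"))
```

A processor can do this because the database catalog exposes the column type; the open question in this thread is what plays that role for XML and JSON.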

For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.

I agree here. The question is how implementations should extract this, given that XML can have separate XSD schemas, etc.
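To make the easy half of that question concrete, here is a minimal sketch that only covers types declared inline in the instance document. It assumes the type is given through an `xsi:type` attribute (XML Schema instance namespace) and does not look at separate XSD schemas at all, which is the hard part. The function name `literal_datatypes`, the fallback to xsd:string, and the way a prefixed name is turned into an IRI are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def literal_datatypes(xml_text):
    """Return (tag, text, datatype IRI) for every leaf element.

    Simplification: prefix declarations are collected globally,
    ignoring XML namespace scoping.
    """
    prefixes = {}
    results = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        elem = payload
        if len(elem):  # has child elements, so it is not a literal value
            continue
        datatype = XSD_STRING  # assumed default when nothing is declared
        declared = elem.get(XSI_TYPE)
        if declared and ":" in declared:
            prefix, local = declared.split(":", 1)
            uri = prefixes.get(prefix)
            if uri:
                # Heuristic: join namespace and local name with "#" unless
                # the namespace already ends in a separator.
                sep = "" if uri.endswith(("#", "/")) else "#"
                datatype = uri + sep + local
        results.append((elem.tag, elem.text, datatype))
    return results


doc = """<person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:geo="http://www.opengis.net/ont/geosparql#">
  <name>Alice</name>
  <age xsi:type="xsd:integer">42</age>
  <loc xsi:type="geo:wktLiteral">POINT(4.4 51.2)</loc>
</person>"""

for row in literal_datatypes(doc):
    print(row)
```

Anything declared only in an external schema would need a schema-aware parser, so this sketch says nothing about that case.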

For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.

Interesting... I wonder why we cannot indicate a number as int for integers and double for floating point numbers?
JSON has a native number type, but maybe it does not differentiate between float and integer here?

@DylanVanAssche added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Mar 26, 2024
@VladimirAlexiev
Author

@DylanVanAssche Correct: JSON has just "number".
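A small sketch of what that means in practice, using Python's standard `json` module. Any int/float split comes from the parser, not from JSON itself: Python decides from the lexical form, while a parser that stores every number as an IEEE 754 double (as JavaScript does) is emulated here with `parse_int=float`. The sample document and variable names are made up for illustration.

```python
import json

doc = '{"big": 12345678901234567890, "a": 12.3, "b": 1.23e1, "c": 1, "d": 1.0}'

# Python's parser chooses int or float from how the number is written.
native = json.loads(doc)
kinds = {key: type(value).__name__ for key, value in native.items()}
print(kinds)

# 12.3 and 1.23e1 are the same JSON number, so the original spelling is lost.
print(native["a"] == native["b"])

# Emulate a double-only parser: every integer token becomes a float.
as_double = json.loads(doc, parse_int=float)
print(int(as_double["big"]))  # no longer equal to the value in the document
print(as_double["c"] == as_double["d"])  # 1 and 1.0 are indistinguishable
```

So a test case that expects xsd:integer for `1` and xsd:decimal or xsd:double for `1.0` would be testing the parser of one implementation rather than the data.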

@pmaria
Collaborator

pmaria commented Mar 27, 2024

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

Yes, the discussion is somewhat hidden, but "natural mapping of values" is definitely being discussed.
The proposed plan is to introduce separate documents per reference formulation wherein this can be specified.

See:

Here are a couple of considerations:

  • For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.

    • XML attributes and text content are always strings, so there's no place for implicit types, right?
    • One can specialize XSD types using restrictions and extension, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond scope of RML
    • XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (eg Person) to elements. However, I don't think we should go there.
  • For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.

Thanks for this. Once we have specified it, a review from you and other experts in the community would be great, @VladimirAlexiev.

@DylanVanAssche added the working-group (Issues to address in the WG) label Jun 19, 2024
@DylanVanAssche

@bjdmeest Shouldn't this be moved to rml-io-registry?

@DylanVanAssche transferred this issue from kg-construct/rml-io Dec 10, 2024