datatype inference #5

VladimirAlexiev · 2024-03-26T14:39:13Z

A recent paper:

R2RML and the original RML specification defined that RML processors can perform data type inference from the SQL databases. Thus, mappings did not have to specify rr:datatype for RDF Literals to have the correct data type as the processor would retrieve this automatically from the SQL database.
However, RML did not expand this to other heterogeneous datasources such as XML or JSON which both provide data types in different ways: XML schemas, native JSON types, etc. Data type inference is still under discussion but might be moved to RML-IO because this RML module focuses on accessing and iterating over the data source.

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.
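For context, the SQL behaviour the quoted passage describes is the R2RML "natural mapping" from SQL column types to XSD datatypes. A minimal sketch of it as a lookup table follows; the names `NATURAL_MAPPING` and `natural_datatype` are hypothetical and the table is abbreviated, so treat it as an illustration rather than a processor implementation.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Abbreviated version of the R2RML natural mapping (SQL type -> XSD local name).
# Character string types are deliberately absent: they become plain literals.
NATURAL_MAPPING = {
    "BINARY": "hexBinary",
    "VARBINARY": "hexBinary",
    "NUMERIC": "decimal",
    "DECIMAL": "decimal",
    "SMALLINT": "integer",
    "INTEGER": "integer",
    "BIGINT": "integer",
    "FLOAT": "double",
    "REAL": "double",
    "DOUBLE PRECISION": "double",
    "BOOLEAN": "boolean",
    "DATE": "date",
    "TIME": "time",
    "TIMESTAMP": "dateTime",
}


def natural_datatype(sql_type):
    """Return the XSD datatype IRI for a SQL type, or None for a plain literal."""
    local_name = NATURAL_MAPPING.get(sql_type.strip().upper())
    return XSD + local_name if local_name else None


print(natural_datatype("integer"))
print(natural_datatype("VARCHAR"))
```

The open question in this issue is what the equivalent table would look like for sources that are not SQL.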

Here are a couple of considerations:

- For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.
  - XML attributes and text content are always strings, so there's no place for implicit types, right?
  - One can specialize XSD types using restrictions and extensions, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond the scope of RML.
  - XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (e.g. Person) to elements. However, I don't think we should go there.
- For JSON, keep in mind that it does not define what a number is, which leads to a number of unpleasant surprises in JSON-LD. E.g. 12345678901234567890 is not an xsd:integer, and small decimals like 12.3 can be treated as float/double (e.g. 1.23e1) at will. So I'm not sure what can be tested here.
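The JSON point can be illustrated with a short sketch. It assumes Python's standard `json` module and uses `parse_int=float` to imitate a parser that stores every number as an IEEE 754 double (as JavaScript-based tooling typically does); the variable names are only for illustration.

```python
import json

big = "12345678901234567890"

# Python keeps integers exact because its int type has arbitrary precision.
exact = json.loads(big)

# A double-only parser cannot represent this value exactly.
as_double = json.loads(big, parse_int=float)

# The same number written in two lexical forms parses to the same double,
# so the original notation cannot be recovered after parsing.
plain = json.loads("12.3")
scientific = json.loads("1.23e1")

print(exact)
print(int(as_double))
print(plain == scientific)
```

Two parsers can therefore hand different values, and different apparent types, to an RML processor for the same JSON text, which is what makes portable test cases hard to write.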


DylanVanAssche · 2024-03-26T15:22:32Z

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

That issue discusses where the test cases should live: data type extraction from data sources such as SQL is described in the Core spec, while it might fit better in the IO spec.

For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.

I agree here. The question is how implementations should extract this, given that XML can have separate XSD schemas etc.
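As a rough sketch of the simplest case only: if the type is declared inline on the element through the standard `xsi:type` attribute, it can be read without loading a schema. This is an assumption about where the type lives; types declared in a separate XSD file are not covered. The helper `inline_datatype` is hypothetical, and it naively assumes the prefix in the attribute value is bound to the XSD namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace}local form.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
XSD = "http://www.w3.org/2001/XMLSchema#"

SAMPLE = """<person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <age xsi:type="xsd:integer">42</age>
  <name>Alice</name>
</person>"""


def inline_datatype(element):
    """Return a datatype IRI from an inline xsi:type attribute, or None."""
    declared = element.get(XSI_TYPE)
    if declared is None:
        return None
    # Assumption: the prefix refers to the XSD namespace; only the local part is used.
    local_name = declared.split(":")[-1]
    return XSD + local_name


root = ET.fromstring(SAMPLE)
age_type = inline_datatype(root.find("age"))
name_type = inline_datatype(root.find("name"))
print(age_type, name_type)
```

Resolving types from an external schema would need a schema-aware library and a rule for locating the schema, which is the harder part of the question.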

for JSON, keep in mind that it does not define what is a number, which leads to a number of unpleasant surprises in JSON-LD. Eg 12345678901234567890 is not a xsd:integer, and small decimals like 12.3 can be treated as float/double (eg 1.23e1) at will. So I'm not sure what can be tested here.

Interesting... I wonder why we cannot indicate a number as int for integers and double for floating-point numbers?
JSON has a native number type, but maybe it does not differentiate between float and integer here?

VladimirAlexiev · 2024-03-27T05:54:01Z

@DylanVanAssche Correct: JSON has just "number".
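A minimal sketch of what a per-value inference could look like, assuming the processor works on values already parsed by Python's `json` module. The function `infer_xsd_datatype` is hypothetical and not taken from any RML specification. The integer/double split it produces comes from the parser inspecting the lexical form (`1` versus `1.0`), not from JSON itself, so a parser that stores all numbers as doubles would not give the same result.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"


def infer_xsd_datatype(value):
    """Guess an XSD datatype IRI for a parsed JSON value; None if undecided."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        return XSD + "double"
    if isinstance(value, str):
        return XSD + "string"
    return None  # null, arrays and objects are left to the mapping


doc = json.loads('{"a": 1, "b": 1.0, "c": true, "d": "x", "e": null}')
inferred = {key: infer_xsd_datatype(val) for key, val in doc.items()}
print(inferred)
```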

pmaria · 2024-03-27T06:16:28Z

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

Yes, the discussion is somewhat hidden, but "natural mapping of values" is definitely being discussed.
The proposed plan is to introduce separate documents per reference formulation wherein this can be specified.

See:

Here are a couple of considerations:

For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.

XML attributes and text content are always strings, so there's no place for implicit types, right?

One can specialize XSD types using restrictions and extension, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond scope of RML

XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (eg Person) to elements. However, I don't think we should go there.

for JSON, keep in mind that it does not define what is a number, which leads to a number of unpleasant surprises in JSON-LD. Eg 12345678901234567890 is not a xsd:integer, and small decimals like 12.3 can be treated as float/double (eg 1.23e1) at will. So I'm not sure what can be tested here.

Thanks for this. Once we have specified this, it would be great to have a review from you and other experts in the community, @VladimirAlexiev.

DylanVanAssche · 2024-06-20T08:09:01Z

@bjdmeest Shouldn't this be moved to rml-io-registry?

DylanVanAssche added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Mar 26, 2024

DylanVanAssche added the working-group (Issues to address in the WG) label on Jun 19, 2024

DylanVanAssche transferred this issue from kg-construct/rml-io Dec 10, 2024
