Storing Documents: Non JSON-LD Content dropped #38

Open
JonathanJustavino opened this issue Apr 22, 2024 · 34 comments
Labels
enhancement New feature or request pick up first high priority question Further information is requested

Comments

@JonathanJustavino

Currently, when saving a document, content that cannot be parsed to RDF by Jena is dropped
(e.g. JSON content that is not JSON-LD is dropped before the file is stored).
It would be nice if the git part of gstore stored the original document,
and any RDF content converted from the document were stored in the triple store.

@JonathanJustavino JonathanJustavino added the pick up first high priority label Apr 22, 2024
@manonthegithub
Collaborator

We need to discuss this. It may lead to problems in Databus: if someone posts an invalid document, it would still be saved to git, which is not cool. So we need to develop rules, or maybe add some additional param (like is_rdf_data=true).

@holycrab13
Contributor

Not a problem for Databus: all inputs are validated on the Databus side before saving.

@holycrab13
Contributor

Invalid documents should be rejected (bad JSON syntax) but all JSON is somewhat JSON-LD - even if it's just an empty graph. A long JSON document might still contain 2-3 triples as JSON-LD, we can't lose the entire rest of the document though.

Accept only if parsable of course, but Jena will usually just ignore anything that isn't LD in any JSON. Triple store then holds all triples, git has the full docs with unmodified content.

This is a hard requirement for the Gstore to serve as a database for MOSS and the OEF, not all of their JSON content is JSON-LD. Also all documents need to be saved "as is" - so after the validation, the JSON document should be saved as it came and not the JSON-LD printed out of the model.

@manonthegithub
Collaborator

Ahhh, so you want to store only JSON? I thought just any kind of file... Yes, there is also some postprocessing of the JSON-LD documents: they are stored in minimised format and contain only the URI of the context, not the full context.

@manonthegithub
Collaborator

So that still will be json/jsonlds... I see now... this does not seem to be a problem

@manonthegithub
Collaborator

So it would look like this:
you have mixed JSON/JSON-LD, and you want the JSON-LD part saved to the triple store and the whole document also saved to git.

if the JSON is invalid (and therefore the whole document is invalid) -> we reject the whole document
if the JSON-LD part is invalid (wrong syntax/parser error) -> we reject the whole document (save nowhere)
if the JSON-LD part is valid -> we save

Is this how you want it? @holycrab13
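
The three rules above can be sketched as a small decision function. This is only an illustrative Python sketch, not gstore's actual code: `decide_save`, `toy_extractor` and the `extract_triples` callback are hypothetical names, and the callback stands in for whatever JSON-LD parser (e.g. Jena) does the real extraction.

```python
import json
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def decide_save(body: str, extract_triples: Callable[[object], List[Triple]]) -> dict:
    """Apply the accept/reject rules to one uploaded document.

    extract_triples is a hypothetical stand-in for the real JSON-LD parser;
    it must raise ValueError when the JSON-LD part cannot be parsed.
    """
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # Rule 1: invalid JSON -> reject the whole document.
        return {"accepted": False, "reason": "invalid JSON"}

    try:
        triples = extract_triples(doc)
    except ValueError:
        # Rule 2: the JSON-LD part is invalid -> reject, save nowhere.
        return {"accepted": False, "reason": "invalid JSON-LD"}

    # Rule 3: valid -> the original body goes to git, the triples to the triple store.
    return {"accepted": True, "git": body, "triples": triples}


def toy_extractor(doc: object) -> List[Triple]:
    """Toy parser for demonstration only: one triple per '@id' + 'name' pair."""
    if not isinstance(doc, dict):
        return []
    if "@id" in doc and not isinstance(doc["@id"], str):
        raise ValueError("@id must be a string")
    if "@id" in doc and "name" in doc:
        return [(doc["@id"], "name", str(doc["name"]))]
    return []
```

Note that in the accepted case the `git` entry is the unmodified request body, not a re-serialised model, which is the "as is" requirement discussed earlier in the thread.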

@JJ-Author
Contributor

JJ-Author commented May 11, 2024

In case you integrate this, please really make it configurable/switchable and non-default. g-store is supposed to be a graph store with a simple git history of the graphs, not a JSON store^^.

@manonthegithub I think they are asking you to store the JSON in git as is, i.e. without stripping non-LD content and without normalizing it.

In case you accept a JSON file that does not contain any LD (and so leads to no triples),
you need to think about the read (and delete) API calls, because this file is invisible from the SPARQL endpoint and the read call would also just return nothing. So how do you want to read the plain JSON, @holycrab13 @JonathanJustavino? I think you would need new API functions.

It also seems interesting that there is no API function to get all files or at least the history of a file, but I guess we never needed it so far^^.

@JJ-Author JJ-Author added question Further information is requested enhancement New feature or request labels May 11, 2024
@manonthegithub
Collaborator

@JJ-Author to see all files you can go to file browser which is included in gstore (it is at /file path)
the other calls are not there yet, true :)

@holycrab13
Contributor

> So it would look like this: you have mixed JSON/JSON-LD, and you want the JSON-LD part saved to the triple store and the whole document also saved to git.
>
> if the JSON is invalid (and therefore the whole document is invalid) -> we reject the whole document
> if the JSON-LD part is invalid (wrong syntax/parser error) -> we reject the whole document (save nowhere)
> if the JSON-LD part is valid -> we save
>
> Is this how you want it? @holycrab13

Yes, that's it.

@JJ-Author either /file, but the /g path should also just return the document imo. This would probably require handling JSON-LD differently from any other RDF syntax.

@manonthegithub
Collaborator

manonthegithub commented May 14, 2024

Hmmmm... I gave it some more thought. This feature seems like a dirty, hacky solution for one particular problem; it does not really fit into the concept of gstore, it is just easier to implement this way in the current case.
I mean, gstore is not supposed to treat JSON differently from the other formats. If we make this change we will have inconsistent behaviour, which will for sure lead to problems in the future.
I would discuss in more detail why you want this and what the possible alternative solutions are.
My question is: why can't we just convert this non-LD content to LD content so that consistency is preserved? (It may be some extension of DataId or some custom ontology.)

@manonthegithub
Collaborator

Why you would mix both formats in the first place is also not clear to me. Looks like a design issue.
I would red-flag this kind of thing and recommend that the guys doing it think again about the design.

@manonthegithub
Collaborator

We could potentially implement a method that allows storing non-RDF content in gstore too (just saving it in git), but separately,
so that JSON or whatever other stuff can be separated from RDF.

@manonthegithub
Collaborator

It can actually be the same method for save and read, just checking whether the content is RDF: if yes, parse the RDF and put it into Virtuoso; if not, just save to git, and voilà.

@holycrab13
Contributor

We have the case where the document is 20% RDF and the rest just JSON, so a hard separation won't work in this case. I think it would be best to save to git and then on the virtuoso side create a graph for the document and throw in whatever is parsable RDF. A non RDF document will just end up with an empty graph.

@holycrab13
Contributor

holycrab13 commented May 15, 2024

> Why you would mix both formats in the first place is also not clear to me. Looks like a design issue. I would red-flag this kind of thing and recommend that the guys doing it think again about the design.

While it's not great, it's still somewhat valid and all JSON-LD parsers can deal with it. I think there is no real reason not to support it.

@manonthegithub
Collaborator

manonthegithub commented May 15, 2024

@holycrab13

> We have the case where the document is 20% RDF and the rest just JSON, so a hard separation won't work in this case. I think it would be best to save to git and then on the virtuoso side create a graph for the document and throw in whatever is parsable RDF. A non RDF document will just end up with an empty graph.

I don't see why the separation won't work. It doesn't matter what % is which part; you don't do it manually.
In the first place I would ask the guys not to mix the formats. Do they have any reasoning for it? Could you ask the guys to send RDF? They can convert non-RDF fields to RDF.

> While it's not great, it's still somewhat valid and all JSON-LD parsers can deal with it. I think there is no real reason not to support it.

It is valid in a sense you get from gstore. If you claim the format to be jsonld then only json-ld is saved. If you claim the format to be json then you save json, but it is not parsed as json-ld. It is just a coincidence that json-ld is also a valid json.

One more thing I could recommend, if they still want to do it, is to have one field containing the whole JSON-LD document alongside the other fields. Then you could easily take the JSON-LD part and save it to gstore, and for the plain JSON we can create a new method that allows saving non-RDF data.

Like this:

{
  "bla": "bla",
  "bla2": "bla2",
  "jsonld": <here is full json-ld object>
}
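
A rough sketch of how such a wrapper document could be split into its two parts, in Python. The field name `jsonld` comes from the example above; the function name `split_wrapper` and the sample values are hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Dict, Optional, Tuple


def split_wrapper(body: str, field: str = "jsonld") -> Tuple[Dict[str, Any], Optional[Any]]:
    """Separate the embedded JSON-LD object from the plain JSON fields.

    Returns (plain_fields, jsonld_part); jsonld_part is None when the
    wrapper has no such field.
    """
    doc = json.loads(body)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    plain = {key: value for key, value in doc.items() if key != field}
    return plain, doc.get(field)


# Sample wrapper; the JSON-LD content is made up for this example.
wrapper = json.dumps({
    "bla": "bla",
    "bla2": "bla2",
    "jsonld": {"@id": "http://example.org/thing", "http://example.org/name": "Thing"},
})
plain, ld = split_wrapper(wrapper)
print(plain)  # {'bla': 'bla', 'bla2': 'bla2'}
```

With this layout the `ld` part could go through the normal RDF path while `plain` is handled by a separate non-RDF method, which is the separation proposed in this comment.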

In general, I just think we can find a better solution than mixing the formats. I am quite certain that if we do it now we will get some issues in the future. Better just to support some new formats...

If the above-mentioned is not possible, one possible workaround could be to make up a new custom media type for this and require clients to specify it, something like application/jsonld+json. Then we could potentially make a separate implementation for working with these documents that will not interfere with normal JSON-LD documents.

@holycrab13
Contributor

holycrab13 commented May 22, 2024

"It is just a coincidence that json-ld is also a valid json." It is not a coincidence; this is in the definition. Mixing is done often, and converting everything to JSON-LD is not viable for the client, since there are a LOT of JSON fields.
The formats ARE usually mixed, no need to make something up here. It does not interfere with anything JSON-LD, therefore I do not really see a problem with implementing it.

This is a hard requirement that we need for MOSS and DLR

@JJ-Author
Contributor

JJ-Author commented May 23, 2024

I think persisting the JSON makes sense especially when you have an external JSON-LD context that might change over time.
Then you can reapply the extended context later and rebuild/reload the graphs. We actually also had this use case in lod-geoss in the beginning, but then the context became too complicated and we converted the JSON to RDF from code.
"Syncing" and versioning that external context is cool, but it adds another layer of complexity (one that is probably not needed, but it shows that JSON-LD is indeed by nature very different from all other RDF formats).

@JJ-Author
Contributor

https://gstore-playground.tools.dbpedia.org/file/ shows an error. How do you list all files?

@manonthegithub
Collaborator

manonthegithub commented Jun 10, 2024

I am still against converting gstore to a document store, but here are the proposed changes:

Key Changes

  1. Document Preservation: Save documents in their original format instead of converting to JSON-LD. (new)
  2. Automatic RDF Extraction: Extract RDF data upon saving and store it in Virtuoso. (was there)
  3. Change of API [Option 1] -> not supported by swagger (will not be well present in docs), but nicer:
    • Save Document and RDF:
      • Endpoint: /document/<path>
      • Method: POST
      • Action: Saves the document and extracts RDF to Virtuoso.
    • Retrieve Document:
      • Endpoint: /document/<path>
      • Method: GET
      • Action: Retrieves the raw document.
    • Retrieve RDF Graph of a document:
      • Endpoint: /graph/<path>
      • Method: GET
      • Action: Returns the extracted RDF graph.
  4. Change of API [Option 2] -> supported by swagger -> easier to use:
    • Save Document and RDF:
      • Endpoint: /document/save
      • Param: path = "<path>"
      • Method: POST
      • Action: Saves the document and extracts RDF to Virtuoso.
    • Retrieve Document:
      • Endpoint: /document/read
      • Param: path = "<path>"
      • Method: GET
      • Action: Retrieves the raw document.
    • Retrieve RDF Graph of a document:
      • Endpoint: /graph/read
      • Param: path = "<path>"
      • Method: GET
      • Action: Returns the extracted RDF graph.
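
The intended behaviour of the three operations (save keeps the raw document and extracts RDF, one read returns the raw document, the other returns the graph) can be modelled in a few lines of Python. This is a toy model under stated assumptions, not gstore's implementation: the class and method names are hypothetical, plain dictionaries stand in for git and Virtuoso, and `toy_extract` is a made-up placeholder for a real RDF parser.

```python
import json
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


class InMemoryDocStore:
    """Toy model: the raw document is kept as is, the graph is extracted on save."""

    def __init__(self, extract: Callable[[str], List[Triple]]) -> None:
        self._extract = extract                      # hypothetical parser hook
        self._documents: Dict[str, str] = {}         # stands in for git
        self._graphs: Dict[str, List[Triple]] = {}   # stands in for Virtuoso

    def document_save(self, path: str, body: str) -> None:
        triples = self._extract(body)   # raises on unparsable input, so nothing is stored
        self._documents[path] = body    # unmodified original
        self._graphs[path] = triples

    def document_read(self, path: str) -> str:
        return self._documents[path]

    def graph_read(self, path: str) -> List[Triple]:
        return self._graphs[path]


def toy_extract(body: str) -> List[Triple]:
    """Placeholder extraction: keys that look like IRIs become predicates."""
    doc = json.loads(body)
    if not isinstance(doc, dict) or "@id" not in doc:
        return []
    return [(doc["@id"], key, str(value)) for key, value in doc.items()
            if key.startswith("http")]


store = InMemoryDocStore(toy_extract)
body = '{"@id": "http://example.org/a", "http://example.org/p": "v", "note": "plain json"}'
store.document_save("a.jsonld", body)
```

Here `store.document_read("a.jsonld")` returns the body byte for byte, including the non-LD `note` field, while `store.graph_read("a.jsonld")` returns only the extracted triple.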

@manonthegithub
Collaborator

Option 2 for now seems better, we can switch to Option 1 if very much needed in future without much effort.

@kurzum
Member

kurzum commented Jun 11, 2024

Here is a slightly differently phrased version of Option 2.
I think we can make the parameter ?uri, but ?path+?prefix might also be ok, maybe better than ?repo+?path+?prefix.

would look like this:
/graph/save?uri=https://databus.dbpedia.org/adrian1703/20news/talk.religion.misc/18828/dataid.jsonld


PURE GRAPH MODE
Endpoint      Parameter        Behaviour
graph/read    ?uri=$GRAPHURI   returns parsed version
graph/save    ?uri=$GRAPHURI   parses POST body and commits parsed version as JSON-LD
graph/delete  ?uri=$GRAPHURI   deletes from GIT and drops graph from triple store

DOC MODE
Endpoint      Parameter        Behaviour
graph/read    ?uri=$GRAPHURI   returns parsed version
graph/write                    DEACTIVATED
graph/delete                   DEACTIVATED
doc/save      ?uri=$GRAPHURI   extracts graph from body, but commits body as is
doc/read      ?uri=$GRAPHURI   returns GIT version
doc/delete    ?uri=$GRAPHURI   deletes from GIT and drops graph from triple store
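
One practical detail of a `?uri=` parameter is that the graph URI has to be percent-encoded when it is placed in the query string. A minimal Python sketch of building such a request URL follows; the helper name `endpoint_url` and the base address `http://localhost:8080` are assumptions for illustration, while the endpoint name and graph URI are taken from this comment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def endpoint_url(base: str, endpoint: str, graph_uri: str) -> str:
    """Build '<base>/<endpoint>?uri=<percent-encoded graph URI>'."""
    return f"{base.rstrip('/')}/{endpoint}?{urlencode({'uri': graph_uri})}"


graph_uri = ("https://databus.dbpedia.org/adrian1703/20news/"
             "talk.religion.misc/18828/dataid.jsonld")
# The base address is hypothetical; only the shape of the URL matters here.
url = endpoint_url("http://localhost:8080", "doc/save", graph_uri)
print(url)

# The server side gets the original URI back after decoding the query string.
decoded = parse_qs(urlsplit(url).query)["uri"][0]
```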

@manonthegithub
Collaborator

manonthegithub commented Jun 11, 2024

It is repo + path + prefix. The previous message listed only the changes, so repo and prefix remain.

Graph mode just won't be there, only doc mode. Graph mode is the current version.

Only uri won't work, because it is then not clear which part is the prefix and which is the repo. The uri can contain arbitrarily many segments in the prefix part, so path + prefix + repo is the best option.
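
The ambiguity can be made concrete with a short Python sketch. It assumes, purely for illustration, that a graph URI is composed as prefix + repo + "/" + path; the function name `candidate_splits` is hypothetical. Given only the URI, every segment that still leaves at least one segment for the path is a possible repo.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def candidate_splits(uri: str) -> List[Tuple[str, str, str]]:
    """List every (prefix, repo, path) decomposition that rebuilds the URI."""
    parts = urlsplit(uri)
    origin = f"{parts.scheme}://{parts.netloc}"
    segments = [segment for segment in parts.path.split("/") if segment]
    splits = []
    # The repo may be any segment except the last, which must belong to the path.
    for i in range(len(segments) - 1):
        prefix = "/".join([origin] + segments[:i]) + "/"
        repo = segments[i]
        path = "/".join(segments[i + 1:])
        splits.append((prefix, repo, path))
    return splits


uri = "https://databus.dbpedia.org/adrian1703/20news/talk.religion.misc/18828/dataid.jsonld"
for prefix, repo, path in candidate_splits(uri):
    print(f"prefix={prefix} repo={repo} path={path}")
```

For the example URI from this thread there are four equally plausible decompositions, which is why the server cannot infer repo and prefix from the URI alone.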

@kurzum
Member

kurzum commented Jun 11, 2024

Ok, right, so the prefix, repo part is answered.
Open questions:

  • Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.
  • Does the Doc Mode require a file ending?
  • Does the content-type on post need to match the file ending?

@manonthegithub
Collaborator

manonthegithub commented Jun 11, 2024

> Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

No, these are different systems with different approaches to working with data; we do either one or the other, not both.
So the choice here is whether we treat data as docs or as graphs.

> Does the Doc Mode require a file ending?

Yes, it must be there; there is no other way to know what kind of data it is (when reading via /graph/read), unless we store metadata for that. If we decide to use metadata, then Content-Type may be used again.

> Does the content-type on post need to match the file ending?

The Content-Type will be ignored; only the Accept header for /graph/read will be used.

@kurzum
Member

kurzum commented Jun 12, 2024

>> Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

> No, these are different systems with different approaches to working with data; we do either one or the other, not both. So the choice here is whether we treat data as docs or as graphs.

Hm, really? The underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start, you get pure graph or doc mode. My question was how difficult it would be code-wise, and I think we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

>> Does the Doc Mode require a file ending?

> Yes, it must be there; there is no other way to know what kind of data it is (when reading via /graph/read), unless we store metadata for that. If we decide to use metadata, then Content-Type may be used again.

Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

>> Does the content-type on post need to match the file ending?

> The Content-Type will be ignored; only the Accept header for /graph/read will be used.

I meant on POST to doc/save; this is where Content-Type is/should be sent by the client. Will posting "Content-Type: text/turtle" to ?path=file.jsonld throw an error then?
We would need a mapping of file endings to content types then.

@manonthegithub
Collaborator

> I meant on POST to doc/save; this is where Content-Type is/should be sent by the client. Will posting "Content-Type: text/turtle" to ?path=file.jsonld throw an error then?
> We would need a mapping of file endings to content types then.

The answer is the same as before. It will be ignored. I understood what you meant.

> Hm, really? The underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start, you get pure graph or doc mode. My question was how difficult it would be code-wise, and I think we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

Running different services in the same container is just super weird; I won't do that. If you want to keep the old gstore, we just need to fork the repo or make a special branch, that is it.

> Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

We should have one single source of truth/data, not many, and so far it was git, not Virtuoso. That is why we parse the document rather than query Virtuoso.

@manonthegithub
Collaborator

One problem may occur in the future: when we get the same media type/extension (like json) but there are several ways RDF is stored in it, we won't be able to detect the right parser just by extension. This will need extra information about the parser.

@kurzum
Member

kurzum commented Jun 12, 2024

>> I meant on POST to doc/save; this is where Content-Type is/should be sent by the client. Will posting "Content-Type: text/turtle" to ?path=file.jsonld throw an error then?
>> We would need a mapping of file endings to content types then.

> The answer is the same as before. It will be ignored. I understood what you meant.

Please answer with enough detail. It still sounds like you will implement a connection reset/connection timeout. But I am asking about the HTTP status code and what causes it, e.g. ".jsonld" in the URI will trigger the use of the JSON-LD parser, and if the body doesn't parse then "400 Bad Request". Is that it?

>> Hm, really? The underlying functionality is the same. GIT and Virtuoso just accept data. I would implement it with two different servlet/scalatra implementations and different web.xml and swagger. Depending on which one you start, you get pure graph or doc mode. My question was how difficult it would be code-wise, and I think we should only do the doc mode for now, but implement it in a way that we can add a different servlet implementation later.

> Running different services in the same container is just super weird; I won't do that. If you want to keep the old gstore, we just need to fork the repo or make a special branch, that is it.

We can make two Docker containers. I totally don't care if this would be in different branches.

>> Ok, so graph/read would need this as internal input to select the parser. SELECT ?g {GRAPH ?g {?s ?p ?o} } on virtuoso is probably not implemented currently.

> We should have one single source of truth/data, not many, and so far it was git, not Virtuoso. That is why we parse the document rather than query Virtuoso.

The main purpose of Virtuoso is to query the graph data; it is hard for me to think of the docs as the only way we are allowed to get graph data. Also, a) it should be consistent, not eventually consistent, and b) editing is done on the doc, so the SSoT is not violated. Even before, in pure graph mode, there were two synchronized SSoTs, which was the idea behind GSTORE.

@kurzum
Member

kurzum commented Jun 12, 2024

> One problem may occur in the future: when we get the same media type/extension (like json) but there are several ways RDF is stored in it, we won't be able to detect the right parser just by extension. This will need extra information about the parser.

File endings are our convention anyhow, as there are no standard file endings, just media types. So keeping a list "file-ending" -> "parser" on each deployment would be enough.
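
Such a per-deployment list could be as simple as a lookup table. The sketch below is in Python and is only an illustration: the table contents, the name `PARSER_BY_ENDING` and the function `select_parser` are assumptions, not gstore configuration. It also mirrors the statement earlier in the thread that Content-Type is ignored and only the file ending selects the parser.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical per-deployment convention: file ending -> parser name.
PARSER_BY_ENDING: Dict[str, str] = {
    ".jsonld": "JSON-LD",
    ".ttl": "Turtle",
    ".nt": "N-Triples",
    ".rdf": "RDF/XML",
}


def select_parser(path: str, content_type: Optional[str] = None) -> Optional[str]:
    """Pick a parser from the file ending only.

    content_type is accepted but deliberately unused, to show that a
    mismatching header does not change the outcome.
    """
    ending = PurePosixPath(path).suffix.lower()
    return PARSER_BY_ENDING.get(ending)  # None means: no parser known for this ending


# A Turtle Content-Type on a .jsonld path still selects the JSON-LD parser.
print(select_parser("group/artifact/dataid.jsonld", content_type="text/turtle"))
```

A deployment that needs two different RDF encodings inside plain `.json` files would have to introduce distinct endings or extra metadata, which is the limitation raised in the quoted comment.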

@manonthegithub
Collaborator

> Please answer with enough detail. It still sounds like you will implement a connection reset/connection timeout. But I am asking about the HTTP status code and what causes it, e.g. ".jsonld" in the URI will trigger the use of the JSON-LD parser, and if the body doesn't parse then "400 Bad Request". Is that it?

Then formulate the question with enough detail: what exactly do you want to know (e.g. status codes etc.)? Mention everything. I am not reading your mind. Really annoying.

I don't know how that is not clear, I just don't understand. I get really annoyed, as it looks like trolling.

"Content-Type is ignored" means it is not checked or used anywhere in the code. How is that not clear?
The extension will be used for determining the parser. How is that not clear?
Depending on the parser response and the process that follows, the status code will be determined. It is the same as it was before, nothing changes. What else do you need?

Please think a little bit with your own head before asking, or ask ChatGPT to give explanations of my responses.

> We can make two Docker containers. I totally don't care if this would be in different branches.

It should be clear that I mean a servlet container, not Docker containers. Again, please formulate the things you want precisely. You did not want to ask whether that is much work; you actually want it now. So you would just like to keep both of them. Then just mention that explicitly.

Here we can also tag the latest commit in gstore, and that's it for now.

> File endings are our convention anyhow, as there are no standard file endings, just media types. So keeping a list "file-ending" -> "parser" on each deployment would be enough.

Every media type has a standard file ending; some of them have several.
This is for the future; atm there is no problem with that. If we make it a convention, then it's not a problem anymore.

@manonthegithub
Collaborator

> Would it be easy to implement a config option for pure graph mode, doc mode? Maybe we don't need it right now.

@kurzum I thought about this option a bit more, and it actually makes sense too: if we keep two frontends but share the code base in the deeper logic, this also works. Now I don't know which solution is actually better, forking/tagging or keeping both together... Both are valid. I will do the main part first, and then we may decide to have the second service as well, as an extra feature...

@manonthegithub
Collaborator

#41
During development I found out that I am using Scalatra in a somewhat unexpected way when uploading documents. To support binary data and really large files (or heavy load), we will have to change the API again for /document/save: we will need to use multipart form data parameters (that is how Scalatra is supposed to work with files; I may investigate a bit more, but it looks that way). This won't be a major change in the code, but it will affect the API (multipart instead of a classic POST).
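
For clients, the difference is that the document is sent as a file part inside a `multipart/form-data` body rather than as the raw request body. A minimal Python sketch of building such a body by hand follows. The helper name `build_multipart` and the form field name `file` are assumptions; the thread does not say which field name /document/save expects.

```python
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, content: bytes,
                    content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Return (body, Content-Type header value) for a single-file multipart upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    # The closing delimiter is the boundary followed by two extra dashes.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


# "file" is a hypothetical field name used only for this example.
payload, header_value = build_multipart("file", "doc.json", b'{"a": 1}', "application/json")
```

The returned `header_value` goes into the request's Content-Type header, and because the file content is passed through as bytes, binary documents survive unchanged.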

manonthegithub added a commit that referenced this issue Jun 19, 2024
manonthegithub added a commit that referenced this issue Jun 25, 2024
manonthegithub added a commit that referenced this issue Jun 25, 2024
@manonthegithub
Collaborator

Should work now, can be tested. @holycrab13 @JonathanJustavino

5 participants