How to load a gzipped N-Triples file? #47

Closed
wouterbeek opened this issue Jan 16, 2017 · 30 comments · May be fixed by #186

@wouterbeek
Contributor

On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd).

What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?

@RubenVerborgh
Member

On stable, this probably uses the old built-in N-Triples parser, which was severely broken, so it was replaced by SERD in bc7a258.

After #25, a gzipped pipeline should be possible again (and we can then remove the gzip dependency from hdt-cpp).

@mielvds
Member

mielvds commented Jan 16, 2017

I would also prefer the piping option. However, #25 does not add any actual stdin support; it only makes it possible. Any volunteers?

@RubenVerborgh
Member

We should probably extend #25 so that stdin is in scope.

@wouterbeek
Contributor Author

I see now that branch stable no longer exists. Does this mean that master is now the stable branch? The most generic option would be to have a function that loads N-Triples/N-Quads data (simply skipping the graph argument, if present) from a C++ input stream.

Everything else -- compression, stdin, custom C++ code calling HDT -- could be built around that.

@RubenVerborgh
Member

As per #44, master is the old stable branch and develop is where development happens.

The problem with the C++ input stream, if I remember correctly, is that SERD uses a C mechanism that was not directly compatible. Is that right, @joachimvh?

@wouterbeek
Contributor Author

@RubenVerborgh Thanks for making master the stable branch.

@joachimvh It would be great if Somebody™ could fix the input stream issue. Serializing and/or unpacking files prior to starting the HDT generation is a big performance hit.

@joachimvh
Contributor

I honestly don't remember anymore, would have to look into it again.

@mielvds mielvds added this to the Early 2017 milestone Jan 16, 2017
@mielvds mielvds modified the milestones: Late 2017, Early 2017 Feb 6, 2017
@MarioAriasGa
Member

If you are using Linux or macOS, there is a trick called "process substitution" that creates an anonymous FIFO pipe, so the process thinks it is reading from a file:

$ rdf2hdt <( gzip -cd file.nt.gz ) out.hdt

I would love it if serd supported gzip though :-)

@drobilla
Contributor

Hello. I will have to take a look through your code to understand what's going on here better, but yes, serd uses standard C FILE* I/O.

Since serd is a minimal, dependency-free library, I don't think it would be appropriate to put gzip support into serd itself, but the API should make this possible. What I think might work is to add something like a serd_reader_start_custom_stream, which allows the user to pass a callback for reading data. Would this work on your end?

The easiest route would be allowing the user to pass a function exactly like fread (which serd already uses) and a corresponding stream pointer. Then you could hook this up directly to zlib or do your own thing if necessary. I think this can be done without breaking the ABI.
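
To sketch the idea, here is roughly what such an fread-compatible callback would look like on the caller's side. This is a minimal, self-contained example, not serd code: serd_reader_start_custom_stream is only a proposed name at this point, so it appears only in a comment, and the read loop below just stands in for the parser pulling blocks.

#include <stdio.h>

/* An fread-compatible read callback: same signature as fread, with the
 * last argument acting as an opaque stream handle that the reader would
 * pass back on every call. */
static size_t my_read(void* buf, size_t size, size_t nmemb, void* stream)
{
    return fread(buf, size, nmemb, (FILE*)stream);
}

int main(void)
{
    /* Stand-in for the parser's read loop. With the proposed API this
     * would instead be something like
     *   serd_reader_start_custom_stream(reader, my_read, stdin, "stdin")
     * (hypothetical name and signature). */
    char   buf[4096];
    size_t total = 0;
    size_t n;
    while ((n = my_read(buf, 1, sizeof(buf), stdin)) > 0) {
        total += n;
    }
    printf("read %zu bytes through the callback\n", total);
    return 0;
}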

@JanWielemaker
Contributor

Allowing the caller to pass in an fread-like function with a void* handle providing the stream context should be enough to add gzip support in applications like the HDT library. I think that would be a simple, clean solution from serd's point of view.

@MarioAriasGa
Member

Hi, I agree with both of you. It's not worth complicating serd with libz, but having some abstraction like a custom read would enable anybody to push data directly to serd without intermediate files when using it as a library.

Thanks for your help! :-)

@LaurensRietveld
Member

@MarioAriasGa : process substitution doesn't seem to work with HDT. It generates a file that returns zero results for queries, and the header info indicates no triples are stored.

With process substitution:

$ rdf2hdt  <( cat /tmp/1 ) test && hdtInfo test | grep void
<file:///dev/fd/63> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///dev/fd/63> <http://rdfs.org/ns/void#triples> "0" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#properties> "1622" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#distinctSubjects> "1413" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#distinctObjects> "1622" .

With a file:

$ rdf2hdt  /tmp/1 test && hdtInfo test | grep void
<file:///tmp/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///tmp/1> <http://rdfs.org/ns/void#triples> "1622" .
<file:///tmp/1> <http://rdfs.org/ns/void#properties> "1622" .
<file:///tmp/1> <http://rdfs.org/ns/void#distinctSubjects> "1413" .
<file:///tmp/1> <http://rdfs.org/ns/void#distinctObjects> "1622" .

@drobilla
Contributor

drobilla/serd@1ae7934

@MarioAriasGa
Member

@LaurensRietveld You are right, I think it's because the C++ version uses two passes, the Java one does it in one pass and should work.

@drobilla Awesome, I will give it a try :-)

@MarioAriasGa
Member

I pushed a branch (serd-gzip) with some initial code. It seems to work but it needs some testing.

Thanks again @drobilla ! :-)

@drobilla
Contributor

Checking out your code, I realized I botched the API slightly by still only allowing a bool to be passed for "paging" rather than an arbitrary size, which is what makes sense for a custom sink function. The revised version is in drobilla/serd@52d3653

I updated your code to use the new API, but I was curious whether the simpler gzread API would work (so there aren't three paging mechanisms going at once), and it seems fine. So I made that change, and... quite a few others, since this got a bit rabbit-holey, as these things do :)

#61

I haven't done very extensive testing but things seem to work at first glance.

It would be quite nice if rdfhdt could take advantage of serd's streaming Turtle support in the other direction to do abbreviated dumps of very large data sets, but that's a task for another day... (and you can currently accomplish this by piping through serdi anyway, if more slowly).
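
For the curious, here is a minimal sketch of the gzread hookup (not the actual code in #61): wrap zlib's gzread in an fread-shaped function and hand the gzFile over as the opaque stream handle. Only the zlib calls (gzopen, gzread, gzclose) are real APIs here; the read loop merely stands in for the parser.

#include <stdio.h>
#include <zlib.h>

/* fread-shaped wrapper around zlib's gzread, so a gzFile can be used
 * wherever an fread-like callback plus opaque handle is expected. */
static size_t gz_read(void* buf, size_t size, size_t nmemb, void* stream)
{
    const int n = gzread((gzFile)stream, buf, (unsigned)(size * nmemb));
    return (n > 0) ? (size_t)n / size : 0;
}

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE.nt.gz\n", argv[0]);
        return 1;
    }

    gzFile gz = gzopen(argv[1], "rb");
    if (!gz) {
        perror("gzopen");
        return 1;
    }

    /* Stand-in for the parser's read loop, pulling decompressed blocks. */
    char   buf[4096];
    size_t total = 0;
    size_t n;
    while ((n = gz_read(buf, 1, sizeof(buf), gz)) > 0) {
        total += n;
    }
    printf("decompressed %zu bytes\n", total);

    gzclose(gz);
    return 0;
}

Compile with -lz; pointing it at a .nt.gz file should print the decompressed byte count.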

@mielvds
Member

mielvds commented Apr 25, 2017

Really nice work @drobilla !

@MarioAriasGa
Member

Awesome work @drobilla, thanks very much for your help. I merged the changes :-D

@drobilla
Contributor

Cool, you're welcome. I'll kick out a new release in a bit after checking things over.

@RubenVerborgh
Member

Can we close this issue, @drobilla, @wouterbeek?

@drobilla
Contributor

drobilla commented Jul 9, 2017

Seems fine to me.

@wouterbeek
Contributor Author

I'm still not able to stream to HDT :( E.g., the following MWE does not return the echoed triple (tested on the master branch):

./rdf2hdt <( echo "<x:x> <y:y> <z:z> ." ) test.hdt && ./hdt2rdf test.hdt -

Is there something I'm missing? Are others able to stream to HDT?

@wouterbeek wouterbeek reopened this Oct 2, 2017
@LaurensRietveld
Member

What I understood from @webdata is that rdf2hdt has to stream through the input file twice: once to create the dictionary, and once for the triples. That would explain why the above doesn't work (it only creates the dictionary).
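
A quick way to see why the two-pass approach breaks with process substitution: <( ... ) hands the tool a /dev/fd/N path that is really a pipe, and a pipe cannot be rewound for a second pass. A minimal illustration in plain C (not hdt-cpp code; the program and file names are just for the example):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    FILE* f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* First pass: consume everything (think of rdf2hdt's dictionary pass). */
    char buf[4096];
    while (fread(buf, 1, sizeof(buf), f) > 0) {
    }

    /* Second pass: rewinding works on a regular file, but fails with
     * "Illegal seek" (ESPIPE) on a pipe created by process substitution. */
    if (fseek(f, 0, SEEK_SET) != 0) {
        fprintf(stderr, "second pass impossible: %s\n", strerror(errno));
    } else {
        fprintf(stderr, "second pass OK: input is seekable\n");
    }

    fclose(f);
    return 0;
}

Running it as ./seektest <( cat data.nt ) reports the seek failure, while ./seektest data.nt succeeds.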

@wouterbeek
Contributor Author

wouterbeek commented Oct 2, 2017

@LaurensRietveld Does that also work with process substitution, as in the example given by @MarioAriasGa ? I'm not sure how to indicate to the command line tool that two passes are required.

BTW what's the status of the branch serd-gzip? Should this be merged into develop, and if so, would it make streaming through process substitution possible?

@mielvds
Member

mielvds commented Oct 18, 2017

Can we adjust the README to reflect this?

@LaurensRietveld
Member

I don't expect rdf2hdt to work in a streaming fashion (either with process substitution or regular pipes) when it has to do two runs through the file. So we should either support gzip in hdt (again), or we should support building an HDT file in one pass (if I understand correctly, that's what the Java implementation does; I'm not sure about its limitations and overhead though).
But feel free to correct me if I'm way off here ;)

@LaurensRietveld
Member

LaurensRietveld commented Jan 6, 2018

Sorry for the rambling above; it turns out that importing gzip does work (I had to update a simple CLI check to get it to work: #151).

$ rdf2hdt input.nt.gz test.hdt && hdtInfo test.hdt | grep "void#triples"
<file:///input.nt.gz> <http://rdfs.org/ns/void#triples> "3500000" .

Considering this closed

@AxelPolleres

While I note this is closed, is there any chance that .ttl.bz2 could also be supported?

$ rdf2hdt test.ttl.bz2 test.hdt
ERROR: Detected "bz2" input format. Must be one of:
	- n3
	- ntriples or nt
	- nquads or nq
	- turtle or ttl

@wouterbeek
Contributor Author

The best thing would be if HDT supported arbitrary input streams. E.g., you would be able to do something like the following (without having to perform decompression for various formats from within the HDT codebase):

$ bzcat test.ttl.bz2 | rdf2hdt -o test.hdt

Unfortunately, the C++ implementation must stream through the data twice, so this streaming behavior is inherently difficult to implement ATM.

@webdata
Contributor

webdata commented Mar 29, 2018

One could always adapt the one-pass importer from HDT-Java; it should be perfectly possible and would allow this kind of streaming. Anyone interested in coding it in C++?
