How to load a gzipped N-Triples file? #47
I would also prefer the piping option. However, #25 does not provide any actual stdin support, only enables it. Any volunteers?
We should probably extend #25 so that stdin is in scope.
I see that branch now. Everything else -- compression, stdin, custom C++ code calling HDT -- could be built around that.
As per #44, the problem with the C++ input stream, if I remember correctly, is that SERD uses a C mechanism that was not directly compatible. Is that right, @joachimvh?
@RubenVerborgh Thanks for making that connection. @joachimvh It would be great if Somebody™ could fix the input stream issue. Serializing and/or unpacking files prior to starting the HDT generation is a big performance hit.
I honestly don't remember anymore; I would have to look into it again.
If you are using Linux or macOS, there is a trick called "process substitution" that allows you to create an anonymous FIFO pipe so the process thinks it is reading from a file:
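For example, something along these lines lets rdf2hdt read a gzipped N-Triples file without writing an intermediate copy (the file names are placeholders):

```bash
# zcat decompresses on the fly; <(...) exposes its output as an anonymous
# FIFO, so rdf2hdt can open it as if it were a regular file.
./rdf2hdt <(zcat data.nt.gz) data.hdt
```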
I would love it if serd supported gzip, though :-)
Hello. I will have to take a look through your code to understand what's going on here better, but yes, serd uses standard C FILE* I/O. As a minimal, dependency-free library I don't think it would be appropriate to put gzip support into serd itself, but the API should make this possible. What I think might work is to add something like a custom read callback; the easiest route would be allowing the user to pass in a read function of their own.
Allowing to pass in an input source like that would work.
Hi, I agree with both of you. It's not worth complicating serd with libz, but having some abstraction like a custom read would enable anybody to push data directly to serd without intermediate files when using it as a library. Thanks for your help! :-)
@MarioAriasGa: process substitution doesn't seem to work with HDT. It generates a file that returns zero results for queries, and the header info indicates no triples are stored.

With process substitution:

With a file:
@LaurensRietveld You are right; I think it's because the C++ version uses two passes, while the Java one does it in one pass and should work. @drobilla Awesome, I will give it a try :-)
I pushed the serd-gzip branch with some initial code. It seems to work but needs some testing. Thanks again @drobilla! :-)
Checking out your code I realized I botched the API slightly by still only allowing a bool to be passed for "paging" rather than an arbitrary size, which is what makes sense for a custom sink function. The revised version is in drobilla/serd@52d3653. I updated your code to use the new API, but I was curious whether the simpler variant would also suffice.

I haven't done very extensive testing, but things seem to work at first glance. It would be quite nice if rdfhdt could take advantage of serd's streaming Turtle support in the other direction to do abbreviated dumps of very large data sets, but that's a task for another day... (and you can currently accomplish this by piping the output through an external serializer).
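A rough sketch of such a pipeline (the serdi flags and the use of "-" for stdin/stdout here are assumptions, not something verified in this thread):

```bash
# Dump the HDT file as N-Triples and re-serialize it as Turtle with serdi;
# the -i/-o flags and "-" for standard input are assumed, not verified.
./hdt2rdf data.hdt - | serdi -i ntriples -o turtle - > data.ttl
```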
Really nice work @drobilla!
Awesome work @drobilla, thanks very much for your help. I merged the changes :-D
Cool, you're welcome. I'll kick out a new release in a bit after checking things over.
Can we close this issue, @drobilla, @wouterbeek?
Seems fine to me.
I'm still not able to stream to HDT :( E.g., the following MWE does not return the echoed triple (tested on the master branch):

./rdf2hdt <( echo "<x:x> <y:y> <z:z> ." ) test.hdt && ./hdt2rdf test.hdt -

Is there something I'm missing? Are others able to stream to HDT?
What I understood from @webdata was that rdf2hdt has to stream twice through the input file: once for creating the dictionary, and once for the triples. That would explain why the above doesn't work (it only creates the dictionary).
@LaurensRietveld Does that also work with process substitution, as in the example given by @MarioAriasGa? I'm not sure how to indicate to the command-line tool that two passes are required. BTW, what's the status of the serd-gzip branch? Should it be merged in?
Can we adjust the README to reflect this?
I don't expect rdf2hdt to work in a streaming fashion (either with process substitution or regular pipes) when it has to do two runs through the file. So we should either support gzip in hdt (again), or we should support building an HDT file in one pass (if I understand correctly, that's what Java does; I'm not sure about its limitations and overhead though).
Sorry for the rambling above; it turns out that importing gzip does work (I had to update a simple CLI check to get it to work: #151).
Considering this closed.
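For reference, the resulting usage looks roughly like this, assuming the gzipped input is detected from the .gz file extension (file names are placeholders):

```bash
# Build an HDT file directly from a gzipped N-Triples file, then dump it
# back to N-Triples on stdout to check that the triples made it in.
./rdf2hdt data.nt.gz data.hdt && ./hdt2rdf data.hdt -
```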
I note this is closed, but is there any chance that .ttl.bz2 could also be supported likewise?
The best thing would be if HDT supported arbitrary input streams. E.g., you would be able to do something like the following (without having to perform decompression for various formats from within the HDT codebase):
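(A hypothetical sketch only: "-" for reading from standard input is an assumption, and as noted below this does not work today.)

```bash
# Hypothetical: stream decompressed triples straight into rdf2hdt via stdin.
# rdf2hdt currently has no stdin mode, so "-" here is purely illustrative.
zcat  data.nt.gz   | ./rdf2hdt - data.hdt
bzcat data.ttl.bz2 | ./rdf2hdt - data.hdt
```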
Unfortunately, the C++ implementation must stream through the data twice, so this streaming behavior is inherently difficult to implement ATM.
One could always adapt the one-pass importer from HDT-Java; it should be perfectly possible and would allow this kind of streaming. Anyone interested in coding it in C++?
On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd). What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?