How to load a gzipped N-Triples file? #47

Closed
wouterbeek opened this issue Jan 16, 2017 · 30 comments · May be fixed by #186

@wouterbeek
Contributor

On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd).

What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?

@RubenVerborgh
Member

On stable, this probably uses the old built-in N-Triples parser, which was severely broken, so it was replaced by SERD in bc7a258.

After #25, a gzipped pipeline should be possible again (and we can then remove the gzip dependency from hdt-cpp).

@mielvds
Member

mielvds commented Jan 16, 2017

I would also prefer the piping option. However, #25 does not add any actual stdin support; it only makes it possible. Any volunteers?

@RubenVerborgh
Member

We should probably extend #25 so that stdin is in scope.

@wouterbeek
Contributor Author

I see now that branch stable no longer exists. Does this mean that master is now the stable branch? The most generic option would be to have a function that loads N-Triples/N-Quads data (simply skipping the graph argument, if present) from a C++ input stream.

Everything else -- compression, stdin, custom C++ code calling HDT -- could be built around that.

@RubenVerborgh
Member

As per #44, master is the old stable branch and develop is where development happens.

The problem with the C++ input stream, if I remember correctly, is that SERD uses a C mechanism that was not directly compatible. Is that right, @joachimvh?

@wouterbeek
Contributor Author

@RubenVerborgh Thanks for making master the stable branch.

@joachimvh It would be great if Somebody™ could fix the input stream issue. Serializing and/or unpacking files prior to starting the HDT generation is a big performance hit.

@joachimvh
Contributor

I honestly don't remember anymore, would have to look into it again.

@mielvds mielvds added this to the Early 2017 milestone Jan 16, 2017
@mielvds mielvds modified the milestones: Late 2017, Early 2017 Feb 6, 2017
@MarioAriasGa
Member

If you are using Linux or macOS, there is a trick called "process substitution" that creates an anonymous FIFO pipe, so the process thinks it is reading from a file:

$ rdf2hdt <( gzip -cd file.nt.gz ) out.hdt

I would love it if serd supported gzip though :-)

@drobilla
Contributor

Hello. I will have to take a look through your code to understand what's going on here better, but yes, serd uses standard C FILE* I/O.

Since serd is a minimal, dependency-free library, I don't think it would be appropriate to put gzip support into serd itself, but the API should make this possible. What I think might work is to add something like a serd_reader_start_custom_stream, which allows the user to pass a callback for reading data. Would this work on your end?

The easiest route would be allowing the user to pass a function exactly like fread (which serd already uses) and a corresponding stream pointer. Then you could hook this up directly to zlib or do your own thing if necessary. I think this can be done without breaking the ABI.
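
To sketch the idea, here is roughly what such an fread-compatible callback would look like on the caller's side. This is a minimal, self-contained example, not serd code: serd_reader_start_custom_stream is only a proposed name at this point, so it appears only in a comment, and the read loop below just stands in for the parser pulling blocks.

#include <stdio.h>

/* An fread-compatible read callback: same signature as fread, with the
 * last argument acting as an opaque stream handle that the reader would
 * pass back on every call. */
static size_t my_read(void* buf, size_t size, size_t nmemb, void* stream)
{
    return fread(buf, size, nmemb, (FILE*)stream);
}

int main(void)
{
    /* Stand-in for the parser's read loop. With the proposed API this
     * would instead be something like
     *   serd_reader_start_custom_stream(reader, my_read, stdin, "stdin")
     * (hypothetical name and signature). */
    char   buf[4096];
    size_t total = 0;
    size_t n;
    while ((n = my_read(buf, 1, sizeof(buf), stdin)) > 0) {
        total += n;
    }
    printf("read %zu bytes through the callback\n", total);
    return 0;
}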

@JanWielemaker
Contributor

Allowing the caller to pass in an fread-like function with a void* handle providing the stream context should be enough to add gzip support in applications like the HDT library. I think that would be a simple, clean solution from serd's point of view.

@MarioAriasGa
Member

Hi, I agree with both of you. It's not worth complicating serd with libz, but having some abstraction like a custom read would enable anybody to push data directly to serd without intermediate files when using it as a library.

Thanks for your help! :-)

@LaurensRietveld
Member

@MarioAriasGa : process substitution doesn't seem to work with HDT. It generates a file that returns zero results for queries, and the header info indicates no triples are stored.

With process substitution:

$ rdf2hdt  <( cat /tmp/1 ) test && hdtInfo test | grep void
<file:///dev/fd/63> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///dev/fd/63> <http://rdfs.org/ns/void#triples> "0" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#properties> "1622" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#distinctSubjects> "1413" .
<file:///dev/fd/63> <http://rdfs.org/ns/void#distinctObjects> "1622" .

With a file:

$ rdf2hdt  /tmp/1 test && hdtInfo test | grep void
<file:///tmp/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///tmp/1> <http://rdfs.org/ns/void#triples> "1622" .
<file:///tmp/1> <http://rdfs.org/ns/void#properties> "1622" .
<file:///tmp/1> <http://rdfs.org/ns/void#distinctSubjects> "1413" .
<file:///tmp/1> <http://rdfs.org/ns/void#distinctObjects> "1622" .

@drobilla
Contributor

drobilla/serd@1ae7934

@MarioAriasGa
Member

@LaurensRietveld You are right, I think it's because the C++ version uses two passes, the Java one does it in one pass and should work.

@drobilla Awesome, I will give it a try :-)

@MarioAriasGa
Member

I pushed a branch (serd-gzip) with some initial code. It seems to work but it needs some testing.

Thanks again @drobilla ! :-)

@drobilla
Contributor

Checking out your code, I realized I botched the API slightly by still only allowing a bool to be passed for "paging" rather than an arbitrary size, which is what makes sense for a custom sink function. The revised version is in drobilla/serd@52d3653

I updated your code to use the new API, but I was curious whether the simpler gzread API would work (so there aren't three paging mechanisms going at once), and it seems fine. So I made that change, and... quite a few others, since this got a bit rabbit-holey, as these things do :)

#61

I haven't done very extensive testing but things seem to work at first glance.

It would be quite nice if rdfhdt could take advantage of serd's streaming Turtle support in the other direction to do abbreviated dumps of very large data sets, but that's a task for another day... (and you can currently accomplish this by piping through serdi anyway, if more slowly).
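
For the curious, here is a minimal sketch of the gzread hookup (not the actual code in #61): wrap zlib's gzread in an fread-shaped function and hand the gzFile over as the opaque stream handle. Only the zlib calls (gzopen, gzread, gzclose) are real APIs here; the read loop merely stands in for the parser.

#include <stdio.h>
#include <zlib.h>

/* fread-shaped wrapper around zlib's gzread, so a gzFile can be used
 * wherever an fread-like callback plus opaque handle is expected. */
static size_t gz_read(void* buf, size_t size, size_t nmemb, void* stream)
{
    const int n = gzread((gzFile)stream, buf, (unsigned)(size * nmemb));
    return (n > 0) ? (size_t)n / size : 0;
}

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE.nt.gz\n", argv[0]);
        return 1;
    }

    gzFile gz = gzopen(argv[1], "rb");
    if (!gz) {
        perror("gzopen");
        return 1;
    }

    /* Stand-in for the parser's read loop, pulling decompressed blocks. */
    char   buf[4096];
    size_t total = 0;
    size_t n;
    while ((n = gz_read(buf, 1, sizeof(buf), gz)) > 0) {
        total += n;
    }
    printf("decompressed %zu bytes\n", total);

    gzclose(gz);
    return 0;
}

Compile with -lz; pointing it at a .nt.gz file should print the decompressed byte count.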

@mielvds
Member

mielvds commented Apr 25, 2017

Really nice work @drobilla !

@MarioAriasGa
Member

Awesome work @drobilla, thanks very much for your help. I merged the changes :-D

@drobilla
Contributor

Cool, you're welcome. I'll kick out a new release in a bit after checking things over.

@RubenVerborgh
Member

Can we close this issue, @drobilla, @wouterbeek?

@drobilla
Contributor

drobilla commented Jul 9, 2017

Seems fine to me.

@wouterbeek
Contributor Author

I'm still not able to stream to HDT :( E.g., the following MWE does not return the echoed triple (tested on the master branch):

./rdf2hdt <( echo "<x:x> <y:y> <z:z> ." ) test.hdt && ./hdt2rdf test.hdt -

Is there something I'm missing? Are others able to stream to HDT?

@wouterbeek wouterbeek reopened this Oct 2, 2017
@LaurensRietveld
Member

What I understood from @webdata is that rdf2hdt has to stream through the input file twice: once to create the dictionary, and once for the triples. That would explain why the above doesn't work (it only creates the dictionary).
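
A quick way to see why the two-pass approach breaks with process substitution: <( ... ) hands the tool a /dev/fd/N path that is really a pipe, and a pipe cannot be rewound for a second pass. A minimal illustration in plain C (not hdt-cpp code; the program and file names are just for the example):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    FILE* f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* First pass: consume everything (think of rdf2hdt's dictionary pass). */
    char buf[4096];
    while (fread(buf, 1, sizeof(buf), f) > 0) {
    }

    /* Second pass: rewinding works on a regular file, but fails with
     * "Illegal seek" (ESPIPE) on a pipe created by process substitution. */
    if (fseek(f, 0, SEEK_SET) != 0) {
        fprintf(stderr, "second pass impossible: %s\n", strerror(errno));
    } else {
        fprintf(stderr, "second pass OK: input is seekable\n");
    }

    fclose(f);
    return 0;
}

Running it as ./seektest <( cat data.nt ) reports the seek failure, while ./seektest data.nt succeeds.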

@wouterbeek
Contributor Author

wouterbeek commented Oct 2, 2017

@LaurensRietveld Does that also work with process substitution, as in the example given by @MarioAriasGa ? I'm not sure how to indicate to the command line tool that two passes are required.

BTW what's the status of the branch serd-gzip? Should this be merged into develop, and if so, would it make streaming through process substitution possible?

@mielvds
Member

mielvds commented Oct 18, 2017

Can we adjust the README to reflect this?

@LaurensRietveld
Member

I don't expect rdf2hdt to work in a streaming fashion (either with process substitution or regular pipes) when it has to do two runs through the file. So we should either support gzip in hdt (again), or we should support building an HDT file in one pass (if I understand correctly, that's what the Java implementation does; I'm not sure about its limitations and overhead though).
But feel free to correct me if I'm way off here ;)

@LaurensRietveld
Member

LaurensRietveld commented Jan 6, 2018

Sorry for the rambling above; it turns out that importing gzip does work (I had to update a simple CLI check to get it to work: #151).

$ rdf2hdt input.nt.gz test.hdt && hdtInfo test.hdt | grep "void#triples"
<file:///input.nt.gz> <http://rdfs.org/ns/void#triples> "3500000" .

Considering this closed

@AxelPolleres

While I note this is closed, is there any chance that .ttl.bz2 could also be supported?

$ rdf2hdt test.ttl.bz2 test.hdt
ERROR: Detected "bz2" input format. Must be one of:
	- n3
	- ntriples or nt
	- nquads or nq
	- turtle or ttl

@wouterbeek
Contributor Author

The best thing would be if HDT supported arbitrary input streams. E.g., you would be able to do something like the following (without having to perform decompression for various formats from within the HDT codebase):

$ bzcat test.ttl.bz2 | rdf2hdt -o test.hdt

Unfortunately, the C++ implementation must stream through the data twice, so this streaming behavior is inherently difficult to implement ATM.

@webdata
Contributor

webdata commented Mar 29, 2018

One could always adapt the one-pass importer from HDT-Java; it should be perfectly possible and would allow this kind of streaming. Anyone interested in coding it in C++?
