Consider multithreading #8

Open

csarven opened this issue Aug 6, 2013 · 6 comments
csarven (Member) commented Aug 6, 2013

Currently a single CPU core is used to process RDF files. Consider multithreading so that all available cores in the system are used (or let the user set the number from the command line, e.g., --cpu 4). I suspect this would give a considerable speed-up when dealing with N-Triples files.
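A minimal sketch of the command-line side, assuming argparse and a hypothetical --cpu flag that defaults to all available cores:

    import argparse
    import multiprocessing

    # Hypothetical option parsing for lodstats; --cpu is the flag
    # suggested above, defaulting to every core the system reports.
    parser = argparse.ArgumentParser(description='lodstats')
    parser.add_argument('--cpu', type=int,
                        default=multiprocessing.cpu_count(),
                        help='number of worker processes (default: all cores)')
    args = parser.parse_args()
    print('using %d worker processes' % args.cpu)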

jandemter (Contributor) commented

Thanks for your suggestion. Since we could already achieve parallelism for our workload by simply processing several of the many datasets we handle in parallel, we have not put much effort into this yet.
True multithreading is unfortunately not well supported in CPython (and in PyPy as well, I think) because of the global interpreter lock, so we would have to use multiprocessing (or Jython) instead to get a performance improvement.
Implementing this should not be too hard. Ivan, would this be of interest for your future work with lodstats?
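A minimal sketch of the multiprocessing route, assuming a hypothetical process_dataset(path) entry point that runs the lodstats checks over one file; each worker is a separate OS process, so the GIL is not a bottleneck:

    import multiprocessing

    def process_dataset(path):
        # hypothetical worker: run the lodstats checks over one file
        # and return its statistics (a placeholder count here)
        return path, 0

    if __name__ == '__main__':
        paths = ['a.nt', 'b.nt', 'c.nt']  # the datasets to process
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        for path, stats in pool.map(process_dataset, paths):
            print('%s: %s' % (path, stats))
        pool.close()
        pool.join()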

ghost assigned earthquakesan Aug 7, 2013
earthquakesan (Member) commented

The parallelism Jan is talking about is the processing of several datasets at the same time. Sarven, what you would like to achieve is processing of a single dataset on several cores (please correct me if I am wrong). I will look into the issue.

csarven (Member, Author) commented Aug 7, 2013

Ivan, that's correct: one dataset at a time, handled by multiple cores. If an N-Triples file is split into as many chunks as there are cores available (or as many as the user requests), it should in theory reduce the total processing time, probably to something like O(n / number-of-cores). IIRC, it is O(n) now.
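Since N-Triples is one triple per line, a single file can be split on line boundaries without breaking the syntax. A rough sketch under that assumption, with a hypothetical process_chunk worker standing in for the real per-triple checks:

    import multiprocessing

    def process_chunk(lines):
        # hypothetical worker: run the per-triple checks over one chunk;
        # the triple count stands in for the real statistics
        return len(lines)

    def chunked(lines, n):
        # split the list of lines into n roughly equal chunks
        size = max(1, (len(lines) + n - 1) // n)
        return [lines[i:i + size] for i in range(0, len(lines), size)]

    if __name__ == '__main__':
        cores = multiprocessing.cpu_count()  # or the value of a --cpu flag
        with open('dataset.nt') as f:
            lines = f.readlines()  # note: loads the whole file into memory
        pool = multiprocessing.Pool(processes=cores)
        partials = pool.map(process_chunk, chunked(lines, cores))
        print(sum(partials))  # join the per-chunk results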

One additional enhancement (this can be another issue if it is not already done or mentioned in the paper; sorry, it is late and I am too lazy to check): if some criteria can use the results of other criteria, the reusable criteria should be computed early on. If the results of those criteria are not available (e.g., because a criterion is not checked), a criterion can compute what it needs itself. Basically, this is about treating criteria as dependencies of one another where applicable.
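A sketch of the dependency idea, with hypothetical criterion names: each criterion declares what it needs, results are computed once and cached, and a criterion whose dependencies are missing simply computes them itself:

    # Hypothetical criteria; 'deps' lists the criteria whose results
    # a criterion reuses, and 'run' computes its own result.
    CRITERIA = {
        'triple_count': {'deps': [], 'run': lambda r: 42},
        'avg_per_class': {'deps': ['triple_count'],
                          'run': lambda r: r['triple_count'] / 7.0},
    }

    def evaluate(name, results):
        if name in results:
            return results[name]      # reuse an earlier criterion's result
        for dep in CRITERIA[name]['deps']:
            evaluate(dep, results)    # compute missing dependencies first
        results[name] = CRITERIA[name]['run'](results)
        return results[name]

    results = {}
    for criterion in CRITERIA:
        evaluate(criterion, results)
    print(results)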

Aklakan (Member) commented Aug 16, 2013

Just a note: I have a use case where I need to process a bunch of files, and xargs seems to serve it well:

ls *.gz | xargs -P 4 -n 1 do-something
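(-P 4 runs up to four processes at a time, and -n 1 passes one file name per invocation.)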

Source: http://blog.labrat.info/20100429/using-xargs-to-do-parallel-processing/

I am about to try this out now.

csarven (Member, Author) commented Aug 16, 2013

Claus, that's in line with Jan's thoughts. What I'm suggesting requires splitting a single large file (which is our void:Dataset) into an appropriate number of chunks (e.g., the number of cores on the system), letting lodstats do its thing in each process, and then joining the results back together for the final output.
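For the join step, a sketch of merging per-process partial results into one summary (hypothetical stat names; each real criterion would need its own merge rule, e.g. counters add up and sets union):

    def merge_stats(partials):
        # combine per-chunk partial statistics into one result
        total = {'triples': 0, 'classes': set()}
        for p in partials:
            total['triples'] += p['triples']
            total['classes'] |= p['classes']
        return total

    partials = [{'triples': 10, 'classes': {'foaf:Person'}},
                {'triples': 12, 'classes': {'foaf:Person', 'foaf:Document'}}]
    print(merge_stats(partials))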

Aklakan (Member) commented Aug 16, 2013

I know, that's why I said it's for the use case with multiple files, and I wanted to give a concrete example of how it is supposed to work. For multi-core or multi-node processing (if it's done properly, there is little overhead in going from one scenario to the other), I would favor making the code usable with e.g. Hadoop (the winner of the Billion Triple Challenge in 2010 did exactly this [1]), but that's not important to me right now, so I won't request it. *g*

[1] http://challenge.semanticweb.org/submissions.html
