Consider multithreading #8

Open

csarven opened this issue Aug 6, 2013 · 6 comments
csarven (Member) commented Aug 6, 2013

Currently a single CPU core is used to process RDF files. Consider multithreading so that all available cores in the system are used (or let the user set the number from the command line, e.g., --cpu 4). I suspect this would give a considerable speed-up when dealing with N-Triples files.
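A minimal sketch of the command-line side, assuming argparse and a hypothetical --cpu flag that defaults to all available cores:

    import argparse
    import multiprocessing

    # Hypothetical option parsing for lodstats; --cpu is the flag
    # suggested above, defaulting to every core the system reports.
    parser = argparse.ArgumentParser(description='lodstats')
    parser.add_argument('--cpu', type=int,
                        default=multiprocessing.cpu_count(),
                        help='number of worker processes (default: all cores)')
    args = parser.parse_args()
    print('using %d worker processes' % args.cpu)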

jandemter (Contributor) commented

Thanks for your suggestion. Since we could already achieve parallelism for our workload by simply processing several of the many datasets we handle in parallel, we have not put much effort into this yet.
True multithreading is unfortunately not well supported in CPython (and in PyPy as well, I think) because of the global interpreter lock, so we would have to use multiprocessing (or Jython) instead to get a performance improvement.
Implementing this should not be too hard. Ivan, would this be of interest for your future work with lodstats?
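A minimal sketch of the multiprocessing route, assuming a hypothetical process_dataset(path) entry point that runs the lodstats checks over one file; each worker is a separate OS process, so the GIL is not a bottleneck:

    import multiprocessing

    def process_dataset(path):
        # hypothetical worker: run the lodstats checks over one file
        # and return its statistics (a placeholder count here)
        return path, 0

    if __name__ == '__main__':
        paths = ['a.nt', 'b.nt', 'c.nt']  # the datasets to process
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        for path, stats in pool.map(process_dataset, paths):
            print('%s: %s' % (path, stats))
        pool.close()
        pool.join()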

ghost assigned earthquakesan Aug 7, 2013
earthquakesan (Member) commented

The parallelism Jan is talking about is the processing of several datasets at the same time. Sarven, what you would like to achieve is processing of a single dataset on several cores (please correct me if I am wrong). I will look into the issue.

csarven (Member, Author) commented Aug 7, 2013

Ivan, that's correct: one dataset at a time, handled by multiple cores. If an N-Triples file is split into as many chunks as there are cores available (or as many as the user requests), it should in theory reduce the total processing time, probably to something like O(n / number-of-cores). IIRC, it is O(n) now.
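Since N-Triples is one triple per line, a single file can be split on line boundaries without breaking the syntax. A rough sketch under that assumption, with a hypothetical process_chunk worker standing in for the real per-triple checks:

    import multiprocessing

    def process_chunk(lines):
        # hypothetical worker: run the per-triple checks over one chunk;
        # the triple count stands in for the real statistics
        return len(lines)

    def chunked(lines, n):
        # split the list of lines into n roughly equal chunks
        size = max(1, (len(lines) + n - 1) // n)
        return [lines[i:i + size] for i in range(0, len(lines), size)]

    if __name__ == '__main__':
        cores = multiprocessing.cpu_count()  # or the value of a --cpu flag
        with open('dataset.nt') as f:
            lines = f.readlines()  # note: loads the whole file into memory
        pool = multiprocessing.Pool(processes=cores)
        partials = pool.map(process_chunk, chunked(lines, cores))
        print(sum(partials))  # join the per-chunk results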

One additional enhancement (this can be another issue if it is not already done or mentioned in the paper; sorry, it is late and I am too lazy to check): if some criteria can use the results of other criteria, the reusable criteria should be computed early on. If the results of those criteria are not available (e.g., because a criterion is not checked), a criterion can compute what it needs itself. Basically, this is about treating criteria as dependencies of one another where applicable.
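A sketch of the dependency idea, with hypothetical criterion names: each criterion declares what it needs, results are computed once and cached, and a criterion whose dependencies are missing simply computes them itself:

    # Hypothetical criteria; 'deps' lists the criteria whose results
    # a criterion reuses, and 'run' computes its own result.
    CRITERIA = {
        'triple_count': {'deps': [], 'run': lambda r: 42},
        'avg_per_class': {'deps': ['triple_count'],
                          'run': lambda r: r['triple_count'] / 7.0},
    }

    def evaluate(name, results):
        if name in results:
            return results[name]      # reuse an earlier criterion's result
        for dep in CRITERIA[name]['deps']:
            evaluate(dep, results)    # compute missing dependencies first
        results[name] = CRITERIA[name]['run'](results)
        return results[name]

    results = {}
    for criterion in CRITERIA:
        evaluate(criterion, results)
    print(results)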

Aklakan (Member) commented Aug 16, 2013

Just a note: I have a use case where I need to process a bunch of files, and xargs seems to serve it well:

ls *.gz | xargs -P 4 -n 1 do-something
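(-P 4 runs up to four processes at a time, and -n 1 passes one file name per invocation.)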

Source: http://blog.labrat.info/20100429/using-xargs-to-do-parallel-processing/

I am about to try this out now.

csarven (Member, Author) commented Aug 16, 2013

Claus, that's in line with Jan's thoughts. What I'm suggesting requires splitting a single large file (which is our void:Dataset) into an appropriate number of chunks (e.g., the number of cores on the system), letting lodstats do its thing in each process, and then joining the results back together for the final output.
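For the join step, a sketch of merging per-process partial results into one summary (hypothetical stat names; each real criterion would need its own merge rule, e.g. counters add up and sets union):

    def merge_stats(partials):
        # combine per-chunk partial statistics into one result
        total = {'triples': 0, 'classes': set()}
        for p in partials:
            total['triples'] += p['triples']
            total['classes'] |= p['classes']
        return total

    partials = [{'triples': 10, 'classes': {'foaf:Person'}},
                {'triples': 12, 'classes': {'foaf:Person', 'foaf:Document'}}]
    print(merge_stats(partials))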

Aklakan (Member) commented Aug 16, 2013

I know, that's why I said it's for the use case with multiple files, and I wanted to give a concrete example of how it is supposed to work. For multi-core or multi-node processing (if it's done properly, there is little overhead in going from one scenario to the other), I would favor making the code usable with e.g. Hadoop (the winner of the Billion Triple Challenge in 2010 did exactly this [1]), but that's not important to me right now, so I won't request it. *g*

[1] http://challenge.semanticweb.org/submissions.html
