Consider multithreading #8

Currently, a single CPU is used to process RDF files. Consider multithreading and use all CPUs available in the system (or let the user specify the number on the command line, e.g. --cpu 4). I suspect this would give a considerable increase in speed when dealing with N-Triples files.
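A minimal sketch of what such an option could look like (the flag name, defaults, and argument names are hypothetical, not the actual lodstats CLI):

```python
# Sketch of a hypothetical --cpu option; the real lodstats CLI may differ.
import argparse
from multiprocessing import cpu_count

parser = argparse.ArgumentParser(description="Compute RDF statistics")
parser.add_argument("file", help="N-Triples file to process")
parser.add_argument(
    "--cpu",
    type=int,
    default=cpu_count(),  # use all available cores unless told otherwise
    help="number of worker processes to use",
)
args = parser.parse_args()
print(f"Processing {args.file} with {args.cpu} worker(s)")
```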
Thanks for your suggestion. Since we can achieve parallelism for our workload simply by processing several of the many datasets we handle in parallel, we have not put much effort into this yet.
The parallelism Jan is talking about relates to processing several datasets at the same time. Sarven, what you would like to achieve is processing the same dataset on several cores (please correct me if I am wrong). I will look into the issue.
Ivan, that's correct: one dataset at a time, handled by multiple cores. If an N-Triples file is split into as many parts as there are cores available (or as many as the user specifies), it should (in theory) reduce the total processing time to roughly O(n/number-of-cores); IIRC, it is O(n) now.

One additional enhancement (this can be another issue if it is not already done or mentioned in the paper; sorry, it is late and I'm too lazy to check): if some criteria can use the results of other criteria, the reusable criteria should be handled early on. If the results from those criteria are not available (e.g., because a criterion is not checked), a criterion can compute them itself. Basically, this is about treating criteria as dependencies of one another where applicable.
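A minimal sketch of that dependency idea, assuming hypothetical criterion names and Python's standard-library graphlib (this is not the actual lodstats code): each criterion declares which other criteria it can reuse, and criteria are then evaluated in topological order so shared results are already available when needed.

```python
# Sketch only: run criteria in dependency order so reusable results come first.
from graphlib import TopologicalSorter

# Hypothetical criteria; each maps to the criteria whose results it can reuse.
dependencies = {
    "triples": set(),                  # plain triple count, no inputs
    "distinct_subjects": set(),
    "avg_out_degree": {"triples", "distinct_subjects"},  # reuses both counts
}

results = {}

def run_criterion(name, available):
    # Placeholder: a real criterion would compute its statistic here,
    # reading whatever it needs from `available` (its dependencies' results).
    return {"criterion": name, "uses": sorted(available)}

# static_order() yields each criterion only after all of its dependencies.
for criterion in TopologicalSorter(dependencies).static_order():
    inputs = {dep: results[dep] for dep in dependencies[criterion]}
    results[criterion] = run_criterion(criterion, inputs)

print(results)
```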
Just a note: I have a use case where I need to process a bunch of files, and xargs seems to serve that use case well:

ls *.gz | xargs -P 4 -n 1 do-something

Source: http://blog.labrat.info/20100429/using-xargs-to-do-parallel-processing/

I am about to try this out now.
Claus, that's in line with Jan's thoughts. What I'm suggesting needs to split a single large file (which is our void:Dataset) into an appropriate number of parts (e.g., the number of cores on the system), let lodstats do its thing in each process, and then join the results back together for the final output.
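A minimal sketch of that split/process/join pipeline, using Python's multiprocessing and a placeholder per-chunk statistics function (the real lodstats entry point may look quite different). Because each N-Triples line is a self-contained triple, the file can be cut at line boundaries, each chunk processed in its own worker, and the partial results merged at the end:

```python
# Sketch only: split an N-Triples file by lines, process chunks in parallel,
# and merge the partial results. `stats_for_lines` stands in for whatever
# lodstats computes per triple; it is not the real API.
from multiprocessing import Pool, cpu_count

def stats_for_lines(lines):
    # Placeholder criteria: triple count and the set of subjects in this chunk.
    triples = 0
    subjects = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            triples += 1
            subjects.add(line.split(" ", 1)[0])
    return triples, subjects

def parallel_stats(path, workers=None):
    workers = workers or cpu_count()
    with open(path) as f:
        lines = f.readlines()
    # Every N-Triples line is an independent triple, so cutting at line
    # boundaries is safe.
    chunk_size = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(stats_for_lines, chunks)
    # Join the per-chunk results back into one final output.
    total_triples = sum(t for t, _ in partials)
    all_subjects = set().union(*(s for _, s in partials)) if partials else set()
    return {"triples": total_triples, "distinct_subjects": len(all_subjects)}

if __name__ == "__main__":
    print(parallel_stats("dataset.nt", workers=4))
```

For very large files one would stream byte offsets instead of reading all lines into memory, but the split/process/join structure stays the same.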
I know, that's why I said it's for the use case with multiple files, and I wanted to give a concrete example of how it is supposed to work. For multi-core or multi-node processing (if it's done properly, there is little overhead in getting from one scenario to the other), I would prefer the code to be made usable with e.g. Hadoop (the winner of the Billion Triple Challenge in 2010 was exactly about this [1]), but that's not important to me right now, so I won't request it *g*