ensemble processing idea for dask distributed #542
pavlis
started this conversation in
Design & Development
While working on the upcoming Earthscope short course on MsPASS I was reading through the new documentation on dask distributed. It made me realize there is a potentially very useful model for processing large ensembles, one that could both improve throughput and reduce the likelihood of memory bloat killing a job.
In a nutshell, the idea is to use dask's newer approach to concurrent computing with the "Futures" interface described here. How to use futures in MsPASS is the question I had a bit of an epiphany on. The idea is simple: if you need to process the members of a large ensemble, submit the algorithm as a future that modifies the ensemble (or reduces it) and runs asynchronously. In that model, I think you could run map operations on the atomic members and call compute at the end of the function submitted as a future. The scheduler would treat that set of tasks like any task in our normal map-reduce model, but the atomic tasks could be handled by the scheduler and memory bloat would be reduced.
I think the basic code would define an ensemble function like this:
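The code block that followed here is missing, so what follows is a hedged reconstruction of the shape described above, not the author's actual example. The ensemble structure, the `example_atomic` name (taken from the next paragraph), and the `detrend` algorithm are all stand-ins; in MsPASS the members would be `TimeSeries` objects and the inner map could be a dask bag with a `compute` at the end.

```python
def detrend(member):
    # Hypothetical atomic algorithm: remove the mean from one member's samples.
    mean = sum(member["data"]) / len(member["data"])
    member["data"] = [x - mean for x in member["data"]]
    return member

def example_atomic(ensemble):
    """Apply an atomic algorithm to every member of an ensemble.

    A plain loop stands in for the map-over-members step; in the model
    described above this whole function would be submitted as a future.
    """
    ensemble["members"] = [detrend(m) for m in ensemble["members"]]
    return ensemble
```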
Now that example may not be exactly right. I'm not sure the assignments will work correctly in the example_atomic function. The real point is using it as a future, which would typically be submitted something like this:
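The second code block is also missing from the post, so this is a hedged sketch of the submit/result pattern it described. With dask this would be `client.submit(example_atomic, ensemble)` on a `dask.distributed.Client`; here the standard-library executor, which has the same submit/result shape, stands in so the sketch runs anywhere. The ensemble and the doubling algorithm are illustrative stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def double_members(ensemble):
    # Hypothetical ensemble algorithm: double every sample of every member.
    ensemble["members"] = [
        {"data": [2 * x for x in m["data"]]} for m in ensemble["members"]
    ]
    return ensemble

ensemble = {"members": [{"data": [1, 2]}, {"data": [3, 4]}]}
with ThreadPoolExecutor() as executor:
    # With dask.distributed this line would be client.submit(double_members, ensemble).
    future = executor.submit(double_members, ensemble)
    result = future.result()  # blocks until the asynchronous task finishes
```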
In that algorithm a big unknown is whether dask would handle the mix of futures with a bag in scheduling. That is, however, only one way to exploit this idea. The bigger point is the generic idea of using submit to send a large ensemble to the cluster to be processed asynchronously.
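To make the generic idea concrete, here is a hedged sketch of submitting several whole ensembles and gathering the results as each future completes. Again the standard-library executor stands in for `dask.distributed.Client`, and `process_ensemble` is a hypothetical per-ensemble reduce.

```python
from concurrent.futures import ThreadPoolExecutor

def process_ensemble(ensemble):
    # Hypothetical per-ensemble reduce: sum the samples of each member.
    return [sum(m["data"]) for m in ensemble["members"]]

ensembles = [
    {"members": [{"data": [1, 2]}, {"data": [3, 4]}]},
    {"members": [{"data": [5, 6]}]},
]
results = []
with ThreadPoolExecutor() as executor:
    # Each ensemble becomes one task; the scheduler runs them concurrently.
    futures = [executor.submit(process_ensemble, e) for e in ensembles]
    results = [f.result() for f in futures]
```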