Pass TES endpoints to WES request #170
Comments
@uniqueg This is an interesting idea that would definitely provide greater synergy between
Hi @patmagee, unfortunately I'm not on top of my GitHub notifications at all and didn't see your answer. They are all excellent points! I think we have thought about all of them, but I am not sure that we came to great/convincing solutions for some of them. Anyway, let me give it a try.
Not yet, but we have most of the pieces and just need to finalize some things and then tie them together. Specifically, we are working on an ELIXIR Cloud & AAI service registry (an implementation of the GA4GH Service Registry API) that would then hold all our Cloud API deployments, including TES backends. As I had mentioned in the original post, we have this very naive task distribution logic package/service TEStribute and we also have a sort of gateway TES service (proTES; 75% done) where we could plug that in as middleware and then implement a client that fetches available TES instances from the service registry dynamically (currently, TES instances are hard coded). It will probably work to some extent, but (a) TEStribute makes some assumptions about DRS and TES that are beyond current specs, (b) it would help if the GA4GH Service Registry supported fetching only services that are live and healthy and that a user is actually allowed to use, and (c) the whole AAI flow is still not well/fully designed and won't be interoperable, so that requires discussions here, with TES, with Cloud WS in general, with FASP and with Passport... Also, as I mentioned, TEStribute is really quite naive at the moment, and it doesn't consider restrictions on where data can move (if at all) etc. Still, it should be good enough to be able to prototype this at some point, possibly at the next Plenary in fall.
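To make the "client that fetches available TES instances from the service registry" a bit more concrete, here is a minimal sketch in Python. It assumes that registry entries follow the GA4GH Service Registry shape, i.e. each entry has a `type` object with `group`, `artifact` and `version` plus a `url`. The function name `select_tes_urls`, the sample entries and the `example.org` URLs are hypothetical, and the sketch works on an already fetched response rather than making a network call. It does not address points (b) and (c) above: health and authorization are not checked.

```python
from typing import Any


def select_tes_urls(services: list[dict[str, Any]]) -> list[str]:
    """Return the URLs of all registry entries whose type artifact is 'tes'.

    Entries without a 'url' or without a matching 'type.artifact' are skipped.
    """
    urls = []
    for service in services:
        service_type = service.get("type", {})
        artifact = str(service_type.get("artifact", "")).lower()
        if artifact == "tes" and "url" in service:
            urls.append(service["url"])
    return urls


# Hypothetical, already decoded response of a registry's service listing.
registry_response = [
    {
        "id": "org.example.tes-a",
        "name": "TES A",
        "type": {"group": "org.ga4gh", "artifact": "tes", "version": "1.0.0"},
        "url": "https://tes-a.example.org",
    },
    {
        "id": "org.example.wes",
        "name": "WES",
        "type": {"group": "org.ga4gh", "artifact": "wes", "version": "1.0.0"},
        "url": "https://wes.example.org",
    },
    {
        "id": "org.example.tes-b",
        "name": "TES B",
        "type": {"group": "org.ga4gh", "artifact": "TES", "version": "1.0.0"},
        "url": "https://tes-b.example.org",
    },
]

tes_urls = select_tes_urls(registry_response)
print(tes_urls)
```

A gateway such as proTES could call something like this periodically instead of reading a hard-coded list, but how often to refresh and how to handle unreachable instances are left open here.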
I would say it should certainly be an opt-in feature. WES on its own, even locally deployed and even in a single-tenant environment, has good use cases, and we don't expect that to change. Federating compute at the level of workflows, tasks and even individual computations (a whole other topic!) is important, because it serves use cases that otherwise couldn't be done, or couldn't easily be done. However, I don't think it is likely to dominate how computation is done in the life sciences anytime soon (if ever).
As I mentioned, the AAI flow is still not really worked out, to a large extent because we felt that Passports and how they are to be consumed in a WES and DRS context are still very much in flux. Basically, we are waiting for this to settle down a bit. However, we have thought about it, of course, and to answer your more specific questions:
In principle, data access would work just like in WES.
Hmm, I'd venture that running tasks across different backends in this setup is a feature rather than a problem (think of load balancing, running workflows that during different steps need to access data at different locations that cannot move, etc.). So how can we ensure that costs are minimized? In TEStribute, we basically allow users to decide whether to prioritize costs or runtime for a given task. TES then has a sidecar service (or possibly a future TES endpoint) that users can query with their task resource requirements and DRS URIs and which returns the estimated costs (sum of compute, data transfer and storage costs) and runtime (sum of waiting time, expected runtime and data transfer time) for each combination of input locations and TES instance (more details in this slide deck). If staying on a given TES for different tasks is MUCH cheaper than using another TES, compute will happen at that same TES (unless it's going to be much slower and the user specified that they value a lower runtime over lower costs). That costs vs runtime parameter in our implementation is a float between 0 (cost-optimized only) and 1 (runtime-optimized only). Putting a reasonable value there (not 1!) should ensure that costs won't skyrocket.

One other thing that leaves us a bit clueless is how to pass that extra info (like the costs/runtime param and whether they want to use a TES network at all) from the user to WES to TES, and how to implement it such that it won't become overly complicated for the user. But after all, this is something to figure out (at least partially) even in a (federated) WES-only world. Using TES would just add another layer to that, one that may help curb costs, balance loads and execute workflows in cases where data that cannot move resides at different places.

Anyway, a lot to be discussed and done here, still, but good to have a start. And we might be able to have a prototype at some point next year, if things go well.
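The costs vs runtime parameter described above can be illustrated with a small Python sketch. This is not TEStribute's actual algorithm: the min-max normalization, the linear weighting, the `Estimate` class, the `rank` function and the sample numbers are all assumptions made for illustration. Only the meaning of the weight (0 = cost-optimized only, 1 = runtime-optimized only) is taken from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-TES estimate for one task."""
    tes_url: str
    cost: float     # compute + data transfer + storage, arbitrary currency units
    runtime: float  # waiting + execution + data transfer, in seconds


def _normalize(values: list[float]) -> list[float]:
    """Min-max scale values to [0, 1]; all zeros if the values are identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(estimates: list[Estimate], weight: float) -> list[Estimate]:
    """Order estimates from best to worst for the given weight.

    weight = 0 considers cost only, weight = 1 considers runtime only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if not estimates:
        return []
    costs = _normalize([e.cost for e in estimates])
    runtimes = _normalize([e.runtime for e in estimates])
    scored = [
        ((1.0 - weight) * c + weight * r, e)
        for c, r, e in zip(costs, runtimes, estimates)
    ]
    scored.sort(key=lambda pair: pair[0])  # lower score is better
    return [e for _, e in scored]


estimates = [
    Estimate("https://tes-a.example.org", cost=10.0, runtime=3600.0),
    Estimate("https://tes-b.example.org", cost=40.0, runtime=1200.0),
]

cheapest_first = rank(estimates, weight=0.0)
fastest_first = rank(estimates, weight=1.0)
print(cheapest_first[0].tes_url, fastest_first[0].tes_url)
```

With a weight strictly below 1, a TES that is much more expensive only wins if it is also much faster, which is the behaviour the comment relies on to keep costs from skyrocketing.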
I am also attaching this image to make the general design we envision a bit clearer:
For WES implementations that support TES (e.g. cwl-WES), there is currently no mechanism to provide a list of acceptable TES endpoints. As a result, either the TES endpoints themselves or a mechanism to obtain them (e.g., from a GA4GH Service Registry) must be pre-configured in a given WES instance.
To enable effective task-level compute federation, it would be beneficial to add a way to pass either a list of TES URLs or a list of Service Registry URLs from which TES instances can be obtained.
The mechanism could then be used to (1) allow a WES client to designate TES backends that the WES can use and (2) provide inputs to a task distribution middleware (e.g., TEStribute) integrated into a WES instance to "smartly" determine the best TES for a given task, e.g., the TES instance that is closest to the bulk of the data or has the lowest load.
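As a rough illustration of use (1), the following Python sketch shows how a WES instance might decide which TES backends to use for a run. Everything here is hypothetical: the function name `resolve_tes_backends`, the `required` flag and the fallback to pre-configured endpoints are assumptions about one possible policy, not part of the WES specification or of any existing implementation.

```python
from typing import Optional


def resolve_tes_backends(
    requested: Optional[list[str]],
    preconfigured: list[str],
    required: bool = False,
) -> list[str]:
    """Pick the TES endpoints to use for a single workflow run.

    requested:     endpoints passed by the client with the run request, if any
    preconfigured: endpoints configured in the WES instance itself
    required:      whether this WES instance insists on a client-provided list
    """
    if requested:
        # Strip trailing slashes and drop duplicates while keeping client order.
        return list(dict.fromkeys(url.rstrip("/") for url in requested))
    if required:
        raise ValueError("this WES instance requires TES endpoints in the request")
    return list(preconfigured)


from_client = resolve_tes_backends(
    ["https://tes-a.example.org/", "https://tes-a.example.org", "https://tes-b.example.org"],
    preconfigured=["https://tes-default.example.org"],
)
from_config = resolve_tes_backends(None, preconfigured=["https://tes-default.example.org"])
print(from_client, from_config)
```

The resulting list could then be handed to a task distribution middleware for use (2). Whether a client-provided list should replace or merely restrict the pre-configured endpoints is a policy question this sketch does not settle.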
The property should be optional in general, but any given WES implementation may choose to require it. Here's a quick-and-dirty draft for a corresponding schema:
Something like the following could then be added to the RunRequest schema:

Note that one could of course skip the addition of a way to pass Service Registry URLs, as the client application could make the call to get TES URLs itself before calling WES. However, I have added it as it will allow a WES implementation to dynamically obtain available TES instances during execution. This might be beneficial in some cases, especially for long-running workflows. However, it is not really essential and makes the added schema more complicated. A simpler alternative would be to just add the following property to the RunRequest schema: