Pass TES endpoints to WES request #170

Open
uniqueg opened this issue May 6, 2021 · 2 comments
@uniqueg
Contributor

uniqueg commented May 6, 2021

For WES implementations that support TES (e.g., cwl-WES), there is currently no mechanism to provide a list of acceptable TES endpoints, so the TES endpoints themselves, or a mechanism to obtain them (e.g., from a GA4GH Service Registry), have to be pre-configured in a given WES instance.

To enable effective task-level compute federation, it would be beneficial to add a way to pass either:

  • an array of TES instances that a WES could use
  • an array of Service Registries that WES could query to obtain a list of available TES instances

The mechanism could then be used to (1) allow a WES client to designate TES backends that the WES can use and (2) provide inputs to a task distribution middleware (e.g., TEStribute) integrated into a WES instance to "smartly" determine the best TES for a given task, e.g., the TES instance that is closest to the bulk of the data or has the lowest load.

The property should be optional in general, but any given WES implementation may choose to require it. Here's a quick-and-dirty draft for a corresponding schema:

TesBackends:
  type: object
  properties:
    tes_urls:
      type: array
      items:
        type: string
    service_registry_urls:
      type: array
      items:
        type: string
  description: TES backends to be used to execute the workflow run.

Something like the following could then be added to the RunRequest schema:

tes_backends:
  $ref: '#/definitions/TesBackends'
  description: TES backends to be used to execute the workflow run.
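
Just to illustrate from a client's perspective, the corresponding part of a run request could then look something like the following (purely illustrative; all URLs are made-up examples):

# Illustrative value for the proposed tes_backends property of a RunRequest;
# all URLs are made-up examples, not real endpoints.
tes_backends:
  tes_urls:
    - https://tes-1.example.org/ga4gh/tes/v1
    - https://tes-2.example.org/ga4gh/tes/v1
  service_registry_urls:
    - https://registry.example.org/ga4gh/registry/v1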

Note that one could of course skip the addition of a way to pass Service Registry URLs, as the client application could make the call to get TES URLs itself before calling WES. However, I have added it because it allows a WES implementation to dynamically obtain available TES instances during execution, which might be beneficial in some cases, especially for long-running workflows. On the other hand, it is not really essential and it makes the added schema more complicated. A simpler alternative would be to just add the following property to the RunRequest schema:

tes_backends:
  type: array
  items:
    type: string
    description: Root URL of a TES API.
  description: TES backends to be used to execute the workflow run.
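
With this simpler alternative, the corresponding part of a run request would just be a flat list of TES root URLs (again, made-up examples):

# Illustrative value for the simpler, flat tes_backends property.
tes_backends:
  - https://tes-1.example.org/ga4gh/tes/v1
  - https://tes-2.example.org/ga4gh/tes/v1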
@patmagee
Contributor

patmagee commented Jun 7, 2021

@uniqueg This is an interesting idea that would definitely provide greater synergy between WES and TES (something I think the spec greatly needs). It may also be a great path forward for defining a language-agnostic workflow engine. I do have a few questions (which you may or may not have thought of) that may help flesh out this idea some more.

  1. Is there a reference implementation that allows dynamic configuration of TES engines?
  2. How does this fit into the current landscape of WES implementations? Would we expect existing engines to be able to be dynamically configured, or is this more of an "opt-in" feature?
  3. Have you put any thought into how authentication would work? The WES engine would need appropriate permissions to submit to and poll the TES backend, at least until the task is done, if not longer (depending on how the engine works).
    • Would it be expected that a WES engine can communicate with ANY TES API without a pre-existing relationship with it (i.e., plumb down the user's credentials)?
    • Would a WES be able to define a list of pre-existing TESs that the user can submit jobs to? This would avoid a lot of auth issues and potentially even data access issues.
  4. How would data access work?
    • How does the WES engine access the runtime logs? Would the WES need to move the logs or stream them to a separate accessible location?
    • How would data access for the end user and between steps work? Would it be expected that every TES API work with DRS to be able to get access to the data? If so, what credentials would WES use?
  5. How would we be able to avoid data egress between steps? Theoretically, allowing the user to define any TES would allow them to run tasks in different environments or clouds, which could incur huge costs if not managed properly.

@uniqueg
Contributor Author

uniqueg commented Dec 9, 2021

Hi @patmagee, unfortunately I'm not on top of my GitHub notifications at all and didn't see your answer. They are all excellent points! I think we have thought about all of them, but I am not sure that we came to great/convincing solutions for some of them. Anyway, let me give it a try.

  1. Is there a reference implementation that allows dynamic configuration of TES engines?

Not yet, but we have most of the pieces and just need to finalize some things and then tie them together. Specifically, we are working on an ELIXIR Cloud & AAI service registry (an implementation of the GA4GH Service Registry API) that would then hold all our Cloud API deployments, including TES backends. As I had mentioned in the original post, we have this very naive task distribution logic package/service TEStribute, and we also have a sort of gateway TES service (proTES; 75% done) where we could plug that in as middleware and then implement a client that fetches available TES instances from the service registry dynamically (currently, TES instances are hard-coded). It will probably work to some extent, but (a) TEStribute makes some assumptions about DRS and TES that are beyond current specs, (b) it would help if the GA4GH Service Registry supported fetching only services that are live and healthy and that a user is actually allowed to use, and (c) the whole AAI flow is still not well/fully designed and won't be interoperable, so that requires discussions here, with TES, with Cloud WS in general, with FASP and with Passport... Also, as I mentioned, TEStribute is really quite naive at the moment, and it doesn't consider restrictions on where data can move (and if at all), etc. Still, it should be good enough to be able to prototype this at some point, possibly at the next Plenary in fall.

  2. How does this fit into the current landscape of WES implementations? Would we expect existing engines to be able to be dynamically configured, or is this more of an "opt-in" feature?

I would say it should certainly be an opt-in feature. WES on its own, even locally deployed and even in a single-tenant environment, has good use cases, and we don't expect that to change. Federating compute at the level of workflows, tasks and even individual computations (a whole other topic!) is important, because it serves use cases that otherwise couldn't be addressed, or couldn't be addressed easily. However, I don't think it is likely to dominate how computation is done in the life sciences anytime soon (if ever).

  3. Have you put any thought into how authentication would work? The WES engine would need appropriate permissions to submit to and poll the TES backend, at least until the task is done, if not longer (depending on how the engine works).
    Would it be expected that a WES engine can communicate with ANY TES API without a pre-existing relationship with it (i.e., plumb down the user's credentials)?
    Would a WES be able to define a list of pre-existing TESs that the user can submit jobs to? This would avoid a lot of auth issues and potentially even data access issues.

As I mentioned, the AAI flow is still not really worked out, to a large extent because we felt that Passports, and how they are to be consumed in the WES and DRS contexts, are still very dynamic. Basically, we are waiting for this to settle down a bit. However, we have thought about it, of course, and to answer your more specific questions:

  • Given that in a federated compute network different Cloud API services are likely to be operated by different legal entities (institutes, companies), possibly across different countries, some relationship between the users and the TES instances has to be established at the very least. But probably each WES and TES would also have to have a trust relationship with one another. Other than that, though, we feel that, on a technical level, interoperability should be guaranteed by the specs at some point (probably via Passport), so yes, plumbing down the users' credentials - and back up, to get a new token upon expiry, for example. Whether a WES supports TES in principle should probably be listed in some capabilities section of the service info (a rough sketch of what that could look like follows this list).
  • In our model, we are using a single gateway TES that the WES instances point to. In this design, individual WES instances wouldn't need to implement anything special. But that's just one design; nothing speaks against TES-aware WES implementations. Or one could create a TES-aware WES from any non-aware one by co-deploying and tightly coupling a WES with a gateway TES, such that from the outside it looks like a single WES that internally has a separate service running which takes care of distribution. There are certainly other, perhaps better options, but this is what we have been thinking about for now. As for auth, I could really imagine that the Service Registry could have an endpoint that would give you the services that the user has access to, though doing that on an individual resource level (as in DRS) is probably not feasible. Apart from actual auth, GDPR is also an issue here, because (as mentioned above) it's also users that need to explicitly grant permission to each service individually, and they need to be able to revoke it at any time.
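As a rough, purely illustrative sketch of such a capabilities hint in the WES service info (none of these field names exist in any current spec; they are made up for the sake of discussion):

# Purely illustrative extension of the WES service-info response; the
# tes_support block and all field names below are made up, not spec'd.
tes_support:
  enabled: true
  accepts_client_tes_urls: true
  accepts_service_registry_urls: true
  default_tes_urls:
    - https://tes.example.org/ga4gh/tes/v1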
  4. How would data access work?
    How does the WES engine access the runtime logs? Would the WES need to move the logs or stream them to a separate accessible location?
    How would data access for the end user and between steps work? Would it be expected that every TES API work with DRS to be able to get access to the data? If so, what credentials would WES use?

In principle, data access would work just like in WES.

  • WES can get real-time logs from TES, currently through polling, although we would like to see this improved by defining callbacks in the TES specs (this is possible in OpenAPI 3; a rough sketch of what such a callback could look like follows this list). Callbacks for status updates are also useful for WES itself; issues are available on both repos, in this one, e.g., here: Status callback #132. Generally, it's probably a good idea to have WES pull the logs, because (a) users would poll WES for them, not TES, and (b) a given WES that a user talks to guarantees certain data management policies (another thing to specify in the Service Info, perhaps), and that WES would do well not to have to rely on the data management policies of other services (which may be more restrictive and/or which they may or may not fulfill).
  • Now THAT's a good question :) But yes, having a lightweight DRS co-deployed with TES would be a possible solution, and the one we have been toying with. Perhaps one could again have capabilities in a TES service info that would say something like private-drs (and a URL to that service and maybe auth instructions). We have implemented DRS-Filer, a very slim DRS, for that purpose. As for auth in that scenario: if we plumb down the token and know who you are, we could also set the permissions for outputs on the private DRS such that the user (or their client on their behalf) has access.
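For what it's worth, here is a rough sketch of how such a callback could be expressed in OpenAPI 3; the callback_url field, the taskStateChange name and the payload shape are all made up and not part of the current TES spec:

# Hypothetical OpenAPI 3 callback on TES task creation; callback_url,
# taskStateChange and the payload shape are illustrative assumptions only.
paths:
  /tasks:
    post:
      operationId: CreateTask
      callbacks:
        taskStateChange:
          '{$request.body#/callback_url}':
            post:
              requestBody:
                content:
                  application/json:
                    schema:
                      type: object
                      properties:
                        id:
                          type: string
                        state:
                          type: string
              responses:
                '204':
                  description: Notification received.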
  5. How would we be able to avoid data egress between steps? Theoretically, allowing the user to define any TES would allow them to run tasks in different environments or clouds, which could incur huge costs if not managed properly.

Hmm, I'd venture that running tasks across different backends in this setup is a feature rather than a problem (think of load balancing, or of workflows whose different steps need to access data at different locations that cannot move, etc.). So how can we ensure that costs are minimized? In TEStribute, we are basically allowing users to decide whether to prioritize costs or runtime for a given task. TES then has a sidecar service (or possibly a future TES endpoint) that users can query with their task resource requirements and DRS URIs and which returns the estimated costs (sum of compute, data transfer and storage costs) and runtime (sum of waiting time, expected runtime and data transfer time) for each combination of input locations and TES instance (more details in this slide deck). If staying on a given TES for different tasks is MUCH cheaper than using another TES, compute will happen at that same TES (unless it's going to be much slower and the user specified that they value a lower runtime over higher costs). That costs-vs-runtime parameter in our implementation is a float between 0 (cost-optimized only) and 1 (runtime-optimized only). Putting a reasonable value there (not 1!) should ensure that costs won't skyrocket.

One other thing that leaves us a bit clueless is how to pass that extra info (like the costs/runtime param and whether they want to use a TES network at all) from the user to WES to TES, and how to implement it such that it won't become overly complicated for the user. But after all, this is something to figure out (at least partially) even in a (federated) WES-only world. Using TES would just add another layer that may help curb costs, balance loads and execute workflows in cases where data that cannot move resides at different places.
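
Just to make that concrete, one (purely hypothetical) way of passing the costs/runtime parameter along with the TES backends could be an extra field in the TesBackends draft above; the cost_runtime_preference name and shape are made up:

# Purely hypothetical extension of the TesBackends draft above;
# cost_runtime_preference is not part of any spec.
TesBackends:
  type: object
  properties:
    tes_urls:
      type: array
      items:
        type: string
    cost_runtime_preference:
      type: number
      minimum: 0
      maximum: 1
      description: >-
        0 = optimize for cost only, 1 = optimize for runtime only;
        intermediate values weight the two against each other.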

Anyway, a lot to be discussed and done here, still, but good to have a start. And we might be able to have a prototype at some point next year, if things go well.

I am also attaching this image to make the general design we envision a bit clearer:

[Image: 2021_12_09-elixir_cloud_schema]
