[FEATURE] Add option to provision synchronously #967

Open
dbwiddis opened this issue Nov 19, 2024 · 4 comments
Labels
enhancement New feature or request untriaged

Comments

@dbwiddis
Member

Is your feature request related to a problem?

Presently, when provisioning a workflow (via either the provision API or the create API with the provision param), the REST call returns immediately with a 200 (OK) response, but the caller must then poll the Workflow Status API to monitor the status of provisioning.
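For illustration, the polling that a caller has to write today might look like the sketch below. It is not taken from the plugin or any client library: `poll_until_done`, the `fetch_status` callable, and the set of terminal state names are assumptions made for the example, and `fetch_status` stands in for whatever HTTP call the caller makes against the Workflow Status API.

```python
import time
from typing import Callable

# Assumed terminal states; check the Workflow Status API docs for the real set.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def poll_until_done(fetch_status: Callable[[], dict],
                    interval_s: float = 1.0,
                    max_wait_s: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status() repeatedly until a terminal state or the deadline."""
    deadline = time.monotonic() + max_wait_s
    while True:
        status = fetch_status()
        if status.get("state") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"still {status.get('state')!r} after {max_wait_s}s")
        sleep(interval_s)
```

Every automation tool that provisions a workflow needs some version of this loop, including the choice of interval and deadline.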

This asynchronous execution of provisioning was intentional to provide the ability for a front end to obtain status throughout provisioning, possibly including a progress bar or similar, and because some provisioning processes take longer than the expected time for a REST response.

However, there are some use cases where the user may be willing to wait for a completed response rather than poll. This would be particularly useful in cases similar to the ML Commons Remote Model deployment, which provides such a synchronous API.

What solution would you like?

Add optional parameters to the create and provision workflow APIs to wait for the request to complete, with a timeout. Other OpenSearch APIs use wait_for_completion and wait_for_completion_timeout, so I'd suggest these names.

Alter the Provision Workflow Transport action, when this parameter is present, to wait to return until provisioning is complete (or the timeout elapses).
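A minimal sketch of the proposed wait-or-timeout behavior, written in Python rather than the plugin's Java so that it stays self-contained. Everything here is hypothetical: `handle_provision`, the shape of the returned dictionaries, the `timed_out` field, and the 30-second default are assumptions for illustration, not the transport action's actual design.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=4)


def handle_provision(provision: Callable[[], dict],
                     wait_for_completion: bool = False,
                     wait_for_completion_timeout: float = 30.0) -> dict:
    """Start provisioning; optionally block until it finishes or times out."""
    future = _executor.submit(provision)
    if not wait_for_completion:
        # Current behavior: acknowledge immediately, the caller polls for status.
        return {"state": "PROVISIONING"}
    try:
        # Proposed behavior: return the final result if it arrives in time.
        return future.result(timeout=wait_for_completion_timeout)
    except FutureTimeout:
        # Timed out: in this sketch the work keeps running in the background.
        return {"state": "PROVISIONING", "timed_out": True}
```

Whether a timeout should also cancel the in-flight work is a separate design question that this sketch leaves open.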

What alternatives have you considered?

A separate wrapper API that does the retries internally.

Do you have any additional context?

This would be a much simpler approach for automation tools, which would no longer need to code all the polling themselves.

@dbwiddis dbwiddis added enhancement New feature or request untriaged labels Nov 19, 2024
@arjunkumargiri

Thanks @dbwiddis, this approach will help simplify provisioning/automation of OpenSearch resources with minimal client-side code. A few follow-up questions:

  • What is the default timeout config? Will workflow be terminated if the timeout is breached?
  • Will resources be in a partially provisioned state in case of a timeout/failure?
  • Will the list of provisioned resources be included as part of provision API in case of wait_for_completion?

@dbwiddis
Member Author

  • What is the default timeout config? Will workflow be terminated if the timeout is breached?

Probably the standard OpenSearch default timeout for REST requests.

We can handle a timeout any way we want; cancelling the futures of a workflow in progress will probably suffice. Note that some workflow steps already in progress may continue even after a cancellation, but the overall workflow would stop executing.

  • Will resources be in a partially provisioned state in case of a timeout/failure?

Yes.

  • Will the list of provisioned resources be included as part of provision API in case of wait_for_completion?

It sounds reasonable to provide the same return value as the Workflow Status API.

@arjunkumargiri

Can we rollback partially provisioned resources in case of failure?

@dbwiddis
Member Author

Can we rollback partially provisioned resources in case of failure?

The deprovision API will do that.

We have not yet added an auto-rollback capability, which would be equally appropriate for a failed async provision.

Also, regarding cancellation, if we tried an immediate rollback it may not catch all the in-progress resources. For example, say we are registering and deploying a local model and then creating an agent. Assume registering completes successfully but the deploy step times out because it's a very large model. Registration would create the model resource. Upon failure (the timeout), all the futures would be cancelled, meaning the agent step would never run. However, the model deployment would probably complete eventually. If we tried to deprovision immediately we'd only see the registered model. (I'm not sure what happens if we try to delete a model which is in the process of deploying.) If we wait for the step to complete we might have it deployed. In that case you'd have both the register and deploy "resources" and you could successfully deprovision with an undeploy/delete.

This is just one simple example; it can get more complex, which is why we haven't gotten to it yet.
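The register/deploy/agent scenario above can be simulated with a small sketch. This is only an analogy using Python's `concurrent.futures`; the plugin is written in Java and its futures may cancel differently. The step names, durations, and the `run_with_timeout` helper are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

resources = []            # resources created so far, in creation order
lock = threading.Lock()


def step(name: str, duration_s: float) -> None:
    """Hypothetical workflow step: runs for duration_s, then records a resource."""
    time.sleep(duration_s)
    with lock:
        resources.append(name)


def run_with_timeout(timeout_s: float):
    # One worker so the steps run strictly in order, like a dependency chain.
    executor = ThreadPoolExecutor(max_workers=1)
    futures = {
        "register_model": executor.submit(step, "register_model", 0.01),
        "deploy_model": executor.submit(step, "deploy_model", 0.5),
        "create_agent": executor.submit(step, "create_agent", 0.01),
    }
    wait(futures.values(), timeout=timeout_s)   # stop waiting after timeout_s
    # cancel() only succeeds for steps that have not started yet.
    cancelled = {name: f.cancel() for name, f in futures.items()}
    with lock:
        seen_at_timeout = list(resources)       # what an immediate rollback sees
    executor.shutdown(wait=True)                # let the in-flight step finish
    with lock:
        seen_later = list(resources)            # what a delayed rollback sees
    return cancelled, seen_at_timeout, seen_later
```

With a timeout that falls while the deploy step is running (for example `run_with_timeout(0.2)`), only the agent step is actually cancelled. The snapshot taken at the timeout contains just the registered model, while the later snapshot also contains the deployed model, which is the gap an immediate rollback would miss.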
