## Issue

Currently, recursive delete operations such as org, space, and app delete, which implicitly delete service bindings and/or service instances, fail if one of the service-related deletions is handled asynchronously by the service broker. This is not optimal, as users have to trigger the deletion of the parent resource again. In addition, users currently get the following error message, which does not really reveal what is going on:

> An operation for the service binding between app myapp and service instance myinstance is in progress.
## Context

[provide more detailed introduction]

## Steps to Reproduce
For example, delete an app that is bound to a service instance. The broker should answer unbinding requests with a 202 (Accepted).
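For local experimentation, a minimal stand-in broker that answers every unbind (DELETE) request with a 202 might look like the sketch below. The endpoint path follows the Open Service Broker API convention; everything else (handler name, operation id) is illustrative, not part of any real broker.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeBrokerHandler(BaseHTTPRequestHandler):
    """Answers every unbind/deprovision DELETE with 202 Accepted."""

    def do_DELETE(self):
        # OSB brokers signal an asynchronous operation with 202 plus an
        # operation identifier the platform can poll on.
        body = json.dumps({"operation": "task-1"}).encode()
        self.send_response(202)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def start_fake_broker(port=0):
    """Start the fake broker on a free port and return the server."""
    server = HTTPServer(("127.0.0.1", port), FakeBrokerHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing a dev Cloud Controller (or plain curl) at such a stub is enough to drive every unbind into the asynchronous code path described below.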
## Current result

The following tables describe the resulting behaviour for different resources and broker responses. In general, all recursive deletions fail if a sub-resource is deleted asynchronously. The behaviour is the same for service instances and service bindings.
### Delete service instance when service binding present

| Broker response to unbind | Result |
| --- | --- |
| 202 | Starts polling the service binding last operation and sets the service instance delete job and the service instance last operation to failed with the message "delete could not be completed: An operation for the service binding between app myapp and service instance myinstance is in progress." |
| 500 | Both the service instance and service binding delete operations fail. |
| 200 | The binding is gone immediately; then a delete service instance request is sent to the broker and either the instance is gone too or polling starts. |
### Delete app when bound to service

| Broker response to unbind | Result |
| --- | --- |
| 202 | Starts polling the service binding last operation and sets the app delete job to failed with the message "Job (3d1d051f-6c94-47ab-85e2-dec27e0db75a) failed: An operation for the service binding between app myapp and service instance myinstance is in progress." |
| 500 | The app delete job fails and the service binding operation state is set to failed. Error message: "Job (17662f61-4111-4712-91da-d3ab11629ba7) failed: Service broker failed to delete service binding for instance myinstance: The service broker returned an invalid response. Status Code: 500 Internal Server Error, Body: {"state":"in progress"}" |
| 200 | The service binding gets deleted, the app gets deleted, and the app delete job is set to COMPLETE. |
### Delete space which contains a service binding

| Broker response to unbind | Result |
| --- | --- |
| 202 | Starts polling the service binding last operation and sets the space delete job to failed, as well as the service instance last operation, with the message "Job (5b91523b-fe11-4cdb-bbc9-063b65fa8dee) failed: Deletion of space myspace failed because one or more resources within could not be deleted. An operation for the service binding between app myapp and service instance myinstance is in progress." |
| 500 | The space delete job fails and the service binding and service instance last operation states are set to failed. Error message: "Job (ebce0407-9560-44fb-9b39-7b06f35edb4f) failed: Deletion of space d071102 failed because one or more resources within could not be deleted. Service broker failed to delete service binding for instance myinstance: The service broker returned an invalid response. Status Code: 500 Internal Server Error, Body: {"state":"in progress"}" |
| 200 | The service binding, service instance, app, and space get deleted, and the space delete job is set to COMPLETE. |
### Delete org, which contains a space which contains a service binding

| Broker response to unbind | Result |
| --- | --- |
| 202 | Starts polling the service binding last operation and sets the org delete job to failed, as well as the service instance last operation, with the message "Job (0daca787-a8b5-4433-967c-3b0c8d2e1798) failed: Deletion of organization d071102 failed because one or more resources within could not be deleted. Deletion of space d071102 failed because one or more resources within could not be deleted. An operation for the service binding between app myapp and service instance myinstance is in progress." |
| 500 | The org delete job fails and the service binding and service instance last operation states are set to failed. Error message: "Job (5c7ffdc1-7bb3-4a95-96da-33d19c5d4e79) failed: Deletion of organization d071102 failed because one or more resources within could not be deleted. Deletion of space d071102 failed because one or more resources within could not be deleted. Service broker failed to delete service binding for instance myinstance: The service broker returned an invalid response. Status Code: 500 Internal Server Error, Body: {"state":"in progress"}" |
| 200 | Everything gets deleted and the organization delete job is set to COMPLETE. |
## Further findings

All recursive deletions trigger the deletion of all sub-resources (unless they depend on each other). For example, an app delete triggers the deletion of all service bindings of that app. If one binding fails to delete or is being deleted asynchronously, the job still continues to trigger the deletion of all other bindings. Service instances that still have bindings in deletion won't be deleted.
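The behaviour above can be sketched roughly as follows. This is a simplification, not Cloud Controller code: the `Resource` class and error strings are illustrative stand-ins for apps, bindings, spaces, and so on.

```python
class Resource:
    """Illustrative stand-in for an app, binding, space, etc."""

    def __init__(self, name, children=(), deletion_in_progress=False):
        self.name = name
        self.children = list(children)
        self.deletion_in_progress = deletion_in_progress

    def delete(self):
        if self.deletion_in_progress:
            raise RuntimeError(f"{self.name}: deletion is in progress")


def delete_recursively(resource):
    """Trigger deletion of every sub-resource; keep going on failures."""
    errors = []
    for child in resource.children:
        errors.extend(delete_recursively(child))
    if errors:
        # A parent is never deleted while a sub-resource failed
        # or is still being deleted asynchronously.
        errors.append(f"could not delete {resource.name}")
        return errors
    try:
        resource.delete()
    except RuntimeError as exc:
        errors.append(str(exc))
    return errors
```

Note how one stuck binding does not stop the other bindings from being deleted, but it does keep the app (and transitively the space and org) from being deleted.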
## Expected result

Ideally, the recursive delete operations would be able to handle asynchronously deleted sub-resources. See the next section for some ideas on how to achieve this.
## Possible Fix

### Re-enqueue recursive jobs instead of setting them to failed
The deletion jobs could be re-enqueued, similar to the polling mechanism for service-related operations. On each run, the job would check whether the sub-resources have been deleted successfully and, if so, delete the parent resource.
Some thoughts on this:

- We would probably need a "locking mechanism" to prevent new resources from being created in an org, space, etc. that is being deleted.
- If an asynchronous deletion fails, the job should remember that it has already tried to delete this resource; otherwise this might become an endless loop.
- A parameter that allows configuring a maximum timeout for such jobs would be good.
- When the job fails because sub-resources could not be deleted, it would be good to show the original error message explaining why the deletion failed.
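A rough sketch of the re-enqueue idea, including the retry limit and the "already tried" bookkeeping from the list above. All names (`DeleteJob`, `MAX_ATTEMPTS`, the queue shape) are hypothetical, not Cloud Controller internals:

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 10  # stand-in for a configurable maximum timeout


@dataclass
class SubResource:
    deleted: bool = False
    delete_failed: bool = False
    last_error: str = ""


@dataclass
class Parent:
    deleted: bool = False


@dataclass
class DeleteJob:
    parent: Parent
    sub_resources: list
    attempts: int = 0
    state: str = "processing"
    error: str = ""


def run_delete_job(job, queue):
    """One execution of a recursive delete job that re-enqueues itself."""
    pending = [r for r in job.sub_resources if not r.deleted]

    # Remember sub-resources whose deletion already failed so the job does
    # not loop forever, and give up after too many attempts.
    failed = [r for r in pending if r.delete_failed]
    if failed or (pending and job.attempts >= MAX_ATTEMPTS):
        job.state = "failed"
        # Surface the original error messages instead of a generic one.
        job.error = "; ".join(r.last_error or "timed out" for r in pending)
        return

    if pending:
        job.attempts += 1
        queue.append(job)  # re-enqueue and check again on the next run
        return

    job.parent.deleted = True
    job.state = "complete"
```

The job only touches the parent once every sub-resource is confirmed gone, which mirrors how the polling of service-related operations already defers work to a later run.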
### Delete parent resource immediately and continue asynchronous deletion of sub-resources in the background
If a service broker responds with a 202 to an unbind or deprovision request, we can assume that the broker will take care of the deletion, and at least "delete" the resource from the user's perspective. The CC could then continue polling the last operation state from the broker. If the deletion fails, orphan mitigation could take over.
Some thoughts on this:

- In the worst case, the deletion of the service binding times out after the max_poll_intervall. How to proceed with the service instance then?
- What if a user wants to create resources with the same names again after the CC stated they have been deleted, but in reality the deletion is still going on in the background?
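The second proposal can be sketched like this. The broker and binding classes are illustrative fakes, and the background queue stands in for whatever worker mechanism would actually run the polling; none of this is real Cloud Controller code.

```python
def handle_unbind_response(binding, broker, background_queue):
    """Sketch: a 202 removes the binding from the user's perspective while
    the real deletion continues at the broker in the background."""
    status = broker.unbind(binding)
    if status in (200, 202):
        binding.destroy()  # gone from the user's perspective either way
    if status == 202:
        # Keep polling the broker's last operation in the background.
        background_queue.append(lambda: poll_last_operation(binding, broker))


def poll_last_operation(binding, broker):
    state = broker.last_operation(binding)
    if state == "failed":
        broker.orphan_mitigation(binding)  # let orphan mitigation take over
    return state


class Binding:
    def __init__(self):
        self.exists = True

    def destroy(self):
        self.exists = False


class FakeAsyncBroker:
    """Illustrative broker that unbinds asynchronously, then fails."""

    def __init__(self):
        self.mitigated = False

    def unbind(self, binding):
        return 202

    def last_operation(self, binding):
        return "failed"

    def orphan_mitigation(self, binding):
        self.mitigated = True
```

This makes the name-reuse question from the list above concrete: once `binding.destroy()` has run, nothing stops a user from creating a same-named binding while the broker is still deleting the old one.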
## Related issues

- `/v3/app` delete does not wait until service binding are unbound #3333 - describes the behaviour for app delete already