
67 refactor implementation of left over job queues clean up #80

Merged

Conversation

cwang39403 (Contributor)

Related to issues: #67, Project-OMOTES/omotes-system#80

Removed the job result dead letter queue, dead letter exchange, and message TTL implementation. This avoids the situation where the SDK client reconnects after the job result message has already been dead-lettered, which would leave the SDK client waiting indefinitely.
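
For illustration, a minimal sketch of the difference (not the actual SDK code; the queue name, TTL value, and the exact aio_pika usage are illustrative assumptions):

```python
import asyncio

import aio_pika

# Illustrative sketch only -- queue name, TTL value and aio_pika usage are
# assumptions, not the actual omotes-sdk-python code.
JOB_RESULT_QUEUE = "jobs.<job-uuid>.result"  # hypothetical name
QUEUE_TTL_MS = 48 * 60 * 60 * 1000  # hypothetical queue TTL


async def declare_result_queue() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
    async with connection:
        channel = await connection.channel()

        # Previous approach (roughly): per-message TTL + dead letter exchange.
        # A result could be dead-lettered just before the SDK reconnects,
        # leaving the reconnected client waiting on a queue that will never
        # receive the result again.
        #
        # await channel.declare_queue(
        #     JOB_RESULT_QUEUE,
        #     durable=True,
        #     arguments={
        #         "x-message-ttl": QUEUE_TTL_MS,
        #         "x-dead-letter-exchange": "jobs.dead-letter",
        #     },
        # )

        # New approach (roughly): a queue-level TTL only ("x-expires").
        # RabbitMQ deletes the whole queue once it has been unused (no
        # consumers) for the TTL, so a reconnecting client either finds the
        # queue with the result still in it, or finds no queue at all and can
        # abort with an error instead of waiting forever.
        await channel.declare_queue(
            JOB_RESULT_QUEUE,
            durable=True,
            arguments={"x-expires": QUEUE_TTL_MS},
        )


asyncio.run(declare_result_queue())
```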

Tested manually with the following scenarios:

1. Happy path

  1. The SDK client submits a job.
  2. The job is finished, returned to the queue, and consumed by the SDK client.
  3. All job queues are removed as intended.

2. Immediate reconnect

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for 2 seconds (Queue TTL is activated).
  3. The SDK client attempts to reconnect, first checking if queues still exist (see the sketch after the scenario list).
  4. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
  5. The job is finished, returned to the queue, and consumed by the SDK client.
  6. All job queues are removed as intended.

3. Reconnect after the job is finished

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for 15 seconds (Queue TTL is activated).
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The SDK client attempts to reconnect, first checking if queues still exist.
  5. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
  6. The SDK client consumes the job result message and removes only the job result queue. Job progress and status queues remain in RabbitMQ.
  7. The job progress and status queues are removed after reaching their TTL.

4. Never reconnect

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline forever.
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The job messages are dropped silently and queues are removed after reaching queue TTL.

5. Reconnect after queues have already expired

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for a long period (longer than the provided TTL).
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The job messages are dropped silently and queues are removed after reaching queue TTL.
  5. The SDK client attempts to reconnect, first checking if queues still exist.
  6. The queues do not exist, reconnection is aborted and an error message is thrown to the SDK client.
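
For reference, a minimal sketch of the "check if queues still exist" step from scenarios 2, 3 and 5 (illustrative only, not the actual SDK implementation; a passive declare asks the broker whether the queue exists without creating it, and the exact exception class may differ per aio_pika version):

```python
import aio_pika
from aio_pika.exceptions import ChannelClosed


async def queue_exists(connection: aio_pika.RobustConnection, queue_name: str) -> bool:
    """Return True if the queue still exists on the broker, False if it has expired."""
    channel = await connection.channel()
    try:
        # passive=True only checks for existence; it does not (re)declare.
        await channel.declare_queue(queue_name, passive=True)
        return True
    except ChannelClosed:
        # The broker answers a passive declare of a missing queue with
        # 404 NOT_FOUND and closes the channel.
        return False
    finally:
        if not channel.is_closed:
            await channel.close()
```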

@lfse-slafleur (Member) left a comment


Great manual testing to ensure all edge cases are understood/working. Thanks for a great PR!

@lfse-slafleur (Member)

> 3. Reconnect after the job is finished
>
>    1. The SDK client submits a job.
>    2. The job starts running, but the SDK client goes offline for 15 seconds (Queue TTL is activated).
>    3. The job is finished, returned to the queue, and waits to be consumed.
>    4. The SDK client attempts to reconnect, first checking if queues still exist.
>    5. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
>    6. **The SDK client consumes the job result message and removes only the job result queue. Job progress and status queues remain in RabbitMQ.**
>    7. **The job progress and status queues are removed after reaching their TTL.**

@cwang39403 Apologies for misunderstanding your question here! I believe last week you intended to check whether it is expected that the job progress & status queues remain while the result queue is deleted in this case. I originally understood you to be asking whether it is okay for the job progress and status queues to be deleted later than the job result queue. However, the behavior you found does appear odd and warrants some investigation. I would expect all 3 queues to be deleted by the SDK after receiving a job result, regardless of whether it reconnected.

Shall we open a ticket for this? Or I am misunderstanding again? Thanks for your thoroughness!

@cwang39403 (Contributor, Author)

@lfse-slafleur No worries, I didn't explain it clearly enough during our weekly standup. Although in scenario 3 the job progress and status queues will eventually be removed once they reach their TTL (which fulfills the cleanup requirement), the fact that they are not deleted immediately after the SDK consumes the job result message is indeed what I tried to point out.

My initial guess is that there are some competing operations causing this outcome (reconnecting to the queues vs. removing the job result queue after the job result message is consumed), but I am not entirely sure. Yes, could you create a ticket for this?

cwang39403 merged commit 6a91e56 into main on Oct 28, 2024
14 checks passed
cwang39403 deleted the 67-refactor-implementation-of-left-over-job-queues-clean-up branch on October 28, 2024 at 13:00
@cwang39403 (Contributor, Author)

@lfse-slafleur I had a further look into the behavior of scenario 3.

Looking at the logs: when the job result message is being consumed and the job result queue is being removed (which also triggers `_autodelete_progres_status_queues_on_result` in the meantime), the job progress and status queues are still being redeclared*. This explains why the progress and status queues are not deleted in this case.

*When we say "reconnect to the queue", the SDK actually redeclares the queue: `connect_to_submitted_job()` -> `declare_queue_and_add_subscription()` -> `_declare_queue()`, and this leads to some delay.
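
As a toy illustration of that interleaving (a simplified simulation, not SDK code):

```python
import asyncio

# Toy simulation of the race described above: the result handler deletes the
# progress/status queues while the reconnect path is still busy redeclaring
# them, so the late redeclare brings them back.

async def reconnect_redeclare(broker: set) -> None:
    await asyncio.sleep(0.2)  # redeclaration has some delay
    broker.update({"progress", "status"})  # late redeclare resurrects the queues

async def result_handler(broker: set) -> None:
    # Result is consumed: remove the result queue and (via the autodelete
    # handler) the progress and status queues.
    broker.difference_update({"result", "progress", "status"})

async def main() -> None:
    broker = {"result", "progress", "status"}
    await asyncio.gather(reconnect_redeclare(broker), result_handler(broker))
    print(broker)  # {'progress', 'status'}: they survive until their own TTL

asyncio.run(main())
```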


If we delay the execution of `auto_disconnect_on_result_handler` for just a short period inside `JobSubmissionCallbackHandler`, the status and progress queues should be properly declared first and then deleted.


Curious about your thoughts.

@lfse-slafleur (Member)

> Curious about your thoughts.

@cwang39403 Good find! We can probably fix this by switching the declarations around. You can put the declaration at https://github.com/Project-OMOTES/omotes-sdk-python/blob/main/src/omotes_sdk/omotes_interface.py#L266 at the bottom; that way the other queues are guaranteed to exist before the result queue is connected. We should also document in the code that the order of declarations is important and why.
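
A rough sketch of the suggested ordering (simplified stand-ins, not the actual omotes_interface.py code):

```python
import asyncio


async def declare_queue_and_subscribe(queue_name: str) -> None:
    # Stand-in for the SDK's declare_queue_and_add_subscription(); the real
    # implementation declares the queue on RabbitMQ and registers a callback.
    print(f"declared and subscribed to {queue_name}")


async def connect_to_submitted_job(job_id: str) -> None:
    # NOTE: the order of these declarations is important. The result queue is
    # declared last, because consuming a result triggers deletion of the
    # progress and status queues; declaring them first guarantees they already
    # exist (and are not still "in flight") when that deletion runs.
    await declare_queue_and_subscribe(f"jobs.{job_id}.progress")
    await declare_queue_and_subscribe(f"jobs.{job_id}.status")
    await declare_queue_and_subscribe(f"jobs.{job_id}.result")


asyncio.run(connect_to_submitted_job("1234-abcd"))
```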

@cwang39403 (Contributor, Author)

@lfse-slafleur Yes, good suggestion! I implemented it and tested scenario 3 again, and it works properly now!

Successfully merging this pull request may close these issues:

Investigate the scenario where SDK reconnects after the job result is dead lettered but the queue persists