
67 refactor implementation of left over job queues clean up #80

Merged

Conversation

cwang39403 (Contributor)

Related to issues: #67, Project-OMOTES/omotes-system#80

Removed the job result dead letter queue, dead letter exchange, and message TTL implementation. This avoids the situation where the SDK client reconnects after the job result message has already been dead-lettered, which would leave the SDK client waiting indefinitely.
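
For illustration, a minimal sketch of the difference (not the actual SDK code; the queue name, TTL value, and the exact aio_pika usage are illustrative assumptions):

```python
import asyncio

import aio_pika

# Illustrative sketch only -- queue name, TTL value and aio_pika usage are
# assumptions, not the actual omotes-sdk-python code.
JOB_RESULT_QUEUE = "jobs.<job-uuid>.result"  # hypothetical name
QUEUE_TTL_MS = 48 * 60 * 60 * 1000  # hypothetical queue TTL


async def declare_result_queue() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
    async with connection:
        channel = await connection.channel()

        # Previous approach (roughly): per-message TTL + dead letter exchange.
        # A result could be dead-lettered just before the SDK reconnects,
        # leaving the reconnected client waiting on a queue that will never
        # receive the result again.
        #
        # await channel.declare_queue(
        #     JOB_RESULT_QUEUE,
        #     durable=True,
        #     arguments={
        #         "x-message-ttl": QUEUE_TTL_MS,
        #         "x-dead-letter-exchange": "jobs.dead-letter",
        #     },
        # )

        # New approach (roughly): a queue-level TTL only ("x-expires").
        # RabbitMQ deletes the whole queue once it has been unused (no
        # consumers) for the TTL, so a reconnecting client either finds the
        # queue with the result still in it, or finds no queue at all and can
        # abort with an error instead of waiting forever.
        await channel.declare_queue(
            JOB_RESULT_QUEUE,
            durable=True,
            arguments={"x-expires": QUEUE_TTL_MS},
        )


asyncio.run(declare_result_queue())
```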

Tested manually with the following scenarios:

1. Happy path

  1. The SDK client submits a job.
  2. The job is finished, returned to the queue, and consumed by the SDK client.
  3. All job queues are removed as intended.

2. Immediate reconnect

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for 2 seconds (Queue TTL is activated).
  3. The SDK client attempts to reconnect, first checking if queues still exist (see the sketch after the scenario list).
  4. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
  5. The job is finished, returned to the queue, and consumed by the SDK client.
  6. All job queues are removed as intended.

3. Reconnect after the job is finished

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for 15 seconds (Queue TTL is activated).
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The SDK client attempts to reconnect, first checking if queues still exist.
  5. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
  6. The SDK client consumes the job result message and removes only the job result queue. Job progress and status queues remain in RabbitMQ.
  7. The job progress and status queues are removed after reaching their TTL.

4. Never reconnect

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline forever.
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The job messages are dropped silently and queues are removed after reaching queue TTL.

5. Reconnect after queues have already expired

  1. The SDK client submits a job.
  2. The job starts running, but the SDK client goes offline for a long period (longer than the provided TTL).
  3. The job is finished, returned to the queue, and waits to be consumed.
  4. The job messages are dropped silently and queues are removed after reaching queue TTL.
  5. The SDK client attempts to reconnect, first checking if queues still exist.
  6. The queues do not exist, reconnection is aborted and an error message is thrown to the SDK client.
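
For reference, a minimal sketch of the "check if queues still exist" step from scenarios 2, 3 and 5 (illustrative only, not the actual SDK implementation; a passive declare asks the broker whether the queue exists without creating it, and the exact exception class may differ per aio_pika version):

```python
import aio_pika
from aio_pika.exceptions import ChannelClosed


async def queue_exists(connection: aio_pika.RobustConnection, queue_name: str) -> bool:
    """Return True if the queue still exists on the broker, False if it has expired."""
    channel = await connection.channel()
    try:
        # passive=True only checks for existence; it does not (re)declare.
        await channel.declare_queue(queue_name, passive=True)
        return True
    except ChannelClosed:
        # The broker answers a passive declare of a missing queue with
        # 404 NOT_FOUND and closes the channel.
        return False
    finally:
        if not channel.is_closed:
            await channel.close()
```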

@lfse-slafleur (Member) left a comment


Great manual testing to ensure all edge cases are understood/working. Thanks for a great PR!

@lfse-slafleur (Member)

> 3. Reconnect after the job is finished
>
>    1. The SDK client submits a job.
>    2. The job starts running, but the SDK client goes offline for 15 seconds (Queue TTL is activated).
>    3. The job is finished, returned to the queue, and waits to be consumed.
>    4. The SDK client attempts to reconnect, first checking if queues still exist.
>    5. The queues still exist, the SDK client reconnects to the queues (Queue TTL is deactivated).
>    6. **The SDK client consumes the job result message and removes only the job result queue. Job progress and status queues remain in RabbitMQ.**
>    7. **The job progress and status queues are removed after reaching their TTL.**

@cwang39403 Apologies for misunderstanding your question here! I believe last week you intended to check whether it is expected that the job progress & status queues remain while the result queue is deleted in this case. I originally understood you to be asking whether it is okay for the job progress and status queues to be deleted later than the job result queue. However, the behavior you found does appear odd and warrants some investigation. I would expect all 3 queues to be deleted by the SDK after receiving a job result, regardless of whether it reconnected.

Shall we open a ticket for this? Or I am misunderstanding again? Thanks for your thoroughness!

@cwang39403 (Contributor, Author)

@lfse-slafleur No worries, I didn't explain it clearly enough during our weekly standup. Although in scenario 3 the job progress and status queues will eventually be removed once they reach their TTL (which fulfills the cleanup requirement), the fact that they are not deleted immediately after the SDK consumes the job result message is indeed what I tried to point out.

My initial guess is that there are some competing operations causing this outcome (reconnecting to the queues vs. removing the job result queue after the job result message is consumed), but I am not entirely sure. Yes, could you create a ticket for this?

cwang39403 merged commit 6a91e56 into main on Oct 28, 2024
14 checks passed
cwang39403 deleted the 67-refactor-implementation-of-left-over-job-queues-clean-up branch on October 28, 2024 at 13:00
@cwang39403 (Contributor, Author)

@lfse-slafleur I had a further look into the behavior of scenario 3.

Looking at the logs: when the job result message is being consumed and the job result queue is being removed (which also triggers `_autodelete_progres_status_queues_on_result` in the meantime), the job progress and status queues are still being redeclared*. This explains why the progress and status queues are not deleted in this case.

*When we say "reconnect to the queue", the SDK actually redeclares the queue: `connect_to_submitted_job()` -> `declare_queue_and_add_subscription()` -> `_declare_queue()`, and this leads to some delay.
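
As a toy illustration of that interleaving (a simplified simulation, not SDK code):

```python
import asyncio

# Toy simulation of the race described above: the result handler deletes the
# progress/status queues while the reconnect path is still busy redeclaring
# them, so the late redeclare brings them back.

async def reconnect_redeclare(broker: set) -> None:
    await asyncio.sleep(0.2)  # redeclaration has some delay
    broker.update({"progress", "status"})  # late redeclare resurrects the queues

async def result_handler(broker: set) -> None:
    # Result is consumed: remove the result queue and (via the autodelete
    # handler) the progress and status queues.
    broker.difference_update({"result", "progress", "status"})

async def main() -> None:
    broker = {"result", "progress", "status"}
    await asyncio.gather(reconnect_redeclare(broker), result_handler(broker))
    print(broker)  # {'progress', 'status'}: they survive until their own TTL

asyncio.run(main())
```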


If we delay the execution of `auto_disconnect_on_result_handler` for just a short period inside `JobSubmissionCallbackHandler`, the status and progress queues should be properly declared first and then deleted.


Curious about your thoughts.

@lfse-slafleur (Member)

> Curious about your thoughts.

@cwang39403 Good find! We can probably fix this by switching the declarations around. You can put the declaration at https://github.com/Project-OMOTES/omotes-sdk-python/blob/main/src/omotes_sdk/omotes_interface.py#L266 at the bottom; that way the other queues are guaranteed to exist before the result queue is connected. We should also document in the code that the order of declarations is important and why.
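
A rough sketch of the suggested ordering (simplified stand-ins, not the actual omotes_interface.py code):

```python
import asyncio


async def declare_queue_and_subscribe(queue_name: str) -> None:
    # Stand-in for the SDK's declare_queue_and_add_subscription(); the real
    # implementation declares the queue on RabbitMQ and registers a callback.
    print(f"declared and subscribed to {queue_name}")


async def connect_to_submitted_job(job_id: str) -> None:
    # NOTE: the order of these declarations is important. The result queue is
    # declared last, because consuming a result triggers deletion of the
    # progress and status queues; declaring them first guarantees they already
    # exist (and are not still "in flight") when that deletion runs.
    await declare_queue_and_subscribe(f"jobs.{job_id}.progress")
    await declare_queue_and_subscribe(f"jobs.{job_id}.status")
    await declare_queue_and_subscribe(f"jobs.{job_id}.result")


asyncio.run(connect_to_submitted_job("1234-abcd"))
```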

@cwang39403 (Contributor, Author)

@lfse-slafleur Yes, good suggestion! I implemented it and tested scenario 3 again, and it works properly now!

Successfully merging this pull request may close these issues:

Investigate the scenario where SDK reconnects after the job result is dead lettered but the queue persists