-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
80 clean up left over job status progress and result queues #63
80 clean up left over job status progress and result queues #63
Conversation
… TTL to job progress, status, and result queues
…eue message TTL feature to be turned off if required.
…to allow users to specify custom TTL
…e pamqp in pyproject.toml
|
||
if auto_dead_letter_after_ttl is not None: | ||
message_ttl = auto_dead_letter_after_ttl | ||
queue_ttl = auto_dead_letter_after_ttl * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed JOB_QUEUES_REMOVAL_BUFFER
(previously set too small) and set queue_ttl
to be two times of auto_dead_letter_after_ttl
Why: If the worker is running for a long time (assuming almost but slightly less than auto_dead_letter_after_ttl
time), the job result message might be returned to the queue where the queue is soon to expire. The message will be dropped silently due to the expired queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the rational correctly, then no SDK will have connected to this queue for a period of time. Lets call this time queue_offline_time
. If we set queue_ttl
== auto_dead_letter_after_ttl + 30 seconds
, then the queue will be deleted after the SDK has been offline in that period of time. We are assuming that a period of 48 hours (auto_dead_letter_after_ttl
) is sufficient to say that an error scenario has occurred and no SDK will retrieve any other messages from this queue. Wouldn't it then hold that auto_dead_letter_after_ttl
is sufficient time that the queue may be expired and any results are dropped silently?
I am making the assumption here that connecting to a queue is sufficient to not trigger the TTL. Lets say the queue TTL is set to 2 hours. Do you know if RabbitMQ also triggers the queue TTL and expires it if the queue has been empty for 2 hours but a consumer was connected in all that time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say SDK is never reconnected, and we want the job result message is at least (which is not the case here) dead lettered and logged by the orchestrator and queues to be cleaned up in the end. Then the issue raised here is that the job result message could actually be dropped silently due to the expired queue.
- SDK submits a job and declares queues (assume
auto_dead_letter_after_ttl = 1hr
, andqueue_ttl = auto_dead_letter_after_ttl + 30 seconds
) - Worker starts processing the job (assume taking 50 mins to complete)
- Meanwhile, SDK is offline 1 mins after the job submission due to errors. Queue TTL starts counting down. The queue will be expired in 1 hr + 30 sec
- The worker finishes the job (took 50 mins) and the result is returned to the queue. The job result will be dead lettered in 1 hr. However, the job queue is actually expired first before the result can be dead lettered.
"I am making the assumption here that connecting to a queue is sufficient to not trigger the TTL." -> You are right on this. Queue TTL is activated and starts counting down when there is no consumer (SDK offline). When the queue has a consumer (SDK reconnects), the Queue TTL is basically inactive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh gotcha! Does publishing the message to the queue not refresh the Queue TTL? Or are only consumer-side subscriptions counted towards refreshing a queue TTL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a misconception that republishing a message to the queue resets Queue TTL, but it turned out not the case...
Related to #80 Clean up left-over job status, progress & result queues