More aggressive refetching when full #664
Conversation
Jobs can often be stuck on queues for a long while when jobs are sporadically enqueued in large batches. Prior to this change, jobs would be fetched every `FetchPollInterval` (default 1s) unless new jobs were enqueued. This change reduces that to `FetchCooldown` (default 100ms) when a full set of jobs is returned, implying that there may be more in the queue.
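For reference, both intervals mentioned above are client-level settings. Here's a minimal Go sketch of setting them on `river.Config`; the rest of the client configuration (driver, queues, workers) is omitted, and the values shown are simply the defaults quoted above:

```go
package main

import (
	"time"

	"github.com/riverqueue/river"
)

// exampleConfig shows only the two fetch-related settings discussed in this
// PR; everything else a real client needs is left out for brevity.
func exampleConfig() *river.Config {
	return &river.Config{
		// Minimum spacing between fetches once a fetch has been triggered
		// (default 100ms); the more aggressive refetching is still bounded by it.
		FetchCooldown: 100 * time.Millisecond,
		// How often to poll for jobs when nothing else triggers a fetch (default 1s).
		FetchPollInterval: time.Second,
	}
}

func main() { _ = exampleConfig() }
```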
@gaffneyc while looking more closely at this, I decided to adjust the approach from #663 to hopefully make it a little more efficient. I realized that there's a single producer goroutine responsible both for deciding to fetch new jobs and for removing completed ones from the list of running ones. I also realized that the benchmarks in #652 and the solution proposed in #653 are going to be highly dependent on whether or not any jobs complete before the next attempted fetch. If the jobs are synthetic and ~instantaneous, there will always be slots available on the next attempt, but if they're realistic, odds are the fetch limiter may fire again before we have any slots available, and then we're back to polling!

I believe the approach in this PR is an improvement because it relies on a single boolean in the producer to let us delay triggering the fetch limiter until after we've freed up at least one worker slot. This means that the next triggered fetch will always have room for at least one job, whereas in #663 that's not the case. The boolean is always read and written by a single goroutine, so we can avoid adding mutex overhead. It gets set to true either if the fetch limiter fired but we were already full, or if we fetched a full batch of jobs on the last fetch attempt. In either case it's a good signal that it's probably time to try to fetch jobs again as soon as we have space for them.

If possible, I'd love for you to run your benchmarks again with this branch. My guess is that the benchmarked throughput improvements will be the same (no-op jobs are always very fast), but in real-world use cases this change should be an improvement over your original proposal.
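To illustrate the idea, here's a heavily simplified, hypothetical sketch of the flag handling described above. Only the `fetchWhenSlotsAreAvailable` name comes from the actual diff; the surrounding structure and every other name are illustrative, not River's real producer code:

```go
package main

// Hypothetical sketch of the producer's flag handling. Because everything here
// runs on the single producer goroutine, the boolean needs no mutex.
type producer struct {
	maxWorkers                 int
	numJobsActive              int
	fetchWhenSlotsAreAvailable bool
}

// fetchTriggered runs when the fetch limiter fires (or when the flag below is
// consumed after a slot opens up).
func (p *producer) fetchTriggered(limit int) {
	if p.numJobsActive >= p.maxWorkers {
		// Every worker slot is occupied: skip the fetch query entirely and
		// remember to fetch again as soon as a slot opens up.
		p.fetchWhenSlotsAreAvailable = true
		return
	}

	if numFetched := p.fetch(limit); numFetched == limit {
		// A full batch implies more jobs are probably waiting in the queue,
		// so fetch again once there's room for them.
		p.fetchWhenSlotsAreAvailable = true
	}
}

// jobCompleted runs when a job finishes and frees a worker slot.
func (p *producer) jobCompleted(limit int) {
	p.numJobsActive--
	if p.fetchWhenSlotsAreAvailable {
		p.fetchWhenSlotsAreAvailable = false
		p.fetchTriggered(limit)
	}
}

// fetch runs the fetch query (still rate limited by FetchCooldown in the real
// client) and returns how many jobs it pulled; stubbed out here.
func (p *producer) fetch(limit int) int { return 0 }

func main() {}
```

The key property is that the flag is only consumed when a slot has just been freed, so a triggered fetch always has room for at least one job.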
Tune the client to be more aggressive about fetching when it either just fetched a full batch of jobs, or when it skipped its previously triggered fetch because it was already full. Rather than immediately calling the fetch limiter as in the previous commit, note this situation with a boolean so that we can trigger the fetch limiter _after_ a worker slot has opened up.

This should bring more consistent throughput to poll-only mode and in cases where there is a backlog of existing jobs but new ones aren't being actively inserted.

This will result in increased fetch load on many installations, with the benefit of increased throughput. As before, `FetchCooldown` still limits how frequently these fetches can occur on each client and can be increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots available.
Benchmark output ("Running", "Before", "After") omitted.
// Fetch returned the maximum number of jobs that were requested,
// implying there may be more in the queue. Trigger another fetch when
// slots are available.
p.fetchWhenSlotsAreAvailable = true |
It's not clear to me how this is triggered only when a full batch has been returned. Should it only be set to true when `len(results.jobs) == limit`?
Ugh, no, I totally missed that clause 🤦♂️ Fix incoming.
Should be fixed in #668 / v0.14.1. Thanks for catching this 🙏
I trust your knowledge of how River works better than mine, and I didn't look into all the details, so I'm sure there is something I missed. It is going to be at least a few days until I can cycle back here and test it on our code or run benchmarks.
In ce141fa / #664, the client was tuned in an attempt to make it more aggressive about fetching more jobs when it has just fetched a full batch of jobs. Unfortunately, a conditional was missed and it ended up fetching aggressively any time a single job was fetched. This adds the correct conditional so that the `fetchWhenSlotsAreAvailable` flag only gets set when (a) the previous fetch was full, or (b) the previous fetch was skipped because the client was already full of jobs.
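Concretely, the corrected check would look roughly like the sketch below, combining the reviewed hunk above with the condition suggested in the review; `results` and `limit` are assumed to be the producer's local variables on that code path rather than names confirmed by the source:

```go
if len(results.jobs) == limit {
	// Fetch returned the maximum number of jobs that were requested,
	// implying there may be more in the queue. Trigger another fetch when
	// slots are available.
	p.fetchWhenSlotsAreAvailable = true
}
```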
* make tidy
* fix over-aggressive fetching (commit message above)
* prepare v0.14.1
* Immediately fetch another batch when a full set of jobs is returned (commit message above)
* trigger fetch when worker freed up (commit message above)

Co-authored-by: Chris Gaffney <[email protected]>
Building on #663 and the benchmarks in #652, this tunes the client to be more aggressive about fetching when it either just fetched a full batch of jobs, or when it skipped its previously triggered fetch because it was already full. Rather than immediately calling the fetch limiter as in the previous commit, note this situation with a boolean so that we can trigger the fetch limiter after a worker slot has opened up.
This should bring more consistent throughput to poll-only mode and in cases where there is a backlog of existing jobs but new ones aren't being actively inserted.
This will result in increased fetch load on many installations, with the benefit of increased throughput. As before, `FetchCooldown` still limits how frequently these fetches can occur on each client and can be increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots available. I'm surprised we were doing this in the first place!