More aggressive refetching when full #664

Merged: bgentry merged 2 commits into master from fast-refetch on Nov 2, 2024

Conversation

@bgentry bgentry commented Nov 2, 2024

Building on #663 and the benchmarks in #652, this tunes the client to be more aggressive about fetching when it either just fetched a full batch of jobs, or when it skipped its previously triggered fetch because it was already full. Rather than immediately calling the fetch limiter as in the previous commit, note this situation with a boolean so that we can trigger the fetch limiter after a worker slot has opened up.

This should bring more consistent throughput to poll-only mode and in cases where there is a backlog of existing jobs but new ones aren't being actively inserted.

This will result in increased fetch load on many installations, with the benefit of increased throughput. As before, FetchCooldown still limits how frequently these fetches can occur on each client and can be increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots available. I'm surprised we were doing this in the first place!
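
As a rough sketch of that fetch decision (all names except fetchWhenSlotsAreAvailable are hypothetical; this is not River's actual producer code):

```go
package main

import "fmt"

// producer is a simplified stand-in for the client's producer; everything
// except the fetchWhenSlotsAreAvailable name is illustrative rather than
// River's actual internals.
type producer struct {
	maxWorkers                 int
	numActiveJobs              int
	fetchWhenSlotsAreAvailable bool
}

// onFetchLimiterFired models what happens when a triggered fetch comes due.
// With no free worker slots, the fetch query is skipped entirely and the flag
// records that another fetch should be triggered once a slot opens up. A full
// batch likewise sets the flag, since more jobs are probably waiting.
func (p *producer) onFetchLimiterFired(fetchBatch func(limit int) int) {
	slots := p.maxWorkers - p.numActiveJobs
	if slots <= 0 {
		p.fetchWhenSlotsAreAvailable = true // skipped because already full
		return
	}

	fetched := fetchBatch(slots)
	p.numActiveJobs += fetched

	if fetched == slots {
		p.fetchWhenSlotsAreAvailable = true // full batch: likely more queued
	}
}

func main() {
	p := &producer{maxWorkers: 10, numActiveJobs: 10}
	p.onFetchLimiterFired(func(limit int) int { return limit })
	fmt.Println(p.fetchWhenSlotsAreAvailable) // true: fetch skipped while full
}
```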

Jobs can often be stuck on queues for a long while when jobs are
sporadically enqueued in large batches. Prior to this change jobs would
be fetched every FetchPollInterval (default 1s) unless new jobs were
enqueued. This change reduces that to FetchCooldown (default 100ms) when
a full set of jobs is returned implying that there may be more in the
queue.
@bgentry bgentry requested a review from brandur November 2, 2024 16:42
bgentry commented Nov 2, 2024

@gaffneyc while looking more closely at this, I decided to adjust the approach from #663 to hopefully make it a little more efficient. I realized that there's a single producer goroutine responsible for deciding to fetch new jobs and for removing completed ones from the list of running ones. I also realized that the benchmarks in #652 and the solution proposed in #653 are going to be highly dependent on whether or not any jobs complete before the next attempted fetch. If the jobs are synthetic and ~instantaneous, there will always be slots available on the next attempt, but if they're realistic, odds are the fetch limiter may fire again before we have any slots available—and then we're back to polling!

I believe the approach in this PR is an improvement because it relies on a single boolean in the producer to let us delay triggering the fetch limiter until after we've freed up at least one worker slot. This means that the next triggered fetch will always have room for at least one job, whereas in #663 that's not the case. Because the boolean is only ever read and written by that single goroutine, it avoids adding any mutex overhead. It gets set to true either when the fetch limiter fired but we were already full, or when we fetched a full batch of jobs on the last fetch attempt. In either case it's a good signal that it's probably time to try fetching jobs again as soon as we have space for them.
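
To make that flow concrete, here's a minimal sketch of the completion path (fetchLimiter, onJobCompleted, and the counters are hypothetical names; only fetchWhenSlotsAreAvailable comes from the PR):

```go
package main

import "fmt"

// fetchLimiter stands in for the client's debounced fetch limiter; Call
// schedules a fetch no sooner than FetchCooldown allows. Illustrative only.
type fetchLimiter struct{ triggered int }

func (l *fetchLimiter) Call() { l.triggered++ }

// producer sketches the single goroutine that both schedules fetches and
// removes completed jobs from its running set. Because only that goroutine
// touches fetchWhenSlotsAreAvailable, no mutex is needed.
type producer struct {
	numActiveJobs              int
	fetchWhenSlotsAreAvailable bool
	limiter                    *fetchLimiter
}

// onJobCompleted runs on the producer goroutine as each job finishes. If a
// fetch was deferred earlier (skipped while full, or the last batch was full),
// it is triggered now, so the next fetch has room for at least one job.
func (p *producer) onJobCompleted() {
	p.numActiveJobs--
	if p.fetchWhenSlotsAreAvailable {
		p.fetchWhenSlotsAreAvailable = false
		p.limiter.Call()
	}
}

func main() {
	p := &producer{numActiveJobs: 1, fetchWhenSlotsAreAvailable: true, limiter: &fetchLimiter{}}
	p.onJobCompleted()
	fmt.Println(p.limiter.triggered) // 1: deferred fetch fires once a slot frees
}
```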

If possible, I'd love for you to run your benchmarks again with this branch. My guess is that the benchmarked throughput improvements will be the same (no-op jobs are always very fast), but in real-world use cases this change should be an improvement over your original proposal.

Tune the client to be more aggressive about fetching when it either just
fetched a full batch of jobs, or when it skipped its previously
triggered fetch because it was already full. Rather than immediately
calling the fetch limiter as in the previous commit, note this situation
with a boolean so that we can trigger the fetch limiter _after_ a worker
slot has opened up.

This should bring more consistent throughput to poll-only mode and in
cases where there is a backlog of existing jobs but new ones aren't
being actively inserted.

This will result in increased fetch load on many installations, with the
benefit of increased throughput. As before, `FetchCooldown` still limits
how frequently these fetches can occur on each client and can be
increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots
available.
bgentry commented Nov 2, 2024

Running river bench --duration 30s with these config changes, before and after this PR:

FetchCooldown:     100 * time.Millisecond,
FetchPollInterval: time.Second,
PollOnly:          true,

Before

bench: total jobs worked [      60000 ], total jobs inserted [     490000 ], overall job/sec [     2057.0 ], running 29.169128334s

After

bench: total jobs worked [     590002 ], total jobs inserted [     665000 ], overall job/sec [    19738.5 ], running 29.890884375s
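
For context, here's roughly where those options live when configuring a client in code rather than through the CLI. This is only a sketch: the Queues entry and MaxWorkers value are assumptions and were not part of the benchmark invocation, and a real client also needs Workers and a database driver passed to river.NewClient.

```go
package main

import (
	"fmt"
	"time"

	"github.com/riverqueue/river"
)

func main() {
	// The three fields from the snippet above, shown inside a river.Config.
	config := &river.Config{
		FetchCooldown:     100 * time.Millisecond, // minimum gap between fetches per client
		FetchPollInterval: time.Second,            // poll-only fallback interval
		PollOnly:          true,                   // poll instead of LISTEN/NOTIFY
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // illustrative, not from the benchmark
		},
	}

	fmt.Println(config.FetchCooldown, config.FetchPollInterval, config.PollOnly)
}
```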

@bgentry bgentry merged commit ce141fa into master Nov 2, 2024
14 checks passed
@bgentry bgentry deleted the fast-refetch branch November 2, 2024 19:19
// Fetch returned the maximum number of jobs that were requested,
// implying there may be more in the queue. Trigger another fetch when
// slots are available.
p.fetchWhenSlotsAreAvailable = true
Contributor (commenting on the diff above):

It's not clear to me how this is triggered only when a full batch has been returned. Should it only be set to true when len(results.jobs) == limit?

bgentry (Contributor Author):

Ugh, no, I totally missed that clause 🤦‍♂️ Fix incoming.

bgentry (Contributor Author):

Should be fixed in #668 / v0.14.1. Thanks for catching this 🙏

gaffneyc commented Nov 4, 2024

I trust your knowledge of how River works better than mine, and I didn't look into all the details, so I'm sure there's something I missed. It's going to be at least a few days until I can cycle back here to test it on our code or rerun the benchmarks.

bgentry added a commit that referenced this pull request Nov 4, 2024
In ce141fa / #664, the client was tuned in an attempt to make it more
aggressive about fetching more jobs when it has just fetched a full
batch of jobs. Unfortunately, a conditional was missed and it ended up
fetching aggressively any time a single job was fetched.

This adds the correct conditional so that the
`fetchWhenSlotsAreAvailable` flag only gets set when (a) the previous
fetch was full, or (b) when the previous fetch was skipped due to being
already full of jobs.
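
A minimal sketch of that corrected conditional (function and parameter names are hypothetical, not the actual code shipped in #668):

```go
package main

import "fmt"

// shouldRefetchSoon sketches the corrected check: the flag is set only when
// (a) the previous fetch returned a full batch, or (b) the previously
// triggered fetch was skipped because all worker slots were occupied.
func shouldRefetchSoon(numFetched, limit int, skippedBecauseFull bool) bool {
	switch {
	case skippedBecauseFull:
		return true // (b) fetch skipped while already full of jobs
	case numFetched >= limit:
		return true // (a) full batch implies more jobs may be queued
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRefetchSoon(1, 100, false))   // false; the buggy version refetched here too
	fmt.Println(shouldRefetchSoon(100, 100, false)) // true: full batch
	fmt.Println(shouldRefetchSoon(0, 100, true))    // true: skipped while full
}
```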
bgentry added a commit that referenced this pull request Nov 4, 2024
* make tidy

* fix over-aggressive fetching

In ce141fa / #664, the client was tuned in an attempt to make it more
aggressive about fetching more jobs when it has just fetched a full
batch of jobs. Unfortunately, a conditional was missed and it ended up
fetching aggressively any time a single job was fetched.

This adds the correct conditional so that the
`fetchWhenSlotsAreAvailable` flag only gets set when (a) the previous
fetch was full, or (b) when the previous fetch was skipped due to being
already full of jobs.

* prepare v0.14.1
tigrato pushed a commit to gravitational/river that referenced this pull request Dec 18, 2024
* Immediately fetch another batch when a full set of jobs is returned

* trigger fetch when worker freed up

Co-authored-by: Chris Gaffney <[email protected]>
tigrato pushed a commit to gravitational/river that referenced this pull request Dec 18, 2024
* make tidy

* fix over-aggressive fetching

* prepare v0.14.1