More aggressive refetching when full #664

Merged: bgentry merged 2 commits into master from fast-refetch on Nov 2, 2024

Conversation

@bgentry bgentry commented Nov 2, 2024

Building on #663 and the benchmarks in #652, this tunes the client to be more aggressive about fetching when it either just fetched a full batch of jobs, or when it skipped its previously triggered fetch because it was already full. Rather than immediately calling the fetch limiter as in the previous commit, note this situation with a boolean so that we can trigger the fetch limiter after a worker slot has opened up.

This should bring more consistent throughput to poll-only mode and in cases where there is a backlog of existing jobs but new ones aren't being actively inserted.

This will result in increased fetch load on many installations, with the benefit of increased throughput. As before, FetchCooldown still limits how frequently these fetches can occur on each client and can be increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots available. I'm surprised we were doing this in the first place!
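
As a rough sketch of that fetch decision (all names except fetchWhenSlotsAreAvailable are hypothetical; this is not River's actual producer code):

```go
package main

import "fmt"

// producer is a simplified stand-in for the client's producer; everything
// except the fetchWhenSlotsAreAvailable name is illustrative rather than
// River's actual internals.
type producer struct {
	maxWorkers                 int
	numActiveJobs              int
	fetchWhenSlotsAreAvailable bool
}

// onFetchLimiterFired models what happens when a triggered fetch comes due.
// With no free worker slots, the fetch query is skipped entirely and the flag
// records that another fetch should be triggered once a slot opens up. A full
// batch likewise sets the flag, since more jobs are probably waiting.
func (p *producer) onFetchLimiterFired(fetchBatch func(limit int) int) {
	slots := p.maxWorkers - p.numActiveJobs
	if slots <= 0 {
		p.fetchWhenSlotsAreAvailable = true // skipped because already full
		return
	}

	fetched := fetchBatch(slots)
	p.numActiveJobs += fetched

	if fetched == slots {
		p.fetchWhenSlotsAreAvailable = true // full batch: likely more queued
	}
}

func main() {
	p := &producer{maxWorkers: 10, numActiveJobs: 10}
	p.onFetchLimiterFired(func(limit int) int { return limit })
	fmt.Println(p.fetchWhenSlotsAreAvailable) // true: fetch skipped while full
}
```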

Jobs can often be stuck on queues for a long while when jobs are
sporadically enqueued in large batches. Prior to this change jobs would
be fetched every FetchPollInterval (default 1s) unless new jobs were
enqueued. This change reduces that to FetchCooldown (default 100ms) when
a full set of jobs is returned implying that there may be more in the
queue.
@bgentry bgentry requested a review from brandur November 2, 2024 16:42
bgentry commented Nov 2, 2024

@gaffneyc while looking more closely at this, I decided to adjust the approach from #663 to hopefully make it a little more efficient. I realized that there's a single producer goroutine responsible for deciding to fetch new jobs and for removing completed ones from the list of running ones. I also realized that the benchmarks in #652 and the solution proposed in #653 are going to be highly dependent on whether or not any jobs complete before the next attempted fetch. If the jobs are synthetic and ~instantaneous, there will always be slots available on the next attempt, but if they're realistic, odds are the fetch limiter may fire again before we have any slots available—and then we're back to polling!

I believe the approach in this PR is an improvement because it relies on a single boolean in the producer to let us delay triggering the fetch limiter until after we've freed up at least one worker slot. This means that the next triggered fetch will always have room for at least one job, whereas in #663 that's not the case. Because the boolean is only ever read and written by that single goroutine, it avoids adding any mutex overhead. It gets set to true either when the fetch limiter fired but we were already full, or when we fetched a full batch of jobs on the last fetch attempt. In either case it's a good signal that it's probably time to try fetching jobs again as soon as we have space for them.
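
To make that flow concrete, here's a minimal sketch of the completion path (fetchLimiter, onJobCompleted, and the counters are hypothetical names; only fetchWhenSlotsAreAvailable comes from the PR):

```go
package main

import "fmt"

// fetchLimiter stands in for the client's debounced fetch limiter; Call
// schedules a fetch no sooner than FetchCooldown allows. Illustrative only.
type fetchLimiter struct{ triggered int }

func (l *fetchLimiter) Call() { l.triggered++ }

// producer sketches the single goroutine that both schedules fetches and
// removes completed jobs from its running set. Because only that goroutine
// touches fetchWhenSlotsAreAvailable, no mutex is needed.
type producer struct {
	numActiveJobs              int
	fetchWhenSlotsAreAvailable bool
	limiter                    *fetchLimiter
}

// onJobCompleted runs on the producer goroutine as each job finishes. If a
// fetch was deferred earlier (skipped while full, or the last batch was full),
// it is triggered now, so the next fetch has room for at least one job.
func (p *producer) onJobCompleted() {
	p.numActiveJobs--
	if p.fetchWhenSlotsAreAvailable {
		p.fetchWhenSlotsAreAvailable = false
		p.limiter.Call()
	}
}

func main() {
	p := &producer{numActiveJobs: 1, fetchWhenSlotsAreAvailable: true, limiter: &fetchLimiter{}}
	p.onJobCompleted()
	fmt.Println(p.limiter.triggered) // 1: deferred fetch fires once a slot frees
}
```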

If possible, I'd love for you to run your benchmarks again with this branch. My guess is that the benchmarked throughput improvements will be the same (no-op jobs are always very fast), but in real-world use cases this change should be an improvement over your original proposal.

Tune the client to be more aggressive about fetching when it either just
fetched a full batch of jobs, or when it skipped its previously
triggered fetch because it was already full. Rather than immediately
calling the fetch limiter as in the previous commit, note this situation
with a boolean so that we can trigger the fetch limiter _after_ a worker
slot has opened up.

This should bring more consistent throughput to poll-only mode and in
cases where there is a backlog of existing jobs but new ones aren't
being actively inserted.

This will result in increased fetch load on many installations, with the
benefit of increased throughput. As before, `FetchCooldown` still limits
how frequently these fetches can occur on each client and can be
increased to reduce the amount of fetch querying.

Finally, don't run the fetch query if there are no worker slots
available.
bgentry commented Nov 2, 2024

Running river bench --duration 30s with these config changes, before and after this PR:

FetchCooldown:     100 * time.Millisecond,
FetchPollInterval: time.Second,
PollOnly:          true,

Before

bench: total jobs worked [      60000 ], total jobs inserted [     490000 ], overall job/sec [     2057.0 ], running 29.169128334s

After

bench: total jobs worked [     590002 ], total jobs inserted [     665000 ], overall job/sec [    19738.5 ], running 29.890884375s
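
For context, here's roughly where those options live when configuring a client in code rather than through the CLI. This is only a sketch: the Queues entry and MaxWorkers value are assumptions and were not part of the benchmark invocation, and a real client also needs Workers and a database driver passed to river.NewClient.

```go
package main

import (
	"fmt"
	"time"

	"github.com/riverqueue/river"
)

func main() {
	// The three fields from the snippet above, shown inside a river.Config.
	config := &river.Config{
		FetchCooldown:     100 * time.Millisecond, // minimum gap between fetches per client
		FetchPollInterval: time.Second,            // poll-only fallback interval
		PollOnly:          true,                   // poll instead of LISTEN/NOTIFY
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // illustrative, not from the benchmark
		},
	}

	fmt.Println(config.FetchCooldown, config.FetchPollInterval, config.PollOnly)
}
```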

@bgentry bgentry merged commit ce141fa into master Nov 2, 2024
14 checks passed
@bgentry bgentry deleted the fast-refetch branch November 2, 2024 19:19
// Fetch returned the maximum number of jobs that were requested,
// implying there may be more in the queue. Trigger another fetch when
// slots are available.
p.fetchWhenSlotsAreAvailable = true
Contributor (commenting on the diff above):

It's not clear to me how this is triggered only when a full batch has been returned. Should it only be set to true when len(results.jobs) == limit?

bgentry (Contributor Author):

Ugh, no, I totally missed that clause 🤦‍♂️ Fix incoming.

bgentry (Contributor Author):

Should be fixed in #668 / v0.14.1. Thanks for catching this 🙏

gaffneyc commented Nov 4, 2024

I trust your knowledge of how River works better than mine, and I didn't look into all the details, so I'm sure there's something I missed. It's going to be at least a few days until I can cycle back here to test it on our code or rerun the benchmarks.

bgentry added a commit that referenced this pull request Nov 4, 2024
In ce141fa / #664, the client was tuned in an attempt to make it more
aggressive about fetching more jobs when it has just fetched a full
batch of jobs. Unfortunately, a conditional was missed and it ended up
fetching aggressively any time a single job was fetched.

This adds the correct conditional so that the
`fetchWhenSlotsAreAvailable` flag only gets set when (a) the previous
fetch was full, or (b) when the previous fetch was skipped due to being
already full of jobs.
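
A minimal sketch of that corrected conditional (function and parameter names are hypothetical, not the actual code shipped in #668):

```go
package main

import "fmt"

// shouldRefetchSoon sketches the corrected check: the flag is set only when
// (a) the previous fetch returned a full batch, or (b) the previously
// triggered fetch was skipped because all worker slots were occupied.
func shouldRefetchSoon(numFetched, limit int, skippedBecauseFull bool) bool {
	switch {
	case skippedBecauseFull:
		return true // (b) fetch skipped while already full of jobs
	case numFetched >= limit:
		return true // (a) full batch implies more jobs may be queued
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRefetchSoon(1, 100, false))   // false; the buggy version refetched here too
	fmt.Println(shouldRefetchSoon(100, 100, false)) // true: full batch
	fmt.Println(shouldRefetchSoon(0, 100, true))    // true: skipped while full
}
```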
bgentry added a commit that referenced this pull request Nov 4, 2024
* make tidy

* fix over-aggressive fetching

In ce141fa / #664, the client was tuned in an attempt to make it more
aggressive about fetching more jobs when it has just fetched a full
batch of jobs. Unfortunately, a conditional was missed and it ended up
fetching aggressively any time a single job was fetched.

This adds the correct conditional so that the
`fetchWhenSlotsAreAvailable` flag only gets set when (a) the previous
fetch was full, or (b) when the previous fetch was skipped due to being
already full of jobs.

* prepare v0.14.1
tigrato pushed a commit to gravitational/river that referenced this pull request Dec 18, 2024
* Immediately fetch another batch when a full set of jobs is returned

* trigger fetch when worker freed up

Co-authored-by: Chris Gaffney <[email protected]>
tigrato pushed a commit to gravitational/river that referenced this pull request Dec 18, 2024
* make tidy

* fix over-aggressive fetching

* prepare v0.14.1