It's possible for GoodJob to introduce latency into web requests when running in async mode, especially if the GoodJob job is heavily CPU-bound, as can happen, for example, when rendering views for Turbo broadcasts.
Thread scheduling and priority seem to be a less-than-well-understood area of Ruby.
GitLab has an open issue about setting job priority based on workload type. It seems like they've categorized their jobs based on whether they are CPU-bound, but they have not yet set a thread priority.
An attempt to explain

Ruby threads are OS threads, and OS threads are preemptive, meaning the OS is entirely responsible for switching execution between threads. But because of the GVL (Global VM Lock), the Ruby VM actually has a say in when that switching happens.
The Ruby VM has a default thread "Quantum" of 100ms. That means the Ruby VM will grant a thread the GVL for a maximum of 100ms before taking it back and giving it to another thread. That is 100ms of Ruby processing, unless the thread goes into IO, sleeps, or otherwise releases the GVL on its own.
This is an OK way to balance execution across threads, unless those threads' workloads are wildly different (homogeneous tasks are always better!).
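To see the quantum in action, here's a minimal sketch (my own illustration, not from the script below; exact timings are approximate and platform-dependent). A thread that asks to wake after 1ms can end up waiting close to a full quantum for the GVL while a CPU-bound thread runs:

# CPU-bound thread that never voluntarily releases the GVL
cpu = Thread.new do
  loop { 1_000_000.times { |i| i * i } }
end

last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
5.times do
  sleep 0.001 # ask to wake after 1ms...
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  puts format("woke after %.1fms", (now - last) * 1000) # ...but can wait up to ~100ms for the GVL
  last = now
end
cpu.kill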
The dreaded "Tail Latency" of multithreaded behavior, related to the Ruby Thread Quantum, can happen when you have what might otherwise be a very short request, for example:
A request that could be 10ms because it's making ten 1ms calls to Memcached/Redis to fetch some cached values and then returns them (IO-bound Thread)
...but when it's running in a thread next to:
A request that takes 1 second and largely spends its time doing string manipulation: for example, a background thread that takes a bunch of complex hashes and arrays and serializes them into a payload to send to a metrics server, or that renders slow/big/complex views for Turbo Broadcasts (CPU-bound Thread)
...then the CPU-bound thread will be very greedy about holding the GVL, and execution will look like this:
IO-bound Thread: Starts 1ms network request and releases GVL
CPU-bound Thread: Does 100ms of work on the CPU before the GVL is taken back
IO-bound Thread: Gets GVL back and starts next 1ms network request and releases GVL
CPU-bound Thread: Does 100ms of work on the CPU before the GVL is taken back
....
See where this is going? The IO-bound thread is taking waaaaaaay longer than the 10ms it could ideally take if the other thread wasn't so greedy with the GVL.
I wrote a quick script to simulate this (starting from the script Aaron Patterson wrote in the Ruby issue linked above). As you can see, the IO-bound Thread took more than 1 second to complete, far more than the 10ms we expected!
❯ vernier run --interval 1 -- ruby script.rb
starting profiler with interval 1 and allocation interval 0
fib(36) took 1.3947540000081062 seconds
io_total: 1.099480000033509 seconds
cpu_total: 1.3967440000269562 seconds
#<Vernier::Result 6.430636 seconds, 3 threads, 191429 samples, 294 unique>
written to /var/folders/5y/9zpy_s_n6sd6vv3wr62qvp9m0000gn/T/profile20241126-21607-do86i3.vernier.json.gz
And here's what that looks like in Vernier, where you can see the GVL switch back to the IO-bound Thread every 100ms to do the teensy amount of work before handing back to the CPU-bound Thread:
Example script
def measure
  x = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - x
end

def fib(n)
  if n < 2
    n
  else
    fib(n - 2) + fib(n - 1)
  end
end

# find fib that takes ~1 second
fib_i = 50.times.find { |i| measure { fib(i) } >= 1 }
sleep_i = measure { fib(fib_i) }
puts "fib(#{fib_i}) took #{sleep_i} seconds"

# Simulate a thread that makes ten 1ms IO calls in quick succession
io_thread = Thread.new {
  Thread.current.name = "io_thread"
  io_total = measure { 10.times { sleep 0.001 } }
  puts "io_total: #{io_total} seconds"
}
Thread.pass

# Simulate a thread that makes a CPU-bound call for 1 second
cpu_thread = Thread.new {
  Thread.current.name = "cpu_thread"
  cpu_total = measure { fib(fib_i) }
  puts "cpu_total: #{cpu_total} seconds"
}
Thread.pass

io_thread.join
cpu_thread.join
How does Thread Priority work?
Ruby Thread Priority "is just hint for Ruby thread scheduler. It may be ignored on some platform." But now that that's out of the way, CRuby's thread priority is calculated as the number of bit-shifts applied to the default Thread Quantum (100ms): the quantum is multiplied by powers of 2 for a positive priority, or divided by powers of 2 for a negative one.

This makes sense because a lower (negative) priority means a thread releases its GVL more frequently (and is thus less greedy).
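Worked out concretely, here's a small sketch of that bit-shift arithmetic (my reading of the behavior described above, not CRuby's actual source; as I understand it, CRuby also clamps priority to the range -3..3):

DEFAULT_QUANTUM_MS = 100.0

def effective_quantum_ms(priority)
  if priority >= 0
    DEFAULT_QUANTUM_MS * (2**priority)  # left shift: doubles per step
  else
    DEFAULT_QUANTUM_MS / (2**-priority) # right shift: halves per step
  end
end

(-3..3).each do |priority|
  puts format("priority %2d => quantum %.1fms", priority, effective_quantum_ms(priority))
end
# priority -3 => quantum 12.5ms
# priority  0 => quantum 100.0ms
# priority  3 => quantum 800.0ms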
What does this mean for GoodJob?
When running jobs async, in the same process as web requests, we should probably lower the job threads' priority. Maybe to -3?
I'm not sure whether we should allow the priority to be set directly via config, or just have a configuration setting like lower_thread_priority = true with a default based on the execution mode that could be overridden.
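Whatever the configuration looks like, the effect would be something like this sketch (hypothetical, not an actual GoodJob API; the thread body is a stand-in for dequeuing and performing a CPU-bound job):

worker = Thread.new do
  Thread.current.name = "GoodJob-worker"
  Thread.current.priority = -3 # hint: smaller quantum, release the GVL more often

  # stand-in for performing a CPU-bound job
  500_000.times { Math.sqrt(rand) }
end
worker.join

At priority -3, a CPU-bound job would hold the GVL for roughly 12.5ms at a time instead of 100ms, which bounds how much latency it can add to web requests running in the same process.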