
Lower Ruby Thread priority for jobs by default when running in Async mode #1554

bensheldon opened this issue Nov 26, 2024 · 1 comment

bensheldon commented Nov 26, 2024

It's possible for GoodJob to introduce latency into web requests when running in async mode, especially if the job is heavily CPU-bound, as can happen, for example, when rendering views for Turbo broadcasts.

Thread scheduling and priority seem to be a less-well-understood area of Ruby.

An attempt to explain

Ruby threads are OS threads, and OS threads are preemptively scheduled, meaning the OS is entirely responsible for switching execution between threads. But because of the GVL (Global VM Lock), the Ruby VM actually has a say in when that switching happens.

The Ruby VM has a default thread "Quantum" of 100ms. That means the Ruby VM will grant a thread the GVL for a maximum of 100ms before taking it back and giving it to another thread. That is 100ms of Ruby processing, unless the thread goes into IO, sleeps, or otherwise releases the GVL on its own.

This is an ok way to balance execution across threads, unless those thread workloads are wildly different (homogeneous tasks are always better!).

The dreaded "Tail Latency" of multithreaded behavior can show up, thanks to the Ruby Thread Quantum, when you have what would otherwise be a very short request, for example:

  • A request that could be 10ms because it's making ten 1ms calls to Memcached/Redis to fetch some cached values and then returns them (IO-bound Thread)

...but when it's running in a thread next to:

  • A request that takes 1 second and largely spends its time doing string manipulation, for example a background thread that is taking a bunch of complex hashes and arrays and serializing them into a payload to send to a metrics server. Or rendering slow/big/complex views for Turbo Broadcasts (CPU-bound Thread)

...then the CPU-bound thread will be very greedy about holding the GVL, and the execution will look like this:

  1. IO-bound Thread: Starts 1ms network request and releases GVL
  2. CPU-bound Thread: Does 100ms of work on the CPU before the GVL is taken back
  3. IO-bound Thread: Gets GVL back and starts next 1ms network request and releases GVL
  4. CPU-bound Thread: Does 100ms of work on the CPU before the GVL is taken back
    ....

See where this is going? The IO-bound thread is taking waaaaaaay longer than the 10ms it could ideally take if the other thread wasn't so greedy with the GVL.

I wrote a quick script to simulate this (starting from the script Aaron Patterson wrote in the Ruby issue linked above). As you can see, the IO-bound thread took more than 1 second to complete, far more than the 10ms we expected!

❯ vernier run --interval 1 -- ruby script.rb
starting profiler with interval 1 and allocation interval 0
fib(36) took 1.3947540000081062 seconds
io_total: 1.099480000033509 seconds
cpu_total: 1.3967440000269562 seconds
#<Vernier::Result 6.430636 seconds, 3 threads, 191429 samples, 294 unique>
written to /var/folders/5y/9zpy_s_n6sd6vv3wr62qvp9m0000gn/T/profile20241126-21607-do86i3.vernier.json.gz

And here's what that looks like in Vernier, where you can see the GVL switch back to the IO-bound Thread every 100ms to do the teensy amount of work before handing back to the CPU-bound Thread:

[Image: Vernier profile of the example script]

Example script
def measure
  x = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - x
end

def fib(n)
  if n < 2
    n
  else
    fib(n - 2) + fib(n - 1)
  end
end

# find fib that takes ~1 second
fib_i = 50.times.find { |i| measure { fib(i) } >= 1 }
sleep_i = measure { fib(fib_i) }

puts "fib(#{fib_i}) took #{sleep_i} seconds"

# Simulate a thread that makes ten 1ms IO calls in quick succession
io_thread = Thread.new {
  Thread.current.name = "io_thread"
  io_total = measure {
    10.times { sleep 0.001 }
  }
  puts "io_total: #{io_total} seconds"
}
Thread.pass

# Simulate a thread that makes a CPU-bound call for 1 second
cpu_thread = Thread.new {
  Thread.current.name = "cpu_thread"
  cpu_total = measure {
    fib(fib_i)
  }
  puts "cpu_total: #{cpu_total} seconds"
}
Thread.pass

io_thread.join
cpu_thread.join

How Thread Priority works:

Ruby Thread Priority "is just hint for Ruby thread scheduler. It may be ignored on some platform." But now that that's out of the way, C Ruby's thread priority is calculated as:

The number of bit-shifts applied to the default Thread Quantum (100ms), meaning the quantum is either multiplied (if the priority is positive) or divided (if negative) by a power of 2.

Thread#priority    Calculation      Result
-N                 100ms / (2^N)
-3                 100ms / (2^3)    12.5ms
-2                 100ms / (2^2)    25ms
-1                 100ms / (2^1)    50ms
 0                 100ms            100ms
 1                 100ms * (2^1)    200ms
 2                 100ms * (2^2)    400ms
 3                 100ms * (2^3)    800ms
 N                 100ms * (2^N)

This makes sense because a thread with a lower (negative) priority should release the GVL more frequently (and thus be less greedy).
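
To make that concrete, here's a minimal tweak to the script above (a sketch, I haven't profiled this variant carefully): the CPU-bound thread lowers its own priority to -3, which should shrink its quantum from 100ms to ~12.5ms and let the IO-bound thread get the GVL back roughly 8x more often. As far as I can tell, CRuby clamps Thread#priority to the -3..3 range, so -3 is as low as it goes.

def measure
  x = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - x
end

def fib(n)
  n < 2 ? n : fib(n - 2) + fib(n - 1)
end

# find fib that takes ~1 second
fib_i = 50.times.find { |i| measure { fib(i) } >= 1 }

# Same IO-bound thread as before: ten 1ms sleeps
io_thread = Thread.new {
  io_total = measure { 10.times { sleep 0.001 } }
  puts "io_total: #{io_total} seconds"
}
Thread.pass

# CPU-bound thread, now asking for the smallest GVL quantum
cpu_thread = Thread.new {
  Thread.current.priority = -3 # 100ms / (2^3) = 12.5ms quantum
  cpu_total = measure { fib(fib_i) }
  puts "cpu_total: #{cpu_total} seconds"
}
Thread.pass

io_thread.join
cpu_thread.join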

What does this mean for GoodJob

When running jobs async, in the same process as web requests, we should probably lower the priority. Maybe to -3?

I dunno if we should allow the priority to be set directly via config, or just have a configuration setting like lower_thread_priority = true with a default based on the execution mode that could be overridden.
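
Purely as a sketch of that second option (the option name and the wiring here are hypothetical, nothing like this exists in GoodJob today):

# Hypothetical Rails config, following the existing config.good_job.* style.
# lower_thread_priority is the flag proposed above; it is not a real option yet.
config.good_job.execution_mode = :async
config.good_job.lower_thread_priority = true # could default to true for async execution

# ...and inside GoodJob, the block that runs on each job execution thread
# would do something like this before picking up work (sketch only):
lower_priority = true # stand-in for reading the configured value
Thread.current.priority = -3 if lower_priority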


mperham commented Nov 27, 2024

Thanks for doing this research, Ben. Lower priority seems to make sense when embedding in another process.
