
"Mark-delay" performance improvement to major GC #13580

Draft · wants to merge 2 commits into trunk from nick-markdelay

Conversation

@NickBarnes (Contributor) commented Oct 29, 2024

This is the upstreaming of ocaml-flambda/flambda-backend#2348, ocaml-flambda/flambda-backend#2358 (minor), and ocaml-flambda/flambda-backend#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and that block becoming available again for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always MARKED but changes from UNMARKED to MARKED at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.
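To illustrate the colour switch, here is a minimal sketch; the enum, type, and function names are made up for illustration and are not the identifiers used in runtime/major_gc.c:

```c
/* Illustrative sketch only: names are hypothetical, not the runtime's. */
typedef enum {
  PHASE_SWEEP_ONLY,      /* new phase: sweep the previous cycle's garbage */
  PHASE_MARK_AND_SWEEP   /* roots marked; marking proceeds alongside sweeping */
} gc_phase_t;

typedef enum { COLOUR_UNMARKED, COLOUR_MARKED } alloc_colour_t;

/* Before marking starts, new blocks are allocated UNMARKED ("grey mutator
   allocating white"); once marking begins they are allocated MARKED
   ("black mutator allocating black") so they survive the current cycle. */
static alloc_colour_t allocation_colour(gc_phase_t phase)
{
  return phase == PHASE_SWEEP_ONLY ? COLOUR_UNMARKED : COLOUR_MARKED;
}
```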

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (parallel/churn.ml) written by @stedolan for the flambda-backend mark-delay work.

@NickBarnes force-pushed the nick-markdelay branch 2 times, most recently from 31fc961 to 72500e9 on October 29, 2024, 16:56
@kayceesrk (Contributor) commented Oct 29, 2024

Thanks! I’ll have a look tomorrow.

@NickBarnes added the run-multicoretests label (makes the CI run multicore tests) Oct 29, 2024
Resolved review threads (outdated) on runtime/major_gc.c (×5) and runtime/signals.c.
@NickBarnes force-pushed the nick-markdelay branch 3 times, most recently from b341dd4 to 4a6c7af on October 31, 2024, 09:58
@NickBarnes (Author) commented

I've addressed @kayceesrk's review comments and rebased.

@NickBarnes marked this pull request as ready for review on October 31, 2024, 10:27
@NickBarnes (Author) commented

Thanks to @kayceesrk and @stedolan I think this can come out of draft now.

Resolved review thread (outdated) on Changes.
@kayceesrk (Contributor) commented

The code looks ok now.

MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

Resolved review thread (outdated) on runtime/caml/platform.h.
@kayceesrk (Contributor) commented

> MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

Any insights on this so far?

@NickBarnes (Author) commented

The Win32 problem is due to an accounting error, observed across platforms, which causes work_counter to trail further and further behind alloc_counter. Eventually, on 32-bit platforms, the gap is large enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows this leads to out-of-memory at (just under) a 2 GiB heap. On 32-bit Linux machines with enough memory, the heap grows past 2 GiB and alloc_counter wraps around once more, pushing work budgets back into positive territory; the GC can then make progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms.
Taking this PR back into draft mode until I've got a fix.

@NickBarnes (Author) commented

(note: this accounting problem is specific to this PR, and not observed on trunk).

@NickBarnes (Author) commented Nov 9, 2024

I have further diagnosed the accounting problem, and put in a dirty hack to prove my hypothesis.
On the trunk, we maintain alloc_counter as a global estimate of the GC work necessary to keep up with allocation, and work_counter as a global record of the total amount of GC work done. Both of these counters are intnat sized, and on 32-bit platforms they wrap around. For each slice, in each domain, we work until the work_counter exceeds the slice_target - the value that the alloc_counter had at the start of the slice - or (alternatively) we run out of work to do.
In this way, these two counters remain approximately in sync.
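A simplified sketch of that pacing loop is below; the variable names follow the description above, but the slice logic in runtime/major_gc.c is considerably more involved, and do_some_gc_work here is a stand-in stub:

```c
#include <stdint.h>

typedef intptr_t intnat;            /* stands in for the runtime's intnat */

static intnat alloc_counter;        /* estimate of GC work owed by allocation */
static intnat work_counter;         /* total GC work done so far */

/* Stub standing in for actual sweeping/marking work: pretend there is
   always enough work left to use the whole budget. */
static intnat do_some_gc_work(intnat budget) { return budget; }

/* Simplified shape of one major slice: capture the target, then work
   until we reach it or run out of work.  The comparison is done on an
   intnat difference, so wrap-around on 32-bit platforms is harmless as
   long as the two counters stay close to each other. */
static void major_slice(void)
{
  intnat slice_target = alloc_counter;
  while (slice_target - work_counter > 0) {
    intnat done = do_some_gc_work(slice_target - work_counter);
    if (done == 0) break;           /* no work left in this cycle */
    work_counter += done;
  }
}
```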
However, on this branch, we delay marking until sweeping has finished on at least one domain, and we always start marking in a fresh slice (after a minor GC). So in the last sweeping slice (on the domain which reaches the end of sweeping first), roughly half of the work expected in the slice goes unused, and work_counter slips behind alloc_counter by that amount. At the end of the slice, instead of work_counter slightly exceeding slice_target, or not quite reaching it (if we got to the end of the collection), it falls behind by (on average) half a slice's worth of work. Depending on the workload, there may be too little marking to catch up on the next slice (for instance, testsuite/tests/regression/pr5757/pr5757.ml has very little marking to do).
So, as the process continues, work_counter gradually falls further and further behind alloc_counter. This may lead to larger slice budgets and thus longer (but fewer) GC pauses, but should be otherwise harmless. However, as described in a previous comment, on 32-bit platforms the shortfall eventually triggers an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows we eventually hit the 2 GiB process memory limit and run out of memory with a (just under) 2 GiB heap. On 32-bit Linux machines with enough memory, we can continue past that point (there is a 3 GiB process memory limit), advancing alloc_counter until it overtakes work_counter and pushes work budgets back into positive territory. The GC is then able to progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms. Instead the slice targets grow larger and larger, which may exacerbate the problem as all the sweeping is easily completed in a single slice.
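To make the 32-bit failure mode concrete, here is a standalone illustration (not runtime code; the per-cycle numbers are invented) of how a growing gap between the two counters flips the sign of a 32-bit budget, assuming the budget is computed as a signed difference as sketched above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  uint32_t alloc_counter = 0;   /* work "owed" by allocation */
  uint32_t work_counter  = 0;   /* work actually done */

  /* Each cycle the work counter falls a little further behind. */
  for (long cycle = 0; cycle < 100000; cycle++) {
    alloc_counter += 50000;     /* allocation keeps adding work */
    work_counter  += 20000;     /* last sweep slice of the cycle ends early */
  }

  /* The accumulated gap has grown past 2^31, so reinterpreting the
     difference as a signed 32-bit quantity yields a negative budget:
     the GC believes it is ahead of schedule and stops doing work. */
  int32_t budget = (int32_t)(alloc_counter - work_counter);
  printf("budget = %" PRId32 "\n", budget);   /* prints a negative number */
  return 0;
}
```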

My hack, which should not be merged, artificially consumes the rest of the slice budget at the point at which sweeping is first completed; it demonstrates that addressing this accounting issue fixes the problem.
Paging @stedolan and @damiendoligez for thoughts.
I also have a sketch of a refactor of major_collection_slice to make this sort of thing easier to reason about. I hope to make a PR for that in the next week or two.
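For reference, the rough shape of that workaround is sketched below; this is the idea only, not the actual commit, and the names are illustrative:

```c
#include <stdint.h>

typedef intptr_t intnat;   /* stands in for the runtime's intnat */

/* When a domain first finishes sweeping, credit the unused remainder of
   the slice budget so that work_counter does not slip behind
   alloc_counter: the "artificially consume the rest of the slice budget"
   idea described above. */
static void consume_remaining_budget(intnat slice_target, intnat *work_counter)
{
  intnat remaining = slice_target - *work_counter;
  if (remaining > 0)
    *work_counter += remaining;
}
```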

@kayceesrk (Contributor) commented

Would switching to 64-bit counters fix this problem?

@NickBarnes (Author) commented

64-bit counters would prevent the pr5757.ml test failure, but (if my analysis is correct) the underlying accounting problem would remain, which would lead, I think, to unusually long slices.

@NickBarnes (Author) commented Nov 12, 2024

Most workloads will have either enough sweeping or enough marking to reach the slice target in some slice, before the shortfall reaches problematic levels. pr5757.ml is unusual in several distinct ways which provoke this problem:

  • The "live" (reachable) heap is always small, so the marking work is never enough to consume a whole slice (on my test machine, at this commit, with the bytecode backend, we do exactly 8073 words of marking work on every single GC cycle).
  • The allocated heap is small (until we reach the failure point and collection stops), so the sweeping work is completed in a single slice. At this commit, with the bytecode backend, the sweeping is always completed in the first slice of a collection, even at the start of the test when the slice target is smallest.
  • The units of allocation are almost always large (hundreds of thousands of words), making the slice target very large after even the smallest amount of mutator work.
  • It's a single-domain test, so other domains cannot use up the residual work of any slice (such as the mark-requesting slice, on which the first domain completes sweeping).

In fact, very few cycles of this test, out of over 60,000, begin with work_counter closer to alloc_counter than it was at the start of the previous cycle. These are basically the ones in which the random allocation performed by the mutator is unusually small.

Pending a more far-reaching rework of the pacing system, there are a few obvious changes which could address this problem, without the blunt approach of my hack in af5fb77:

  • At the start of a slice, notice if work_counter is much further behind alloc_counter than could reasonably be accounted for by the current size of the heap, and adjust work_counter accordingly (a sketch follows this list).
  • On single-domain programs, can we mark the roots and enter marking in the same slice as we complete sweeping?
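A sketch of the first option, with max_expected_lag standing in for some bound derived from the heap size (both the function and the bound are hypothetical, not proposed code):

```c
#include <stdint.h>

typedef intptr_t intnat;   /* stands in for the runtime's intnat */

/* At the start of a slice, if work_counter has fallen further behind
   alloc_counter than the current heap size can account for, pull it
   forward.  max_expected_lag is a hypothetical bound, e.g. proportional
   to the number of words in the major heap. */
static void clamp_work_counter(intnat *work_counter, intnat alloc_counter,
                               intnat max_expected_lag)
{
  intnat lag = alloc_counter - *work_counter;
  if (lag > max_expected_lag)
    *work_counter = alloc_counter - max_expected_lag;
}
```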

@NickBarnes (Author) commented

This is alloc_counter - work_counter at the start of each cycle on a 64-bit machine.
[screenshot]

Zoomed into the first 50 cycles:
[screenshot]

Here for comparison are alloc_counter and work_counter on a failing Win32 run. You can see when collection stops, at around slice 21080. Note: x-axis here is slice number, not cycle number.
[screenshot]

@NickBarnes (Author) commented

Cycle counts, bucketed by the alloc_counter increment since the start of the previous cycle minus the work done (bucket width 10 kwords). It's not quite as simple as my hand-waving explanation in a previous comment.
[screenshot]

NickBarnes and others added 2 commits November 14, 2024 16:43, including:
  • …unter at the start of any slice when it falls very far behind alloc_counter.
@NickBarnes (Author) commented Nov 15, 2024

This still has a failing multicore test, in which we discover that orphaned ephemerons can have the wrong colour bits.

bash-3.2$ OCAMLRUNPARAM="s=4096" _build/default/src/ephemeron/lin_tests.exe -v -s 321821727
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 321821727
generated error fail pass / total     time test name
[ ]   24    0    0   24 / 1000     0.1s Lin Ephemeron stress test with Domain[02] file runtime/major_gc.c; line 402 ### Assertion failed: Has_status_val(v, status)
Trace/BPT trap: 5
bash-3.2$
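For context, the failing assertion checks that a value's header carries the colour/status bits the caller expects. A rough illustration of that kind of check follows; the header layout and macro names are assumptions for illustration, not the runtime's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t header_t;

/* Assumed layout: two status/colour bits somewhere in the header. */
#define STATUS_SHIFT        8
#define STATUS_MASK         ((header_t)3 << STATUS_SHIFT)
#define Status_of(hd)       ((hd) & STATUS_MASK)
#define Has_status(hd, s)   (Status_of(hd) == (s))

/* The debug runtime asserts something of this shape; per the comment
   above, orphaned ephemerons can end up with the wrong colour bits
   under this PR, so the check fails. */
static void check_status(header_t hd, header_t expected_status)
{
  assert(Has_status(hd, expected_status));
}
```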

So still not ready for review.
