
"Mark-delay" performance improvement to major GC #13580

Draft · wants to merge 2 commits into trunk from nick-markdelay

Conversation

@NickBarnes (Contributor) commented Oct 29, 2024

This is the upstreaming of ocaml-flambda/flambda-backend#2348, ocaml-flambda/flambda-backend#2358 (minor), and ocaml-flambda/flambda-backend#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and that block becoming available again for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always MARKED but changes from UNMARKED to MARKED at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.
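To illustrate the colour switch, here is a minimal sketch; the enum, type, and function names are made up for illustration and are not the identifiers used in runtime/major_gc.c:

```c
/* Illustrative sketch only: names are hypothetical, not the runtime's. */
typedef enum {
  PHASE_SWEEP_ONLY,      /* new phase: sweep the previous cycle's garbage */
  PHASE_MARK_AND_SWEEP   /* roots marked; marking proceeds alongside sweeping */
} gc_phase_t;

typedef enum { COLOUR_UNMARKED, COLOUR_MARKED } alloc_colour_t;

/* Before marking starts, new blocks are allocated UNMARKED ("grey mutator
   allocating white"); once marking begins they are allocated MARKED
   ("black mutator allocating black") so they survive the current cycle. */
static alloc_colour_t allocation_colour(gc_phase_t phase)
{
  return phase == PHASE_SWEEP_ONLY ? COLOUR_UNMARKED : COLOUR_MARKED;
}
```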

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (parallel/churn.ml) written by @stedolan for the flambda-backend mark-delay work.

@NickBarnes force-pushed the nick-markdelay branch 2 times, most recently from 31fc961 to 72500e9 on October 29, 2024, 16:56
@kayceesrk (Contributor) commented Oct 29, 2024

Thanks! I’ll have a look tomorrow.

@NickBarnes added the run-multicoretests label (makes the CI run multicore tests) Oct 29, 2024
Resolved review threads (outdated) on runtime/major_gc.c (×5) and runtime/signals.c.
@NickBarnes force-pushed the nick-markdelay branch 3 times, most recently from b341dd4 to 4a6c7af on October 31, 2024, 09:58
@NickBarnes (Author) commented

I've addressed @kayceesrk's review comments and rebased.

@NickBarnes marked this pull request as ready for review on October 31, 2024, 10:27
@NickBarnes (Author) commented

Thanks to @kayceesrk and @stedolan I think this can come out of draft now.

Resolved review thread (outdated) on Changes.
@kayceesrk (Contributor) commented

The code looks ok now.

MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

Resolved review thread (outdated) on runtime/caml/platform.h.
@kayceesrk (Contributor) commented

> MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

Any insights on this so far?

@NickBarnes (Author) commented

The Win32 problem is due to an accounting error, observed across platforms, which causes work_counter to trail further and further behind alloc_counter. Eventually, on 32-bit platforms, the gap is large enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows this leads to out-of-memory at (just under) a 2 GiB heap. On 32-bit Linux machines with enough memory, the heap grows past 2 GiB and alloc_counter wraps around once more, pushing work budgets back into positive territory; the GC can then make progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms.
Taking this PR back into draft mode until I've got a fix.

@NickBarnes (Author) commented

(note: this accounting problem is specific to this PR, and not observed on trunk).

@NickBarnes (Author) commented Nov 9, 2024

I have further diagnosed the accounting problem, and put in a dirty hack to prove my hypothesis.
On the trunk, we maintain alloc_counter as a global estimate of the GC work necessary to keep up with allocation, and work_counter as a global record of the total amount of GC work done. Both of these counters are intnat sized, and on 32-bit platforms they wrap around. For each slice, in each domain, we work until the work_counter exceeds the slice_target - the value that the alloc_counter had at the start of the slice - or (alternatively) we run out of work to do.
In this way, these two counters remain approximately in sync.
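A simplified sketch of that pacing loop is below; the variable names follow the description above, but the slice logic in runtime/major_gc.c is considerably more involved, and do_some_gc_work here is a stand-in stub:

```c
#include <stdint.h>

typedef intptr_t intnat;            /* stands in for the runtime's intnat */

static intnat alloc_counter;        /* estimate of GC work owed by allocation */
static intnat work_counter;         /* total GC work done so far */

/* Stub standing in for actual sweeping/marking work: pretend there is
   always enough work left to use the whole budget. */
static intnat do_some_gc_work(intnat budget) { return budget; }

/* Simplified shape of one major slice: capture the target, then work
   until we reach it or run out of work.  The comparison is done on an
   intnat difference, so wrap-around on 32-bit platforms is harmless as
   long as the two counters stay close to each other. */
static void major_slice(void)
{
  intnat slice_target = alloc_counter;
  while (slice_target - work_counter > 0) {
    intnat done = do_some_gc_work(slice_target - work_counter);
    if (done == 0) break;           /* no work left in this cycle */
    work_counter += done;
  }
}
```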
However, on this branch, we delay marking until sweeping has finished on at least one domain, and we always start marking in a fresh slice (after a minor GC). So in the last sweeping slice (on the domain which reaches the end of sweeping first), roughly half of the work expected in the slice goes unused, and work_counter slips behind alloc_counter by that amount. At the end of the slice, instead of work_counter slightly exceeding slice_target, or not quite reaching it (if we got to the end of the collection), it falls behind by (on average) half a slice's worth of work. Depending on the workload, there may be too little marking to catch up on the next slice (for instance, testsuite/tests/regression/pr5757/pr5757.ml has very little marking to do).
So, as the process continues, work_counter gradually falls further and further behind alloc_counter. This may lead to larger slice budgets and thus longer (but fewer) GC pauses, but should be otherwise harmless. However, as described in a previous comment, on 32-bit platforms the shortfall eventually triggers an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows we eventually hit the 2 GiB process memory limit and run out of memory with a (just under) 2 GiB heap. On 32-bit Linux machines with enough memory, we can continue past that point (there is a 3 GiB process memory limit), advancing alloc_counter until it overtakes work_counter and pushes work budgets back into positive territory. The GC is then able to progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms. Instead the slice targets grow larger and larger, which may exacerbate the problem as all the sweeping is easily completed in a single slice.
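To make the 32-bit failure mode concrete, here is a standalone illustration (not runtime code; the per-cycle numbers are invented) of how a growing gap between the two counters flips the sign of a 32-bit budget, assuming the budget is computed as a signed difference as sketched above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  uint32_t alloc_counter = 0;   /* work "owed" by allocation */
  uint32_t work_counter  = 0;   /* work actually done */

  /* Each cycle the work counter falls a little further behind. */
  for (long cycle = 0; cycle < 100000; cycle++) {
    alloc_counter += 50000;     /* allocation keeps adding work */
    work_counter  += 20000;     /* last sweep slice of the cycle ends early */
  }

  /* The accumulated gap has grown past 2^31, so reinterpreting the
     difference as a signed 32-bit quantity yields a negative budget:
     the GC believes it is ahead of schedule and stops doing work. */
  int32_t budget = (int32_t)(alloc_counter - work_counter);
  printf("budget = %" PRId32 "\n", budget);   /* prints a negative number */
  return 0;
}
```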

My hack, which should not be merged, artificially consumes the rest of the slice budget at the point at which sweeping is first completed; it demonstrates that addressing this accounting issue fixes the problem.
Paging @stedolan and @damiendoligez for thoughts.
I also have a sketch of a refactor of major_collection_slice to make this sort of thing easier to reason about. I hope to make a PR for that in the next week or two.
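For reference, the rough shape of that workaround is sketched below; this is the idea only, not the actual commit, and the names are illustrative:

```c
#include <stdint.h>

typedef intptr_t intnat;   /* stands in for the runtime's intnat */

/* When a domain first finishes sweeping, credit the unused remainder of
   the slice budget so that work_counter does not slip behind
   alloc_counter: the "artificially consume the rest of the slice budget"
   idea described above. */
static void consume_remaining_budget(intnat slice_target, intnat *work_counter)
{
  intnat remaining = slice_target - *work_counter;
  if (remaining > 0)
    *work_counter += remaining;
}
```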

@kayceesrk (Contributor) commented

Would switching to 64-bit counters fix this problem?

@NickBarnes (Author) commented

64-bit counters would prevent the pr5757.ml test failure, but (if my analysis is correct) the underlying accounting problem would remain, which would lead, I think, to unusually long slices.

@NickBarnes (Author) commented Nov 12, 2024

Most workloads will have either enough sweeping or enough marking to reach the slice target in some slice, before the shortfall reaches problematic levels. pr5757.ml is unusual in several distinct ways which provoke this problem:

  • The "live" (reachable) heap is always small, so the marking work is never enough to consume a whole slice (on my test machine, at this commit, with the bytecode backend, we do exactly 8073 words of marking work on every single GC cycle).
  • The allocated heap is small (until we reach the failure point and collection stops), so the sweeping work is completed in a single slice. At this commit, with the bytecode backend, the sweeping is always completed in the first slice of a collection, even at the start of the test when the slice target is smallest.
  • The units of allocation are almost always large (hundreds of thousands of words), making the slice target very large after even the smallest amount of mutator work.
  • It's a single-domain test, so other domains cannot use up the residual work of any slice (such as the mark-requesting slice, on which the first domain completes sweeping).

In fact, very few cycles of this test, out of over 60,000, begin with work_counter closer to alloc_counter than it was at the start of the previous cycle. These are basically the ones in which the random allocation performed by the mutator is unusually small.

Pending a more far-reaching rework of the pacing system, there are a few obvious changes which could address this problem, without the blunt approach of my hack in af5fb77:

  • At the start of a slice, notice if work_counter is much further behind alloc_counter than could reasonably be accounted for by the current size of the heap, and adjust work_counter accordingly (a sketch follows this list).
  • On single-domain programs, can we mark the roots and enter marking in the same slice as we complete sweeping?
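A sketch of the first option, with max_expected_lag standing in for some bound derived from the heap size (both the function and the bound are hypothetical, not proposed code):

```c
#include <stdint.h>

typedef intptr_t intnat;   /* stands in for the runtime's intnat */

/* At the start of a slice, if work_counter has fallen further behind
   alloc_counter than the current heap size can account for, pull it
   forward.  max_expected_lag is a hypothetical bound, e.g. proportional
   to the number of words in the major heap. */
static void clamp_work_counter(intnat *work_counter, intnat alloc_counter,
                               intnat max_expected_lag)
{
  intnat lag = alloc_counter - *work_counter;
  if (lag > max_expected_lag)
    *work_counter = alloc_counter - max_expected_lag;
}
```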

@NickBarnes (Author) commented

This is alloc_counter - work_counter at the start of each cycle on a 64-bit machine.
[screenshot]

Zoomed into the first 50 cycles:
[screenshot]

Here for comparison are alloc_counter and work_counter on a failing Win32 run. You can see when collection stops, at around slice 21080. Note: x-axis here is slice number, not cycle number.
[screenshot]

@NickBarnes (Author) commented

Cycle counts, bucketed by the alloc_counter increment since the start of the previous cycle minus the work done (bucket width 10 kwords). It's not quite as simple as my hand-waving explanation in a previous comment.
[screenshot]

NickBarnes and others added 2 commits November 14, 2024 16:43, including:
  • …unter at the start of any slice when it falls very far behind alloc_counter.
@NickBarnes (Author) commented Nov 15, 2024

This still has a failing multicore test, in which we discover that orphaned ephemerons can have the wrong colour bits.

bash-3.2$ OCAMLRUNPARAM="s=4096" _build/default/src/ephemeron/lin_tests.exe -v -s 321821727
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 321821727
generated error fail pass / total     time test name
[ ]   24    0    0   24 / 1000     0.1s Lin Ephemeron stress test with Domain[02] file runtime/major_gc.c; line 402 ### Assertion failed: Has_status_val(v, status)
Trace/BPT trap: 5
bash-3.2$
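For context, the failing assertion checks that a value's header carries the colour/status bits the caller expects. A rough illustration of that kind of check follows; the header layout and macro names are assumptions for illustration, not the runtime's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t header_t;

/* Assumed layout: two status/colour bits somewhere in the header. */
#define STATUS_SHIFT        8
#define STATUS_MASK         ((header_t)3 << STATUS_SHIFT)
#define Status_of(hd)       ((hd) & STATUS_MASK)
#define Has_status(hd, s)   (Status_of(hd) == (s))

/* The debug runtime asserts something of this shape; per the comment
   above, orphaned ephemerons can end up with the wrong colour bits
   under this PR, so the check fails. */
static void check_status(header_t hd, header_t expected_status)
{
  assert(Has_status(hd, expected_status));
}
```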

So still not ready for review.
