Debug bootstrap for intermittent test in periodic job scheduler #271

Closed

Conversation

@brandur (Contributor) commented Mar 13, 2024

This one's a special CI and logging setup aimed at helping me debug #215,
which I'm finding practically impossible to reproduce locally, but which
happens with reasonable frequency in CI. Not intended for merge.
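For a sense of what the logging half of such a bootstrap looks like, here's a minimal, self-contained sketch of a timing probe; none of the names or structure below come from the project's actual code. The idea is just to log, on every scheduler tick, how far each periodic job is from being ready, so CI output exposes how the 500ms and 1500ms jobs in the test drift apart over many iterations.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// Debug-level logging so the probe output shows up in verbose CI logs.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

	start := time.Now()
	// Illustrative stand-ins for the two periodic jobs in the test case.
	nextRunAt := map[string]time.Time{
		"every_500ms":  start.Add(500 * time.Millisecond),
		"every_1500ms": start.Add(1500 * time.Millisecond),
	}

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		now := <-ticker.C
		// Log how far each job is from its next ready time on this tick.
		for name, at := range nextRunAt {
			logger.Debug("periodic enqueuer tick",
				slog.String("job", name),
				slog.Duration("until_ready", at.Sub(now)),
			)
		}
	}
}
```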

@brandur brandur force-pushed the brandur-debug-build-for-intermittency branch 2 times, most recently from 6f1d2ad to 2562ce2 Compare March 17, 2024 00:07
@brandur brandur force-pushed the brandur-debug-build-for-intermittency branch from 2562ce2 to 0180bb2 Compare March 17, 2024 00:16
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin"

This one's aimed at fixing the intermittent test described in #215. By
reading some additional logging probes in #271, I've been able to see
that the failures occur when the two jobs in the test case, 500ms and
1500ms, have their ready times drift ever so slightly apart across run
iterations. Given enough test runs, they can diverge enough that both
jobs aren't ready on the last check, where they're supposed to sync up.

There are a few potential ways to solve this one, most involving
rewriting the test, but here I'm proposing that we solve it by
increasing the "now margin" in the enqueuer from 10ms to 100ms. This is
a small margin applied on each run loop iteration when looking for jobs
that are not quite ready, but so close to ready that we enqueue them
anyway. The original choice of 10ms was somewhat arbitrary, and 100ms is
not a substantially larger number and still makes sense, so I think this
is a reasonable resolution.

I verified the fix works by pushing it up to the repro bootstrap in #271,
which runs with `-race` and a high iteration count in the slower GitHub
CI environment. It moves the CI matrix from failing on every job on
every run to succeeding on every job on every run.
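To make the timing concrete, here's a minimal, self-contained Go sketch of the "now margin" idea. It is not the enqueuer's actual code and every name in it is illustrative: with jobs due every 500ms and 1500ms, both should be enqueued together on the check at 1500ms, but if one ready time has drifted ~20ms late, a 10ms margin misses it while a 100ms margin absorbs the drift.

```go
package main

import (
	"fmt"
	"time"
)

// periodicJob is an illustrative stand-in for a scheduled periodic job; the
// type and field names are not taken from the project's code.
type periodicJob struct {
	name      string
	interval  time.Duration
	nextRunAt time.Time
}

// nowMargin treats jobs that are almost ready as ready. The fix described
// above widens it from 10ms to 100ms so that small per-iteration drift can't
// split up jobs that are supposed to become ready together.
const nowMargin = 100 * time.Millisecond

// readyJobs returns every job whose next run time is at or before now plus
// nowMargin, and advances each returned job by its interval.
func readyJobs(jobs []*periodicJob, now time.Time) []*periodicJob {
	horizon := now.Add(nowMargin)
	var ready []*periodicJob
	for _, job := range jobs {
		if !job.nextRunAt.After(horizon) {
			ready = append(ready, job)
			job.nextRunAt = job.nextRunAt.Add(job.interval)
		}
	}
	return ready
}

func main() {
	start := time.Now()

	// State right before the final check in the test scenario: both jobs
	// should be due at start+1500ms, but the 1500ms job's ready time has
	// drifted 20ms late across iterations.
	jobs := []*periodicJob{
		{name: "every_500ms", interval: 500 * time.Millisecond, nextRunAt: start.Add(1500 * time.Millisecond)},
		{name: "every_1500ms", interval: 1500 * time.Millisecond, nextRunAt: start.Add(1520 * time.Millisecond)},
	}

	// With a 10ms margin only every_500ms would be returned here; with the
	// 100ms margin both jobs are enqueued together, as the test expects.
	for _, job := range readyJobs(jobs, start.Add(1500*time.Millisecond)) {
		fmt.Println("enqueue:", job.name)
	}
}
```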
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin"
@brandur (Contributor, Author) commented Mar 17, 2024

Opened #274 with a fix.

@brandur brandur closed this Mar 17, 2024
@brandur brandur deleted the brandur-debug-build-for-intermittency branch March 17, 2024 00:47
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin" (#274)

Fixes #215.