Debug bootstrap for intermittent test in periodic job scheduler #271

Closed

Conversation

@brandur (Contributor) commented Mar 13, 2024

This one's a special CI and logging setup aimed at helping me debug #215,
which I'm finding practically impossible to reproduce locally, but which
happens with reasonable frequency in CI. Not intended for merge.
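For a sense of what the logging half of such a bootstrap looks like, here's a minimal, self-contained sketch of a timing probe; none of the names or structure below come from the project's actual code. The idea is just to log, on every scheduler tick, how far each periodic job is from being ready, so CI output exposes how the 500ms and 1500ms jobs in the test drift apart over many iterations.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// Debug-level logging so the probe output shows up in verbose CI logs.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))

	start := time.Now()
	// Illustrative stand-ins for the two periodic jobs in the test case.
	nextRunAt := map[string]time.Time{
		"every_500ms":  start.Add(500 * time.Millisecond),
		"every_1500ms": start.Add(1500 * time.Millisecond),
	}

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		now := <-ticker.C
		// Log how far each job is from its next ready time on this tick.
		for name, at := range nextRunAt {
			logger.Debug("periodic enqueuer tick",
				slog.String("job", name),
				slog.Duration("until_ready", at.Sub(now)),
			)
		}
	}
}
```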

@brandur brandur force-pushed the brandur-debug-build-for-intermittency branch 2 times, most recently from 6f1d2ad to 2562ce2 Compare March 17, 2024 00:07
@brandur brandur force-pushed the brandur-debug-build-for-intermittency branch from 2562ce2 to 0180bb2 Compare March 17, 2024 00:16
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin"

This one's aimed at fixing the intermittent test described in #215. By
reading some additional logging probes in #271, I've been able to see
that the failures occur when the two jobs in the test case, 500ms and
1500ms, have their ready times drift ever so slightly apart across run
iterations. Given enough test runs, they can diverge enough that both
jobs aren't ready on the last check, where they're supposed to sync up.

There are a few potential ways to solve this one, most involving
rewriting the test, but here I'm proposing that we solve it by
increasing the "now margin" in the enqueuer from 10ms to 100ms. This is
a small margin applied on each run loop iteration when looking for jobs
that are not quite ready, but so close to ready that we enqueue them
anyway. The original choice of 10ms was somewhat arbitrary, and 100ms is
not a substantially larger number and still makes sense, so I think this
is a reasonable resolution.

I verified the fix works by pushing it up to the repro bootstrap in #271,
which runs with `-race` and a high iteration count in the slower GitHub
CI environment. It moves the CI matrix from failing on every job on
every run to succeeding on every job on every run.
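To make the timing concrete, here's a minimal, self-contained Go sketch of the "now margin" idea. It is not the enqueuer's actual code and every name in it is illustrative: with jobs due every 500ms and 1500ms, both should be enqueued together on the check at 1500ms, but if one ready time has drifted ~20ms late, a 10ms margin misses it while a 100ms margin absorbs the drift.

```go
package main

import (
	"fmt"
	"time"
)

// periodicJob is an illustrative stand-in for a scheduled periodic job; the
// type and field names are not taken from the project's code.
type periodicJob struct {
	name      string
	interval  time.Duration
	nextRunAt time.Time
}

// nowMargin treats jobs that are almost ready as ready. The fix described
// above widens it from 10ms to 100ms so that small per-iteration drift can't
// split up jobs that are supposed to become ready together.
const nowMargin = 100 * time.Millisecond

// readyJobs returns every job whose next run time is at or before now plus
// nowMargin, and advances each returned job by its interval.
func readyJobs(jobs []*periodicJob, now time.Time) []*periodicJob {
	horizon := now.Add(nowMargin)
	var ready []*periodicJob
	for _, job := range jobs {
		if !job.nextRunAt.After(horizon) {
			ready = append(ready, job)
			job.nextRunAt = job.nextRunAt.Add(job.interval)
		}
	}
	return ready
}

func main() {
	start := time.Now()

	// State right before the final check in the test scenario: both jobs
	// should be due at start+1500ms, but the 1500ms job's ready time has
	// drifted 20ms late across iterations.
	jobs := []*periodicJob{
		{name: "every_500ms", interval: 500 * time.Millisecond, nextRunAt: start.Add(1500 * time.Millisecond)},
		{name: "every_1500ms", interval: 1500 * time.Millisecond, nextRunAt: start.Add(1520 * time.Millisecond)},
	}

	// With a 10ms margin only every_500ms would be returned here; with the
	// 100ms margin both jobs are enqueued together, as the test expects.
	for _, job := range readyJobs(jobs, start.Add(1500*time.Millisecond)) {
		fmt.Println("enqueue:", job.name)
	}
}
```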
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin"
@brandur (Contributor, Author) commented Mar 17, 2024

Opened #274 with a fix.

@brandur brandur closed this Mar 17, 2024
@brandur brandur deleted the brandur-debug-build-for-intermittency branch March 17, 2024 00:47
brandur added a commit that referenced this pull request Mar 17, 2024
…ady margin" (#274)

Fixes #215.