Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry wait for result independently from job creation #1042

Closed
wants to merge 1 commit into from

Conversation

github-christophe-oudar
Copy link
Contributor

@github-christophe-oudar github-christophe-oudar commented Dec 5, 2023

resolves #1045

Problem

When an error occurs either on job creation or waiting for the result, the job creation + wait result step is retried.
Then the underlying wait for result step might "fail" (as it's polling for the result every X seconds) and a network error... can lead to retry the whole job.
If the job isn't idempotent => it leads to a bug (what happened for a coworker).
if the job is idempotent => you likely wasted slot time/BQ resources.

Solution

To solve that, let's split the step in 2 functions that are both retried on their own so that we retry accessing the running job.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@cla-bot cla-bot bot added the cla:yes label Dec 5, 2023
@@ -787,7 +789,7 @@ def reopen_conn_on_error(error):
target=fn,
predicate=_ErrorCounter(self.get_job_retries(conn)).count_error,
sleep_generator=self._retry_generator(),
deadline=self.get_job_retry_deadline_seconds(conn),
timeout=self.get_job_retry_deadline_seconds(conn),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just that the previous field is deprecated in the driver, it's the same behavior

@Kayrnt Kayrnt force-pushed the fix-retry-job branch 4 times, most recently from 36051fb to ccb141d Compare December 6, 2023 16:53
@github-christophe-oudar github-christophe-oudar marked this pull request as ready for review December 6, 2023 16:57
@github-christophe-oudar github-christophe-oudar requested a review from a team as a code owner December 6, 2023 16:57
@McKnight-42
Copy link
Contributor

@github-christophe-oudar wanted to point you to this open PR I have as I think these are related #977

@github-christophe-oudar
Copy link
Contributor Author

@McKnight-42 thank for pointing it out 👍
That solution looks interesting too but I wonder if my approach wouldn't be simpler to fix that problem.
Wouldn't the query fail if you provide an existing running job id?
I see that PR is stale for 1 month, what's the status?
That would be great to move forward on a solution that properly retry on the job status as it looks like I'm not the single person affected.

@McKnight-42
Copy link
Contributor

@github-christophe-oudar The ticket is still being worked on. I did have to set it aside for a bit due to some other work and traveling but it's on my board. I plan to set some time to dig into these pr's and problems a little more over the next few days and will keep you updated.

@McKnight-42 McKnight-42 self-requested a review December 6, 2023 21:39
@github-christophe-oudar
Copy link
Contributor Author

Ok, great to know!
I think both approaches could be combined.
The sooner we can deal with that bug, the better 🙏

Copy link
Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 15, 2024
Copy link
Contributor

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this Jul 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ADAP-1062] [Bug] Retries on wait for result step is recreating the whole job
3 participants