[ADAP-498] [Bug] BQ does not retry on 503 #682

Closed
2 tasks done
barberscott opened this issue Apr 26, 2023 · 15 comments · Fixed by #1224 or #1408
Labels
bug Something isn't working community This PR is from a community member good_first_issue Good for newcomers

Comments

@barberscott

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently, if BigQuery returns a 503 error, we do not retry, even though BigQuery recommends retrying as the course of action.

Expected Behavior

This is not a regression but rather an oversight: 503 errors should be both retryable and reopenable, since they indicate a transient unavailability condition in BigQuery.

Steps To Reproduce

Transient -- requires intermittent error from BQ.

Relevant log output

No response

Environment

- dbt-core: all 
- dbt-bigquery: all

Additional Context

No response

@barberscott barberscott added bug Something isn't working triage labels Apr 26, 2023
@barberscott barberscott changed the title [Bug] BQ does not retry on 503 [Bug] dbt-bigquery does not retry on 503 Apr 26, 2023
@github-actions github-actions bot changed the title [Bug] dbt-bigquery does not retry on 503 [ADAP-498] [Bug] BQ does not retry on 503 Apr 26, 2023
@dbeatty10
Contributor

Thanks for reaching out @barberscott !

We'll put this in our queue.

The solution might be as simple as adding google.cloud.exceptions.ServiceUnavailable to the list here:

RETRYABLE_ERRORS = (
    google.cloud.exceptions.ServerError,
    google.cloud.exceptions.BadRequest,
    google.cloud.exceptions.BadGateway,
    ConnectionResetError,
    ConnectionError,
)

@dbeatty10 dbeatty10 removed the triage label Apr 27, 2023
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 25, 2023
Contributor

github-actions bot commented Nov 1, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2023
@dbeatty10 dbeatty10 reopened this Nov 1, 2023
@dbeatty10 dbeatty10 added good_first_issue Good for newcomers and removed Stale labels Nov 1, 2023
@jx2lee
Contributor

jx2lee commented Dec 18, 2023

@dbeatty10
I created a ServiceUnavailable instance and ran the test code (test_is_retryable).

Current state: ServiceUnavailable has not been added to RETRYABLE_ERRORS.
Result: Test passed.

def test_is_retryable(self):
    _is_retryable = dbt.adapters.bigquery.connections._is_retryable
    exceptions = dbt.adapters.bigquery.impl.google.cloud.exceptions
    internal_server_error = exceptions.InternalServerError("Code Abort")
    bad_request_error = exceptions.BadRequest("Code is broken")
    connection_error = ConnectionError("Code broke")
    client_error = exceptions.ClientError("Invalid code")
    rate_limit_error = exceptions.Forbidden(
        "Code is broken", errors=[{"reason": "rateLimitExceeded"}]
    )
    # add service_unavailable_error
    service_unavailable_error = exceptions.ServiceUnavailable("Code is broken")

    self.assertTrue(_is_retryable(internal_server_error))
    self.assertTrue(_is_retryable(bad_request_error))
    self.assertTrue(_is_retryable(connection_error))
    self.assertFalse(_is_retryable(client_error))
    self.assertTrue(_is_retryable(rate_limit_error))
    # the assertion below passed
    self.assertTrue(_is_retryable(service_unavailable_error))

RETRYABLE_ERRORS = (
    google.cloud.exceptions.ServerError,
    google.cloud.exceptions.BadRequest,
    google.cloud.exceptions.BadGateway,
    ConnectionResetError,
    ConnectionError,
)

The ServiceUnavailable class inherits from the ServerError class, which seems to be why the test above passes.
I'd like to fix this, but is there anything else I should look at? 🙏
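The inheritance argument can be illustrated with a minimal sketch. The classes below are stand-ins that only mirror the relationship described in this comment (they are not the real google.cloud.exceptions classes), and `is_retryable` is a simplified, hypothetical version of the check that ignores the rate-limit case.

```python
# Stand-in classes mirroring the inheritance described in the comment.
# They are NOT the real google.cloud.exceptions classes.
class ServerError(Exception):
    """Stand-in for the 5xx base class."""


class ServiceUnavailable(ServerError):
    """Stand-in for the 503 error class."""


class ClientError(Exception):
    """Stand-in for the 4xx base class."""


# Simplified tuple: ServiceUnavailable is deliberately not listed.
RETRYABLE_ERRORS = (ServerError, ConnectionError)


def is_retryable(error: Exception) -> bool:
    # isinstance() with a tuple also matches subclasses of any listed class,
    # so a ServiceUnavailable instance matches through ServerError.
    return isinstance(error, RETRYABLE_ERRORS)


print(is_retryable(ServiceUnavailable("503")))
print(is_retryable(ClientError("400")))
```

Under these assumptions, listing ServiceUnavailable explicitly would not change the result of the check, which is consistent with the test passing without the addition.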

@dbeatty10
Contributor

Adding it to the test_is_retryable test like that makes sense 👍

But ... the thing that is surprising to me: if ServiceUnavailable inherits from ServerError and your modified test passes, then why is this not being retried?

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

@jx2lee Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

@jx2lee
Contributor

jx2lee commented Dec 24, 2023

@dbeatty10

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

No, I don't think that's possible.
Error classes are created by the from_http_status and from_grpc_status functions (google.api_core.exceptions). For a 503, the error class these functions generate is always "ServiceUnavailable".



Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

I have never run into this issue myself...🙃
I would need more detailed logs from when it happened.

IMO, if the issue reporter can't provide more error logs, I think it's okay to close the issue.

  • A 503 code does not produce any error class other than ServiceUnavailable
  • The functions that raise errors in the googleapis package only generate ServiceUnavailable for it
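The reasoning in the two points above is that a status code maps deterministically to one error class. The following is a rough sketch of that idea only. The classes are stand-ins, and `error_for_status` and `_CODE_TO_CLASS` are hypothetical names rather than the google.api_core implementation.

```python
# Stand-in hierarchy; NOT the real google.api_core.exceptions classes.
class GoogleAPICallError(Exception):
    code = None


class ServerError(GoogleAPICallError):
    """Stand-in base class for 5xx errors."""


class BadGateway(ServerError):
    code = 502


class ServiceUnavailable(ServerError):
    code = 503


# Hypothetical lookup table: one class per status code.
_CODE_TO_CLASS = {cls.code: cls for cls in (BadGateway, ServiceUnavailable)}


def error_for_status(status_code: int, message: str) -> GoogleAPICallError:
    # Unknown codes fall back to the generic base class in this sketch.
    error_class = _CODE_TO_CLASS.get(status_code, GoogleAPICallError)
    return error_class(message)


error = error_for_status(503, "The service is currently unavailable.")
print(type(error).__name__)
print(isinstance(error, ServerError))
```

If the real mapping behaves like this, a 503 that goes through it always becomes a ServerError subclass. The sketch says nothing about errors raised by code that bypasses the mapping.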

@jx2lee
Contributor

jx2lee commented Apr 22, 2024

@dbeatty10
Is there anything else I should check?

@rrbarbosa

We did hit this recently. We use external-tables in an on-run-start macro. We also use service account impersonation in the dbt profile. While running dbt docs generate in a CI environment we got:

('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "Authentication backend unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')

Because this happens intermittently on an isolated system, I don't have more logs.

@dbeatty10
Contributor

Thanks for this report @rrbarbosa !

Since this is intermittent (and maybe relatively rare also), it has been hard to nail down.

If anyone can provide information to suggest that dbt is not retrying at least once, that would be very helpful 🙏

@dbeatty10
Contributor

@jx2lee -- would you be willing to raise a PR with the addition you made to this test case?

I think that would be sufficient for us to establish that the ServiceUnavailable is retryable (which would allow us to close this issue).

@jx2lee
Contributor

jx2lee commented May 2, 2024

@dbeatty10 Okay, I will create a PR that includes the test code above soon!

@jx2lee
Contributor

jx2lee commented May 4, 2024

@dbeatty10
I created the PR! Could you edit the PR body or add a comment to make it easier for reviewers to understand?

@OSalama

OSalama commented Jun 5, 2024

I'm not sure if this is the same code path, but we are seeing a problem with Dataproc jobs (Python models) submitted by dbt. dbt successfully submits the batch job. Then, during the polling in dbt-labs/dbt-bigquery/dbt/adapters/bigquery/dataproc/batch.py#poll_batch_job, one of the polling calls returns a 503 that is presumably not retried. dbt marks the model as failed, even though the Dataproc job is still running in the background and eventually completes successfully.

00:25:50  BigQuery adapter: Submitting batch job with id: 5f6d87c9-4045-4208-8941-03fbb8facf30
00:29:58  Unhandled error while executing target/run/core/models/working_tables/WT_rfm_status.py
503 502:Bad Gateway
00:29:58  58 of 63 ERROR creating python table model working_tables.WT_rfm_status ........ ERROR in 248.55s

We have seen the issue twice in a week, and we are running dbt-bigquery 1.8.1.
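For illustration, a polling loop that tolerates a transient 503 could look roughly like the sketch below. This is not the adapter's poll_batch_job. `poll_until_done`, its parameters, the `ServiceUnavailable` stand-in, and the state strings are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


class ServiceUnavailable(Exception):
    """Stand-in for a transient 503 error raised by a polling call."""


def poll_until_done(
    get_state: Callable[[], str],
    max_transient_errors: int = 3,
    delay_seconds: float = 1.0,
) -> str:
    """Poll until a terminal state, tolerating consecutive transient errors."""
    consecutive_errors = 0
    while True:
        try:
            state = get_state()
        except ServiceUnavailable:
            consecutive_errors += 1
            if consecutive_errors > max_transient_errors:
                raise  # give up only after repeated failures
            time.sleep(delay_seconds)
            continue
        consecutive_errors = 0  # a successful poll resets the counter
        if state in ("SUCCEEDED", "FAILED"):  # illustrative terminal states
            return state
        time.sleep(delay_seconds)
```

The design choice here is to treat a failed polling call as "state unknown" rather than "job failed", so that a job still running in the background is not reported as an error after a single 503.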

@mkielar

mkielar commented Jul 10, 2024

Got hit by this issue today, while generating "seed" tables with DBT running in CloudBuild:

"Step #7 - "dbt-seed": ('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "The service is currently unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')"

We're using impersonation with dbt-bigquery, and it seems IAM was unavailable for a moment. We have no explicit retry configured, so, per the docs, it should retry once, but I see no such thing in the logs.
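One way to read the logged message: it has the shape that Python produces when an exception is constructed with two arguments (a text and a JSON body). The sketch below reproduces that shape with a stand-in class. `CredentialRefreshErrorStandIn` is a hypothetical name, and the assumption that the real error sits outside the ServerError hierarchy is not confirmed anywhere in this thread.

```python
import json


class ServerError(Exception):
    """Stand-in for the 5xx base class used in the retry check."""


class CredentialRefreshErrorStandIn(Exception):
    """Hypothetical stand-in for an error raised by the credentials layer."""


RETRYABLE_ERRORS = (ServerError, ConnectionError)

body = (
    '{\n  "error": {\n    "code": 503,\n'
    '    "message": "The service is currently unavailable.",\n'
    '    "status": "UNAVAILABLE"\n  }\n}\n'
)
error = CredentialRefreshErrorStandIn(
    "Unable to acquire impersonated credentials", body
)

# An exception with two arguments is rendered as a tuple, as in the log above.
print(str(error))

# Under the stated assumption, a class-based check does not match this error,
# even though the embedded payload carries a 503.
print(isinstance(error, RETRYABLE_ERRORS))
print(json.loads(error.args[1])["error"]["code"])
```

If that assumption holds, the 503 is visible only inside the message payload, so a check based on exception classes alone would not treat it as retryable. A stack trace from an affected run would be needed to confirm which class is actually raised.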

@colin-rogers-dbt colin-rogers-dbt added the community This PR is from a community member label Oct 11, 2024
@mikealfare
Contributor

GH closed this because an attached PR was merged. I think there is more to this, so I'm leaving it open.

8 participants