Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure Django errors are represented in Error Tracking #647

Closed
4 tasks
Tracked by #706
robrap opened this issue May 20, 2024 · 15 comments
Closed
4 tasks
Tracked by #706

Ensure Django errors are represented in Error Tracking #647

robrap opened this issue May 20, 2024 · 15 comments
Assignees

Comments

@robrap
Copy link
Contributor

robrap commented May 20, 2024

Django views that raise exceptions are not showing up in Datadog's "Error Tracking" feature. This is because the service entry span is not being annotated with error.stack, error.message, and error.type span tags, which are all required for that feature. It appears this is just a gap in ddtrace's Django support.

Acceptance Criteria:

  • Ensure that when a view raises an exception, it is represented in the Error Tracking page.
  • Announce the fix to teams.
  • Optional: Create ticket for ensuring that caught exceptions can also be sent to Error Tracking (possibly a modification to our implementation of record_exception in edx-django-utils' DatadogBackend).
  • Optional: Create ticket for observability of non-exception-based errors (such as 404, 403, etc.) See notes below re: DRF.

Implementation details:

  • We likely want to modify edx-django-utils in the following ways:
    • Modify (or create a new) method in TelemetryBackend to attach error information to the root span.
    • Modify (or create a new) middleware with a process_exception method for detecting these errors and passing it to the telemetry backends.

Notes:

  • Private ticket with Datadog: https://help.datadoghq.com/hc/en-us/requests/1707057
  • We originally thought this was related to DRF, but it appears to affect DRF and non-DRF views alike. However, here's a link to our DRF exception handler code (where we override DRF's exception handler in edx-platform) in case it turns out to be relevant. Docs: https://www.django-rest-framework.org/api-guide/exceptions/
    • This might happen if we want to do something with NotFound exceptions that have been converted to not-found 404 responses by DRF and that would not be raised by the view anyhow.
    • It may also prove to be difficult since there's not an obvious way to chain DRF exception handlers, and we already have one registered.
  • We're not entirely sure what criteria New Relic uses for exposing a transaction in its Error Inbox.
@robrap robrap added this to Arch-BOM May 20, 2024
@robrap robrap converted this from a draft issue May 20, 2024
@robrap robrap changed the title Good enough handling of DRF errors in Datadog Fix handling of DRF errors in Datadog May 20, 2024
@robrap
Copy link
Contributor Author

robrap commented May 20, 2024

Started a thread on Slack with DD:

We noticed together in the Office Hours that some of the errors did not have error.type and error.stack. You mentioned https://docs.datadoghq.com/tracing/error_tracking/#use-span-tags-to-track-error-spans (I think), or maybe some other page. I think @timmc-edx may be creating a DD support ticket around this, because it seems like this should all be handled by the ddtrace default instrumentation, but let us know if there is something else you think we're supposed to be doing. Thank you.

Here is the existing DD support ticket that mentions this issue, and needs to be spun off into a separate support ticket.

@robrap
Copy link
Contributor Author

robrap commented May 20, 2024

I'm not clear if #625 will be more important once we fix all these issues, or only if we do alerting of Error Tracking issues.

@robrap
Copy link
Contributor Author

robrap commented May 20, 2024

From the Slack thread noted above:

@robrap
Copy link
Contributor Author

robrap commented May 20, 2024

There's more updates on the Slack thread (so please read it in addition to these comments). It looks like Error Tracking may only work for Service Entry span errors.

@timmc-edx timmc-edx moved this from Prioritized to In Progress in Arch-BOM May 21, 2024
@timmc-edx timmc-edx self-assigned this May 21, 2024
@timmc-edx
Copy link
Member

Comparing NR and DD error listings...

  • In DD's Error Tracking we see almost nothing for prod LMS. For the past hour there's just builtins.RecursionError and builtins.AttributeError -- both, absurdly, with count=0.
  • In NR's Error Inbox the most common error is rest_framework.exceptions:ValidationError in openedx.core.djangoapps.embargo.views:CheckCourseAccessView.get. Most of the errors are in DRF endpoints, but not all—a smaller count are in social_django.views:complete or other endpoints in that package. The issues with Error Tracking may not be specific to DRF.

Looking for errors in DD that might not be handled properly:

  • In Datadog when I query for env:prod service:edx-edxapp-lms resource_name:"GET openedx.core.djangoapps.embargo.views.CheckCourseAccessView" I see a mix of 200 OK and 400 Bad Request. These actually do have full error information on the view span, including a stack trace and error.message=[ErrorDetail(string='Missing parameters', code='invalid')]. Missing error info isn't the problem here. (These also aren't necessarily errors on our part, and are probably just that, Bad Requests.)
  • I can more broadly find traces with a suppressed error by looking for env:prod service:edx-edxapp-lms operation_name:django.request status:ok (service entry spans) && status:error (all spans). I can't seem to get table/list grouping, though. (Server error?) But they all seem to be 401, 403, and 404 responses. It's not apparent to me that these should appear in the Error Tracking page.
  • 5xx responses are unambiguous, but env:prod service:edx-edxapp-lms @http.status_code:5* status:ok doesn't turn anything up.

Basically, I'm having trouble seeing the problem here.

@robrap
Copy link
Contributor Author

robrap commented May 22, 2024

@timmc-edx:

Basically, I'm having trouble seeing the problem here.

I don't understand this statement. I think your comments comparing DD and NR make the problem clear, right?

Are you trying to say you don't see what the solution is? Or you don't see what the root cause is? Or you don't see why more errors aren't showing up under Error Tracking? Or are you trying to say that New Relic has the problem, and we shouldn't be seeing these errors in DD? Or what did you mean? :)

Once you are more clear on the problem, can you spin off an appropriate DD support ticket and have DD help us understand how to resolve? Thank you.

@timmc-edx
Copy link
Member

Well, the comparison certainly shows that NR's Error Inbox and DD's Error Tracking don't show the same thing. But I don't think that Datadog is wrong here, it's just showing a different selection. And that selection seems reasonable to me—operationally, I mostly don't care about 4xx errors unless there's a surge, but for 5xx errors I want to know about everyone of them.

@robrap
Copy link
Contributor Author

robrap commented May 23, 2024

Thank you for clarifying. You are welcome to close this if you are satisfied, but I’ve got some thoughts that could relate to spinning off tickets:

  1. Do you think a small note of the difference is warranted in the bow-to use DD doc?
  2. DD has a recommended error monitor based on issues in the Error Inbox. How do your findings affect how we should monitor (and possibly name the monitors) related to errors?

@robrap
Copy link
Contributor Author

robrap commented May 23, 2024

Additionally, do we know how to promote an error to the error inbox if we wanted to, and is that useful to have documented?

@timmc-edx
Copy link
Member

timmc-edx commented May 23, 2024

  • Yeah, it's probably worth a note in the migration or other how-to doc. I'll do so.
  • I think we should continue monitoring largely as we have been.
  • I do want to do another comparison but for 5xx errors specifically, to ensure they're getting marked appropriately. (I feel like there are some missing from Error Tracking but I need to confirm.)
  • I think we can promote something to Error Tracking by setting all of the appropriate tags on the root span (error.message etc.)

@timmc-edx
Copy link
Member

timmc-edx commented May 23, 2024

Spot-checking 5xx from prod-lms in DD, I found https://app.datadoghq.com/apm/trace/664f66d200000000830052761f18d571 -- GET lms.djangoapps.courseware.views.views.jump_to_id where the root span is marked as an error but says "Missing error message and stack trace". I don't see any error properties on it. The actual view span does have a proper error: xmodule.modulestore.exceptions.ItemNotFoundError, and error.message, and stack trace. (None of the intervening middleware spans is marked as in error.) It doesn't show up in Error Tracking. This is not a DRF view.

EDIT: Tracking in https://help.datadoghq.com/hc/en-us/requests/1707057

@timmc-edx
Copy link
Member

I think what's going on here is that when a Django view raises an exception, Datadog is failing to annotate the service entry span with error stack, message, and type. We might need to add a middleware as a workaround until they can fill that gap.

@robrap
Copy link
Contributor Author

robrap commented May 28, 2024

@timmc-edx: Note that exception handling for DRF might need to be covered in a DRF handler, and not via the middleware. I'm not sure if we actually have a problem with DRF and non-DRF? See our ignored error handling code in edx-platform for examples of the handler I am referring to.

@timmc-edx
Copy link
Member

Here's an example of a DRF view that has the same problem: https://app.datadoghq.com/apm/trace/6656276100000000c495bf4489c97e42 (view raises exception, view span has appropriate error tags, root span is a 5xx with status:error but no error tags).

So I think this actually isn't a DRF issue at all. Probably we can work around this by adding a custom middleware (or adjusting an existing one).

@robrap robrap changed the title Fix handling of DRF errors in Datadog Fix handling of Django and DRF errors in Datadog May 28, 2024
@robrap
Copy link
Contributor Author

robrap commented May 28, 2024

[inform] See openedx/edx-platform#26980 for an example of how we might need to handle exceptions for DRF and Django in two ways (through middleware and through a DRF exception handler).

@timmc-edx timmc-edx changed the title Fix handling of Django and DRF errors in Datadog Ensure Django errors are represented in Error Tracking May 28, 2024
@robrap robrap moved this from In Progress to Groomed in Arch-BOM Jun 10, 2024
@timmc-edx timmc-edx moved this from Groomed to In Progress in Arch-BOM Jul 17, 2024
@timmc-edx timmc-edx moved this from In Progress to Groomed in Arch-BOM Jul 17, 2024
@dianakhuang dianakhuang self-assigned this Aug 22, 2024
dianakhuang added a commit to openedx/edx-django-utils that referenced this issue Sep 27, 2024
Datadog has issues with tagging the root span
with error information. This change adds the functionality
to tag the root span on exceptions.

This also renames the CachedCustomMonitoringMiddleware into
MonitoringSupportMiddleware so that it can be a central place
for most monitoring middleware changes instead of adding a new
one every time. At some point, CachedCustomMonitoringMiddleware
will be removed.

edx/edx-arch-experiments#647
dianakhuang added a commit to openedx/edx-django-utils that referenced this issue Sep 27, 2024
Datadog has issues with tagging the root span
with error information. This change adds the functionality
to tag the root span on exceptions.

This also renames the CachedCustomMonitoringMiddleware into
MonitoringSupportMiddleware so that it can be a central place
for most monitoring middleware changes instead of adding a new
one every time. At some point, CachedCustomMonitoringMiddleware
will be removed.

edx/edx-arch-experiments#647
dianakhuang added a commit to openedx/edx-django-utils that referenced this issue Oct 1, 2024
Datadog has issues with tagging the root span
with error information. This change adds the functionality
to tag the root span on exceptions.

This also renames the CachedCustomMonitoringMiddleware into
MonitoringSupportMiddleware so that it can be a central place
for most monitoring middleware changes instead of adding a new
one every time. At some point, CachedCustomMonitoringMiddleware
will be removed.

edx/edx-arch-experiments#647
@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Oct 7, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done - Long Term Storage
Development

No branches or pull requests

3 participants