Resolve kafka consumer dropped traces errors in Datadog #736
Comments
Created ticket: https://help.datadoghq.com/hc/en-us/requests/1805067
After thinking about how to respond to the DD ticket, and looking at the logs more closely, I decided to open an SRE ticket for further investigation: https://2u-internal.atlassian.net/browse/GSRE-1988.
I tried looking in AWS a bit, but this really needs SRE support for now. Marking as blocked; I'll check in on the GSRE ticket in a few weeks.
[update] Blocked on the GSRE ticket, which was picked up on Aug 22, but no comments have been added yet.
I will confirm that this has actually gone away.
Todo:
Slack announce of initial fix.
Note: The original GSRE comment from Sept 26 has not had a response yet. Posted a reminder today, Oct 2.
SRE added a fix for edxapp workers on the morning of Oct 4. There were 2-week gaps between issues, so I'll leave this in blocked and wait until Nov to confirm.
Another spike on Oct-10, but this was for the k8s servers.
The Oct-10 spike was because the Datadog agent was restarted for other purposes (SRE working on log parsing issues).
There were 2 spikes in the last 15 days, presumably when the DD Agent pod gets rebooted. I'm going to close out this ticket and call this done.
I noticed that the error in the logs, "failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries", seems to be hitting our Kafka consumers. It may be hitting some other workers too, but I'm not sure whether we just have inconsistent naming.
I'm wondering if this has anything to do with the long-running infinite loop on the consumers, and if we need to clean up the trace, like we clean up the db connection, etc.?
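To make the cleanup idea concrete, here is a minimal sketch of a consumer loop that opens one short-lived span per message and always closes it, rather than keeping a single trace open for the life of the loop. This is not the actual consumer code: `RecordingTracer`, `consume`, and `handle` are hypothetical names, and the stand-in tracer only counts open spans so the behavior can be checked without a Datadog agent. Whether an unclosed trace is really what causes the dropped-trace errors is an assumption, not something confirmed in this ticket.

```python
from contextlib import contextmanager


class RecordingTracer:
    """Hypothetical stand-in for a real tracer; it only tracks span lifecycle."""

    def __init__(self):
        self.open_spans = 0
        self.finished = []

    @contextmanager
    def trace(self, name):
        self.open_spans += 1
        try:
            yield
        finally:
            # Runs even if the handler raises, so no span is left open.
            self.open_spans -= 1
            self.finished.append(name)


def consume(messages, handle, tracer):
    """Process each message inside its own span (hypothetical loop shape)."""
    for message in messages:
        try:
            with tracer.trace("consumer.process_message"):
                handle(message)
        except Exception:
            # Assumed policy for this sketch: skip the bad message and keep
            # consuming. Real code would log the failure here.
            continue


def handle(message):
    # Hypothetical handler that fails on one message.
    if message == "bad":
        raise ValueError(message)


tracer = RecordingTracer()
consume(["a", "bad", "c"], handle, tracer)
```

After the loop, `tracer.open_spans` is back to zero and three spans are recorded as finished, including the one whose handler raised. With a real tracer, the `with` block would wrap whatever span API the library provides.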
A/C: