Resolve kafka consumer dropped traces errors in Datadog #736

Closed
4 of 6 tasks
robrap opened this issue Jul 26, 2024 · 13 comments

Contributor

robrap commented Jul 26, 2024

I noticed that the following error in the logs seems to be hitting our Kafka consumers:
"failed to send, dropping 1 traces to intake at unix:///var/run/datadog/apm.socket/v0.5/traces after 3 retries". It may also be hitting some other workers, but I'm not sure whether that is real or just inconsistent naming on our side.

I'm wondering if this has anything to do with the long-running infinite loop on the consumers, and if we need to clean up the trace, like we clean up the db connection, etc.?
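
To make the question above concrete, here is a minimal sketch of what per-message trace cleanup in a long-running consumer loop could look like. It only illustrates the hypothesis; nothing in this thread confirms it as the cause or the fix. `RecordingTracer`, `consume`, `handler`, and the span name `consume_message` are hypothetical stand-ins, not the ddtrace API and not the actual consumer code.

```python
from contextlib import contextmanager


class RecordingTracer:
    """Hypothetical stand-in for a tracing client (NOT the ddtrace API)."""

    def __init__(self):
        self.open_spans = []  # spans started but not yet finished
        self.finished = []    # spans that were closed and could be flushed

    @contextmanager
    def trace(self, name):
        self.open_spans.append(name)
        try:
            yield name
        finally:
            # Always close the span, even if the handler raised.
            self.open_spans.remove(name)
            self.finished.append(name)


def consume(messages, tracer, handle):
    """Process messages, scoping one span to each message.

    The point is that no span stays open for the lifetime of the loop,
    similar to closing a DB connection after each unit of work.
    """
    for message in messages:
        with tracer.trace("consume_message"):
            try:
                handle(message)
            except Exception:
                # Assumption for this sketch: a failed message must not stop the loop.
                pass


def handler(message):
    if message == "bad":
        raise ValueError(message)


tracer = RecordingTracer()
consume(["a", "bad", "c"], tracer, handler)
```

After the loop finishes, `tracer.open_spans` is empty and each message has produced its own finished span, including the one whose handler raised.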

A/C:

  • Confirm with DD Support that this is actually important to fix and doesn't just represent an expected level of failed trace sends
    • They should also be able to give us some debugging pointers
  • Either document as wontfix, or fix it
    • Confirm fix (on or after Sept 13th)
    • Communicate fix for non-edxapp workers.
    • Confirm if edxapp workers get fixed in Nov 2024 (SRE believes this to be fixed)
    • Maybe update thread once edxapp workers have been fixed.
@robrap robrap added this to Arch-BOM Jul 26, 2024
@robrap robrap converted this from a draft issue Jul 26, 2024
@robrap robrap changed the title from "Resolve consumer worker dropped traces errors in Datadog" to "Resolve kafka consumer dropped traces errors in Datadog" Jul 26, 2024
@jristau1984 jristau1984 moved this to Prioritized in Arch-BOM Jul 29, 2024
@robrap robrap moved this from Prioritized to Backlog in Arch-BOM Jul 29, 2024
@timmc-edx timmc-edx moved this from Backlog to Ready For Development in Arch-BOM Aug 12, 2024
@robrap robrap self-assigned this Aug 12, 2024
Contributor Author

robrap commented Aug 12, 2024

Contributor Author

robrap commented Aug 13, 2024

After thinking about how to respond to the DD ticket, and looking at logs more closely, I decided to open an SRE ticket for further investigation: https://2u-internal.atlassian.net/browse/GSRE-1988.

Contributor Author

robrap commented Aug 14, 2024

I tried looking in AWS a bit, but this really needs SRE support for now. Marking as blocked and I'll check in on the GSRE ticket in a few weeks.

Contributor Author

robrap commented Aug 26, 2024

[update] Blocked on the GSRE ticket, which was picked up on Aug 22, but no comments have been added yet.

Contributor Author

robrap commented Sep 6, 2024

I will confirm that this has actually gone away.

Contributor Author

robrap commented Sep 6, 2024

  • The fix was deployed on Sept 5th @ 10:00am ET, and the GSRE ticket was closed.
  • I'd like to leave this blocked for at least a week (until Sept 13th), and then use the original search to confirm that the issue has gone away. So far so good, but it has only been a day.

Todo:

  • Confirm fix.
  • Communicate the fix, in case anyone was aware of the issue.

@robrap robrap moved this from In Progress to Blocked in Arch-BOM Sep 6, 2024
Contributor Author

robrap commented Sep 25, 2024

Contributor Author

robrap commented Sep 25, 2024

Slack announcement of the initial fix.

Contributor Author

robrap commented Oct 2, 2024

Note: The original GSRE comment from Sept 26 has not had a response yet. Posted a reminder today (Oct 2).

Contributor Author

robrap commented Oct 4, 2024

SRE added a fix for edxapp workers on the morning of Oct 4. There were 2-week gaps between occurrences of the issue, so I'll leave this blocked and wait until Nov to confirm.

Contributor Author

robrap commented Oct 10, 2024

Another spike on Oct-10, but this was for the k8s servers.

  • Nov is still a good time to confirm the edxapp workers fix.
  • But, we need to check in on the GSRE ticket to see if there will be a new change to review.

Contributor Author

robrap commented Oct 11, 2024

The Oct-10 spike was because the Datadog agent was restarted for other purposes (SRE working on log parsing issues).

Contributor Author

robrap commented Nov 12, 2024

There were 2 spikes in the last 15 days, presumably from when the DD Agent pod was restarted. I'm going to close out this ticket and call this done.

@robrap robrap closed this as completed Nov 12, 2024
@github-project-automation github-project-automation bot moved this from Blocked to Done in Arch-BOM Nov 12, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Dec 3, 2024
Labels
None yet
Projects
Status: Done - Long Term Storage
Development

No branches or pull requests

1 participant