Better error message when consumer not authorized for topic #226

Open
rgraber opened this issue Jan 26, 2024 · 5 comments
Labels
event-bus Work related to the Event Bus.

Comments

@rgraber
Contributor

rgraber commented Jan 26, 2024

When we adjusted ACLs for some Kafka topics, a consumer started failing with a misleading error message (Missing ce_type header on message, cannot determine signal) that caused us to think there was a malformed message at the start of the topic that was blocking consumption.

The real error (either Broker: Topic authorization failed or Group authorization failed) was buried in the context data; we should figure out how to surface that error instead. This might involve checking for a None offset or other error indicators before we try inspecting the message headers.
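A minimal sketch of the kind of pre-check suggested above, in case it helps the discussion. It assumes the polled message follows the confluent_kafka `Message` interface, where `error()` returns `None` for a normal record and an error object otherwise; this issue does not confirm which client the consumer uses. `TopicAccessError` and `check_for_broker_error` are hypothetical names, not part of edx_event_bus_kafka.

```python
class TopicAccessError(Exception):
    """Hypothetical exception for broker-level failures (illustration only)."""


def check_for_broker_error(msg):
    """Raise TopicAccessError if msg is an error event rather than a real record.

    Assumption: msg exposes error(), offset() and value() the way a
    confluent_kafka Message does. Call this before reading any headers.
    """
    err = msg.error()
    if err is not None:
        # Surface the broker's own text, e.g. "Broker: Topic authorization failed".
        raise TopicAccessError(f"Broker reported an error instead of a message: {err}")
    if msg.offset() is None:
        # Secondary indicator seen in the reported log: the event had offset None.
        raise TopicAccessError(f"Message has no offset; payload was: {msg.value()!r}")
```

Running a check like this ahead of the signal lookup would let the authorization failure be logged directly, instead of being reported as a missing ce_type header.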

Original description

An error in the discovery consumer:

2024-01-26 14:00:27,100 ERROR 1 [edx_event_bus_kafka.internal.consumer] consumer.py:555 - Error consuming event from Kafka: UnusableMessageError('Missing ce_type header on message, cannot determine signal') in context full_topic='prod-course-authoring-xblock-lifecycle', consumer_group='course_discovery_prod' -- event details: {'partition': 0, 'offset': None, 'headers': None, 'key': None, 'value': b'Subscribed topic not available: prod-course-authoring-xblock-lifecycle: Broker: Topic authorization failed'}
Traceback (most recent call last):
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 312, in _consume_indefinitely
    signal = self.determine_signal(msg)
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 405, in determine_signal
    event_type = self._get_event_type_from_message(msg)
  File "/edx/app/discovery/venvs/discovery/lib/python3.8/site-packages/edx_event_bus_kafka/internal/consumer.py", line 426, in _get_event_type_from_message
    raise UnusableMessageError(
edx_event_bus_kafka.internal.consumer.UnusableMessageError: Missing ce_type header on message, cannot determine signal

It's unclear why the consumer is not able to move past this error.

@robrap robrap moved this to Groomed in Arch-BOM Jan 26, 2024
@robrap robrap removed the status in Arch-BOM Jan 26, 2024
@robrap
Contributor

robrap commented Jan 26, 2024

  • It seems the actual problem was Topic authorization failed, which was buried in the error. We couldn't get to any of the messages.
  • The error Error consuming event from Kafka: UnusableMessageError('Missing ce_type header on message, cannot determine signal') is true, but very misleading.
  • We weren't alerted to this problem.

Ideally, if the entire topic is not reachable and we can't get to any messages:

  1. The error should be clearer, and
  2. We should have alerting to immediately detect this, whether it is alerting that goes to the owner or us, or some combo (e.g. safety net).

@dianakhuang
Contributor

We believe this was caused by a misconfigured ACL, which has now been corrected. We should have better reporting on when this sort of thing happens so we can fix it.

@robrap
Contributor

robrap commented Feb 2, 2024

@timmc-edx will look into rewriting this ticket, potentially splitting into two parts (error message and alerting).

@timmc-edx timmc-edx changed the title Consumer gets stuck on certain error Better error message when consumer not authorized for topic Feb 2, 2024
@timmc-edx timmc-edx added the event-bus Work related to the Event Bus. label Feb 2, 2024
@timmc-edx
Contributor

I've updated this ticket, and there are already a couple of tickets to cover the alerting side of things:

@robrap robrap moved this to Prioritized in Arch-BOM Feb 5, 2024
@dianakhuang
Contributor

After investigating this issue on DataDog, it seems like the consumer lag metric wasn't being recorded for this topic at all before we fixed the ACL. We will probably need to make alerts for this sort of thing based on logs (once we get logs in DataDog, probably).
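As a rough illustration of what log-based detection could match on, here is a small sketch that scans consumer log lines for the two authorization errors named in this issue and pulls out the `full_topic` value from the log context. The patterns and the `find_auth_failures` name are assumptions for illustration; this is not a DataDog monitor definition, only the matching logic such an alert might be built around.

```python
import re

# Assumption: these two phrases are the errors mentioned in this issue.
AUTH_FAILURE = re.compile(r"(Topic|Group) authorization failed")
# Assumption: the topic appears as full_topic='...' as in the reported log line.
FULL_TOPIC = re.compile(r"full_topic='([^']+)'")


def find_auth_failures(lines):
    """Yield (topic, error_text) for each log line reporting an authorization failure.

    topic is None when the line carries no full_topic context.
    """
    for line in lines:
        failure = AUTH_FAILURE.search(line)
        if failure is None:
            continue
        topic = FULL_TOPIC.search(line)
        yield (topic.group(1) if topic else None, failure.group(0))
```

Matching on the broker's error text rather than on the UnusableMessageError wording avoids alerting on genuinely malformed messages.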

@jristau1984 jristau1984 moved this from Prioritized to Backlog in Arch-BOM Jul 1, 2024