You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The scenario we're seeing is that a kafka output to a particular topic is stalling i.e. hindsight.cp shows that the output is not processing further data. This eventually leads to backpressure because of low disk free. In most cases it resolves itself automatically before reaching low disk free, perhaps due to some retry loop finally succeeding.
The topic we're seeing this on appears to be healthy from a kafka perspective, in that it is fully replicated and all leaders are preferred:
Replication 3
Number of Partitions 100
Sum of partition offsets 33011035
Total number of Brokers 9
Number of Brokers for Topic 9
Preferred Replicas % 100
Brokers Skewed % 0
Brokers Leader Skewed % 0
Brokers Spread % 100
Under-replicated % 0
Of note it appears coincidentally that the volume of output to this already sparse topic dropped at roughly the same time the issue arose. Restarting the output appears to resolve the issue as a workaround, though if there is an in-flight ping at restart time perhaps it is lost.
Per IRC discussion there is likely either an underlying HS async cp bug, a libkafka problem, or a kafka issue. As far as I can tell, the kafka cluster is fully operational. I'm guessing the volume drop on input is not coincidental and is related to the issue, but further investigation would be required to determine if this is the case.
The text was updated successfully, but these errors were encountered:
There may also be a data race condition in the sandbox Kafka module causing it to miss an ack, normally this is not as issue as it is treated as a high water mark and the next one will advance it. I was unable to reproduce this while testing but we are still seeing an occasional stall in production so more investigation is necessary.
The scenario we're seeing is that a kafka output to a particular topic is stalling i.e. hindsight.cp shows that the output is not processing further data. This eventually leads to backpressure because of low disk free. In most cases it resolves itself automatically before reaching low disk free, perhaps due to some retry loop finally succeeding.
The topic we're seeing this on appears to be healthy from a kafka perspective, in that it is fully replicated and all leaders are preferred:
Of note it appears coincidentally that the volume of output to this already sparse topic dropped at roughly the same time the issue arose. Restarting the output appears to resolve the issue as a workaround, though if there is an in-flight ping at restart time perhaps it is lost.
Per IRC discussion there is likely either an underlying HS async cp bug, a libkafka problem, or a kafka issue. As far as I can tell, the kafka cluster is fully operational. I'm guessing the volume drop on input is not coincidental and is related to the issue, but further investigation would be required to determine if this is the case.
The text was updated successfully, but these errors were encountered: