Kafka output stalling on small topic #444

Open
whd opened this issue Jul 15, 2019 · 2 comments

Comments

whd (Member) commented Jul 15, 2019

We're seeing a Kafka output to a particular topic stall, i.e. hindsight.cp shows that the output is not processing any further data. This eventually leads to backpressure once free disk space runs low. In most cases the stall resolves itself before free disk space runs low, perhaps because some retry loop finally succeeds.

The affected topic appears healthy from a Kafka perspective: it is fully replicated and all leaders are preferred replicas:

Replication                    3
Number of Partitions           100
Sum of partition offsets       33011035
Total number of Brokers        9
Number of Brokers for Topic    9
Preferred Replicas %           100
Brokers Skewed %               0
Brokers Leader Skewed %        0
Brokers Spread %               100
Under-replicated %             0

Of note, the drop in output volume to this already sparse topic coincided roughly with the onset of the issue. Restarting the output appears to work around the issue, though a ping that is in flight at restart time may be lost.

Per IRC discussion, the likely cause is an underlying Hindsight (HS) async checkpoint (cp) bug, a libkafka problem, or a Kafka issue. As far as I can tell, the Kafka cluster is fully operational. I'm guessing the volume drop on input is not a coincidence and is related to the issue, but confirming that would require further investigation.

trink (Contributor) commented Jul 19, 2019

Issue moved to the Hindsight repository

trink closed this as completed Jul 19, 2019

trink (Contributor) commented Sep 4, 2019

There may also be a data race in the sandbox Kafka module that causes it to miss an ack. Normally this is not an issue, because the ack is treated as a high-water mark and the next one will advance it. I was unable to reproduce this while testing, but we are still seeing an occasional stall in production, so more investigation is necessary.
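
To illustrate the high-water-mark behaviour described above, here is a minimal Python sketch. It is not the sandbox Kafka module's code: `AckTracker`, its methods, and the sequence IDs are all hypothetical names, and it assumes an ack for message N covers everything up to N. It shows why a missed ack is harmless when a later ack arrives, and how progress would stay behind if no later message follows, which might be relevant on a sparse topic but is unconfirmed.

```python
class AckTracker:
    """Hypothetical tracker: progress is the highest acknowledged sequence ID."""

    def __init__(self):
        self.sent = 0        # highest sequence ID handed to the producer
        self.checkpoint = 0  # highest sequence ID known to be acknowledged

    def send(self):
        """Register one outgoing message and return its sequence ID."""
        self.sent += 1
        return self.sent

    def ack(self, sequence_id):
        # Assumption: an ack for N covers every message up to N, so a lost
        # ack for an earlier message is made up for by any later ack.
        # Stale or duplicate acks never move the checkpoint backwards.
        if sequence_id > self.checkpoint:
            self.checkpoint = sequence_id

    def pending(self):
        """Number of messages sent but not yet covered by the checkpoint."""
        return self.sent - self.checkpoint


tracker = AckTracker()
first = tracker.send()
second = tracker.send()

# Case 1: the ack for `first` is missed, but the ack for `second` arrives
# and advances the checkpoint past both messages.
tracker.ack(second)
pending_after_later_ack = tracker.pending()

# Case 2: the ack for `third` is missed and no further message is sent,
# so nothing advances the checkpoint and one message stays pending.
third = tracker.send()
pending_without_later_ack = tracker.pending()

print(pending_after_later_ack, pending_without_later_ack)
```

In this sketch the second case only clears when another message is sent and acknowledged, or when the tracker is reset, which loosely mirrors the restart workaround mentioned earlier in the thread.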

trink reopened this Sep 4, 2019