Kafka output stalling on small topic #444

whd · 2019-07-15T20:41:13Z

The scenario we're seeing is that a Kafka output to a particular topic stalls, i.e. hindsight.cp shows that the output is not processing further data. This eventually leads to backpressure because of low disk free. In most cases it resolves itself before reaching low disk free, perhaps due to some retry loop finally succeeding.

The topic we're seeing this on appears to be healthy from a kafka perspective, in that it is fully replicated and all leaders are preferred:

Replication 	3
Number of Partitions 	100
Sum of partition offsets 	33011035
Total number of Brokers 	9
Number of Brokers for Topic 	9
Preferred Replicas % 	100
Brokers Skewed % 	0
Brokers Leader Skewed % 	0
Brokers Spread % 	100
Under-replicated % 	0

Of note, and seemingly by coincidence, the volume of output to this already sparse topic dropped at roughly the same time the issue arose. Restarting the output appears to resolve the issue as a workaround, though a ping that is in flight at restart time may be lost.

Per IRC discussion, there is likely either an underlying Hindsight async checkpoint bug, a libkafka problem, or a Kafka issue. As far as I can tell, the Kafka cluster is fully operational. I'm guessing the volume drop on input is not coincidental and is related to the issue, but further investigation would be required to determine whether this is the case.

trink · 2019-07-19T21:28:51Z

Issue moved to the Hindsight repository

trink · 2019-09-04T15:19:24Z

There may also be a data race condition in the sandbox Kafka module causing it to miss an ack. Normally this is not an issue, as the ack is treated as a high water mark and the next one will advance it. I was unable to reproduce this while testing, but we are still seeing an occasional stall in production, so more investigation is necessary.
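
To illustrate why a missed ack would normally be harmless but could stall a sparse topic, here is a minimal sketch in Python. It is a toy model, not the Hindsight or sandbox Kafka module code: the `HighWaterMarkCheckpoint` class, the `simulate` helper, and the sequence numbers are all hypothetical, and it assumes acks arrive in order and that the checkpoint only moves when an ack is seen.

```python
class HighWaterMarkCheckpoint:
    """Hypothetical model: the checkpoint is the highest acked sequence number."""

    def __init__(self):
        self.checkpoint = 0

    def on_ack(self, sequence_id):
        # An ack for message N is taken to cover everything up to N,
        # so a single missed ack is repaired by any later ack.
        if sequence_id > self.checkpoint:
            self.checkpoint = sequence_id


def simulate(sent, dropped_acks):
    """Send messages 1..sent, lose the acks listed in dropped_acks,
    and return the final checkpoint value."""
    cp = HighWaterMarkCheckpoint()
    for seq in range(1, sent + 1):
        if seq not in dropped_acks:
            cp.on_ack(seq)
    return cp.checkpoint


# Busy topic: the ack for message 4 is lost, but later acks advance past it.
busy = simulate(sent=10, dropped_acks={4})

# Sparse topic: the lost ack belongs to the last message sent, so nothing
# advances the checkpoint until more data arrives.
sparse = simulate(sent=4, dropped_acks={4})
```

In this model the busy case ends with the checkpoint at the last message, while the sparse case stays one message behind until new traffic shows up, which would be consistent with the stall appearing when output volume dropped. Whether that is what actually happens in production is still unconfirmed.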

trink mentioned this issue Jul 19, 2019

Async output checkpoints are only updated when consuming data mozilla-services/hindsight#190

Closed

trink closed this as completed Jul 19, 2019

trink reopened this Sep 4, 2019
