Kafka trigger timeout #93
Comments
Can you provide the log from the scheduler, please?
Sure, I will provide a log when I encounter the issue, which should be in the next couple of days.
Hi @tchiotludo, the issue happened again; here is the scheduler log:
I also took a scheduler log print from when it was working, and it looks like:
Do you have the log from the executor, please?
executor.txt
Maybe a side note: it seems that when the Scheduler Kubernetes pod is recreated, it does not pick up the processes and does not start the triggers to poll from Kafka; then, when flows are manually edited, it starts them again.
ok so now, I will need the worker log 😅
@tchiotludo We have three workers and the logs are quite extensive; they also contain some sensitive information. What should I be looking for?
Since it's a bug, I don't know what the main issue could be.
Hi @tchiotludo, I found lines like:
As a result I have an empty execution with 0s of runtime and no trigger details. In the worker logs I also see:
Now, these log prints happen when I save the flow with an added blank line; after that, everything works properly. Other messages sent while the executions were halted yield no error or warning, or in fact any log at all, for the affected flows. To me it looks like the realtime triggers are not restarted after a pod failure, but I have one more idea that could potentially affect this: we are also using a sync workflow. Looking forward to hearing from you; let me know if you need any additional information. Thanks
Hi @tchiotludo, any ideas on how to proceed? Best,
Hi @tchiotludo, we are planning to go to production soon; could you please assist us with this bug?
As I see it, we need a full stack trace to understand this; an unknown exception doesn't help. It should be on the Flows > Logs page.
Hi @tchiotludo, as I explained earlier, there is no error or stack trace anywhere, nor any kind of log, not even an info log, related to the affected flows. It seems that the Kafka realtime trigger gets disabled at some point and Kestra is not polling for new messages. It looks like some kind of timeout somewhere is affecting Kestra as a consumer.
You could increase the log level to TRACE to capture it.
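For example, in the Kestra configuration (a sketch; `logger.levels` is the Micronaut-style block Kestra uses for log levels, and the package choices here are illustrative):

```yaml
logger:
  levels:
    # trace Kestra internals to catch what happens when the trigger goes silent
    io.kestra: TRACE
    # trace the Kafka client too, since the consumer itself may be timing out
    org.apache.kafka: TRACE
```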
@tchiotludo I will try it and get back to you, thanks
@nikolicdragoslav does the issue persist?
@nikolicdragoslav
Hi guys, sorry for the late reply. We ended up disabling the sync workflow and the issue was not showing at first, but recently it is back again. Here are some points:
I can try to play with max.poll.interval.ms and see what that gives back, but since this has been an ongoing issue for quite some time and it is preventing us from going to production, would it be possible to have a huddle on Slack and try to debug the issue? Thanks,
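Something like this is what I have in mind (a sketch; I'm assuming the trigger's `properties` map is passed through to the underlying Kafka consumer configuration, and the broker address is a placeholder):

```yaml
properties:
  bootstrap.servers: broker:9092   # placeholder
  # allow more time between poll() calls before the broker evicts the
  # consumer from the group and forces a rebalance (default is 300000)
  max.poll.interval.ms: "600000"
```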
@nikolicdragoslav can you share the flow, your Kestra version, and any extra details that could help us reproduce? thx a lot!
@anna-geller of course, here is one of the workflows for which Kestra stops polling from the Kafka topic.
Regarding the version, it happened with: but we are using the newer one at the moment, with the same issues: Here is a screenshot of the execution with 0s, from when I created a new revision of the flow in order for the trigger to start running again: The last working execution before the stoppage has revision 225, the faulty one also has revision 225, and when I created a new revision (226) it started working properly. The faulty execution with 0s doesn't have any logs in the Gantt or Logs tab; what is also weird is that the Trigger section is empty. Normal executions have the Trigger section populated with the proper event from the Kafka topic.
Can you look at the logs on the server at the time the execution failed, please?
Hi @tchiotludo, I looked into the logs at the time of the faulty execution and this is what I see shortly before the failure. I have masked the broker and IP information from the stack trace. Just an FYI, we are using an MSK cluster on AWS.
Do you have any idea what could be causing this and how to resolve it? Thanks,
The error is expected, and it didn't give me any clue; it seems a transient one. The fact is that it should not prevent future executions. Are these logs coming from Kestra? The format doesn't seem to be ours.
@tchiotludo the logs are from Grafana, but the Kestra pods are being scraped; the only thing I can see in the Kestra logs in the UI is:
If you disregard this part of the lines, for example: you will see the format is exactly that of the Kestra logs.
To help you piece it together, here is what happens shortly before the logs I already shared:
So it's an invalid SSL connection; could you try again with the latest version, please?
@tchiotludo would you mind providing the tag? I can try doing it. Thanks
latest is the perfect one
@tchiotludo I will get back to you after some testing with the new version, thank you
@tchiotludo it seems to behave better; I will wait for some time and monitor it closely before closing the ticket. Just for the note: the latest tag is using v0.20.6, as I can see in the UI.
Describe the issue
Hi,
I am having a weird issue when using the Kafka realtime plugin as a trigger for my workflow.
This is my trigger:
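Roughly (a sketch, with placeholder topic, group id, and broker values standing in for our real configuration):

```yaml
id: kafka_realtime_example           # placeholder flow id
namespace: company.team              # placeholder namespace

tasks:
  - id: log_message
    type: io.kestra.plugin.core.log.Log
    message: "Received: {{ trigger.value }}"

triggers:
  - id: realtime
    type: io.kestra.plugin.kafka.RealtimeTrigger
    topic: my_topic                  # placeholder
    groupId: my_consumer_group       # placeholder
    keyDeserializer: STRING
    valueDeserializer: STRING
    properties:
      bootstrap.servers: broker:9092 # placeholder
```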
The trigger works great for a couple of days, but at some point it stops polling the Kafka topic. The messages are there and can be read by other consumers; it just seems that Kestra somehow stops consuming the messages and consequently does not kick off any executions.
The workaround at the moment is editing the flow code by simply adding or removing a blank line and saving it. When saved, all the missed messages are read at once and a bunch of executions start at the same time.
As of now, there doesn't seem to be any log that shows an error, either in Kafka or in Kestra; it just seems that the trigger is disabled, even though it is in fact enabled.
Is there some setting in Kafka or in Kestra that causes this timeout and needs to be altered? If not, is this simply a bug in the plugin?
Best,
Dragoslav
Environment