[BUG]: Delay drift between kafka input data to kafka output data #1144
Comments
Hi @elishahaim! Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
@bsuryadevara Is this due to the unbounded array issue we ran into earlier?
@mdemoret-nv This issue is unrelated to the current ransomware detection pipeline within Morpheus. Here is some context. Initially, the ransomware detection example in the Morpheus repository was considered production-ready. However, the Networking Business Unit (NBU) team later made significant changes to the data structure and the data generation process, which resulted in a new production version. Instead of relying on file-based input, the system now streams snapshot messages from Kafka, which are generated by the OS inspector. At Bartley's request, I helped Haim with a Proof of Concept (POC) that accommodates the new production data structure for streaming input via Kafka, and I modified the existing pipeline for this purpose. However, the feature creation and preprocessing stages did not scale well because of Dask and because a single-row DataFrame was created for each snapshot. The new version processes input snapshots sequentially, whereas previously multiple snapshots were fed into the pipeline at once. The drift issue is now resolved. @elishahaim is working on a PR for the new version of the ransomware detection pipeline example with new models (with reduced features).
@bsuryadevara and @elishahaim Without seeing the pipeline, I can't really narrow down what could be causing this. It could be anything: batching in the pipeline, timeouts when requesting services, Triton optimizing models, etc. There really is no way to narrow it down without a reproducer. It sounds like a PR is on the way. Will this PR act as a reproducer for this issue? If not, can you provide a minimal reproducible example?
Removing the triage label here; I see PR #1176 is working on this, but it's been a while. @elishahaim any plans to pick this up again in the
Version
streaming ransomware model
Which installation method(s) does this occur on?
No response
Describe the bug.
We have a service in the DPU that continuously extracts raw data (memory feature snapshots) and transmits it to Kafka. The time between two memory snapshots is ~5 seconds, so the ransomware detection pipeline must preprocess and run inference on each snapshot in less than 5 seconds; otherwise the delay grows without bound. To test whether we suffer from such a growing delay, I monitor the Kafka input to watch the input snapshot ID and, at the same time, monitor the Kafka output to watch the output snapshot ID.
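The timing constraint above can be illustrated with a minimal sketch. It assumes a single sequential worker with a fixed per-snapshot processing time, which is a simplification and not a description of the actual pipeline; the function name `backlog_after` and the 4 s and 6 s processing times are hypothetical values chosen for illustration.

```python
import math


def backlog_after(elapsed_s, arrival_interval_s=5.0, processing_time_s=6.0):
    """Snapshots still waiting after `elapsed_s` seconds.

    Simplified model (an assumption, not the real pipeline): one snapshot
    arrives every `arrival_interval_s` seconds and a single worker needs
    `processing_time_s` seconds per snapshot.
    """
    arrived = math.floor(elapsed_s / arrival_interval_s)
    # The worker cannot process more snapshots than have arrived.
    processed = min(arrived, math.floor(elapsed_s / processing_time_s))
    return arrived - processed


# Hypothetical processing times: 4 s keeps up, 6 s falls steadily behind.
for processing_time in (4.0, 6.0):
    backlog = backlog_after(600, processing_time_s=processing_time)
    print(f"processing={processing_time}s -> backlog after 10 min: {backlog}")
```

Under this model, any processing time above the 5-second arrival interval makes the backlog grow linearly with time, which is the unbounded delay the test is meant to detect.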
We are seeing an odd pattern: the difference between the input and output IDs keeps growing, but periodically a large burst of messages arrives on the Kafka output and shrinks the difference again. As a result, the difference does not grow without bound even after a long time, but the delay between the current input snapshot and the latest output snapshot is still very large. The delay also appears to have a maximum of about 50 snapshots (3-4 minutes). In all the experiments I ran, it never exceeded ~50 snapshots.
An example to explain my description:
In the beginning, the input snapshot ID is 1 and the output snapshot ID is 1.
After 3 minutes, input snapshot is 36 and output snapshot is 10.
After another 1 minute, input snapshot is 48 and output snapshot is 40.
After another 2 minutes, input snapshot is 72 and output snapshot is 46.
After another 2 minutes, input snapshot is 96 and output snapshot is 90.
And so on…
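The example above can be turned into per-observation lag figures with a short sketch. The snapshot IDs and the ~5-second interval are taken from this report; the cumulative minute marks are derived from the "after another N minutes" steps, and the helper name `lag_report` is hypothetical.

```python
SNAPSHOT_INTERVAL_S = 5  # approximate time between snapshots, from the report

# (elapsed minutes, input snapshot ID, output snapshot ID) from the example
observations = [
    (0, 1, 1),
    (3, 36, 10),
    (4, 48, 40),
    (6, 72, 46),
    (8, 96, 90),
]


def lag_report(observations, interval_s=SNAPSHOT_INTERVAL_S):
    """Return (minute, lag in snapshots, approximate delay in seconds)."""
    report = []
    for minute, input_id, output_id in observations:
        lag = input_id - output_id
        report.append((minute, lag, lag * interval_s))
    return report


for minute, lag, delay_s in lag_report(observations):
    print(f"t={minute} min: lag={lag} snapshots (~{delay_s} s)")
```

With these numbers the lag alternates between 26 snapshots (about 130 s) and 6 to 8 snapshots (about 30 to 40 s), which matches the sawtooth pattern described: the lag builds up, then a burst of output messages brings it back down.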
@bsuryadevara
Minimum reproducible example
No response
Relevant log output
No response
Full env printout
No response
Other/Misc.
No response
Code of Conduct