
[BUG]: Delay drift between kafka input data to kafka output data #1144

Open
elishahaim opened this issue Aug 23, 2023 · 6 comments
Labels
bug (Something isn't working), external (This issue was filed by someone outside of the Morpheus team)

Comments

@elishahaim

Version

streaming ransomware model

Which installation method(s) does this occur on?

No response

Describe the bug.

We have a service in the DPU that extracts raw data (memory feature snapshots) and transmits it to Kafka continuously. The time between two memory snapshots is ~5 seconds, so the ransomware detection pipeline should preprocess and run inference on each snapshot in under 5 seconds to avoid an ever-growing delay. To check whether the delay grows, I monitor the Kafka input topic to watch the input snapshot ID and, at the same time, monitor the Kafka output topic to watch the output snapshot ID.

We are seeing a strange phenomenon: the difference between the IDs keeps increasing, but we also receive bursts of large batches of messages on the Kafka output topic that shrink the difference again. So even after a long time the difference does not explode, but we do see large delays between the current input snapshot and the output snapshot. The delay also appears to have a ceiling of about 50 snapshots (3-4 minutes); in all of the experiments I ran, we never crossed a ~50 snapshot delay.

An example to illustrate the behavior:
In the beginning, the input snapshot ID is 1 and the output snapshot ID is 1.
After 3 minutes, input snapshot is 36 and output snapshot is 10.
After another 1 minute, input snapshot is 48 and output snapshot is 40.
After another 2 minutes, input snapshot is 72 and output snapshot is 46.
After another 2 minutes, input snapshot is 96 and output snapshot is 90.
And so on…
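
For reference, the kind of monitoring described above can be sketched roughly as follows. This is not the actual test script: the broker address, topic names, and the "snapshot_id" field are placeholders, and kafka-python is used only for illustration.

```python
# Hypothetical lag monitor, not the actual test script: tails both topics and
# reports the gap between the latest input and output snapshot IDs.
# The broker address, topic names, and the "snapshot_id" JSON field are
# placeholders for illustration only.
import json

from kafka import KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"

input_consumer = KafkaConsumer("raw_snapshots", bootstrap_servers=BROKER,
                               auto_offset_reset="latest")
output_consumer = KafkaConsumer("ransomware_detections", bootstrap_servers=BROKER,
                                auto_offset_reset="latest")

last_in = None
last_out = None

while True:
    # poll() returns {TopicPartition: [records]}; keep only the newest ID seen.
    for records in input_consumer.poll(timeout_ms=500).values():
        for rec in records:
            last_in = json.loads(rec.value)["snapshot_id"]
    for records in output_consumer.poll(timeout_ms=500).values():
        for rec in records:
            last_out = json.loads(rec.value)["snapshot_id"]

    if last_in is not None and last_out is not None:
        print(f"input={last_in} output={last_out} lag={last_in - last_out}")
```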
@bsuryadevara

Minimum reproducible example

No response

Relevant log output

No response

Full env printout

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@elishahaim elishahaim added the bug Something isn't working label Aug 23, 2023
@jarmak-nv jarmak-nv added the Needs Triage Need team to review and classify label Aug 23, 2023
@jarmak-nv
Contributor

Hi @elishahaim!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
In the meantime, feel free to add any relevant information to this issue.

@elishahaim
Author

@bsuryadevara

@mdemoret-nv
Contributor

@bsuryadevara Is this due to the unbounded array issue we ran into earlier?

@jarmak-nv jarmak-nv added the external This issue was filed by someone outside of the Morpheus team label Sep 5, 2023
@bsuryadevara
Contributor

bsuryadevara commented Sep 5, 2023

@mdemoret-nv This issue is unrelated to the current ransomware detection pipeline within Morpheus.

Here is some context. Initially, the ransomware detection example in the Morpheus repository was deemed to be in a production-ready state. However, the Networking Business Unit (NBU) team later made significant changes to the data structure and data generation process. As a result of these changes, a new production version emerged. Instead of relying on file-based input, the system now streams snapshot messages from Kafka, which are generated by the OS inspector.

In response to Bartley's request, I assisted Haim by offering a Proof of Concept (POC) that accommodates the new production data structure for streaming input via Kafka, making changes to the existing pipeline for this purpose. However, the scalability of the feature-creation and preprocessing stages was hindered by Dask overhead and by building a single-row dataframe per snapshot. This is because the new version processes input snapshots sequentially, whereas previously multiple snapshots were fed into the pipeline at once.
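
For what it's worth, the difference between the two patterns can be sketched like this (placeholder code, not from the Morpheus pipeline; the helper and field names are made up):

```python
# Illustration of the pattern described above; not code from the Morpheus pipeline.
# The preprocess() body and any column names are placeholders.
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real feature-creation / preprocessing work.
    return df


# Per-snapshot handling (what the POC effectively did): a single-row DataFrame is
# built and preprocessed for every Kafka message, so the fixed DataFrame and
# scheduling overhead is paid once per snapshot.
def handle_snapshot(snapshot: dict) -> pd.DataFrame:
    return preprocess(pd.DataFrame([snapshot]))


# Batched handling: accumulate several snapshots, build one DataFrame, and
# preprocess once, amortizing that fixed overhead across the batch.
def handle_batch(snapshots: list[dict]) -> pd.DataFrame:
    return preprocess(pd.DataFrame(snapshots))
```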

This drift issue is now resolved. @elishahaim is working on a PR for the new version of the ransomware detection pipeline example with new models (using a reduced feature set).

@mdemoret-nv
Contributor

@bsuryadevara and @elishahaim Without seeing the pipeline, I can't really narrow down what could be causing this. It could be anything: batching in the pipeline, timeouts when requesting services, Triton optimizing models, etc. There really is no way to narrow it down without a reproducer.

It sounds like a PR is on the way. Will this PR act as a reproducer for this issue? If not, can you provide a minimum reproducible example?

@jarmak-nv
Contributor

Removing the triage label here; I see PR #1176 is working on this, but it's been a while.

@elishahaim any plans to pick this up again in the 24.03 timeline?

@jarmak-nv jarmak-nv removed the Needs Triage Need team to review and classify label Dec 11, 2023
@mdemoret-nv mdemoret-nv changed the title Delay drift between kafka input data to kafka output data[BUG]: [BUG]: Delay drift between kafka input data to kafka output data Dec 13, 2023
Projects
Status: Todo
Development

No branches or pull requests

4 participants