Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function #626

Closed
jorgeesg opened this issue May 31, 2024 · 6 comments

@jorgeesg

What's the issue, what's expected?:
Given a baseline file and a diagnosis rule file, the generated diagnosis_summary report varies between executions.
The inconsistent diagnosis behavior occurs when using the "failure_check" function in the diagnosis rule file.

How to reproduce it?:

  1. Have a SuperBench results JSONL file, ideally with data from multiple nodes, to make the inconsistency easier to observe.
  2. Have a diagnosis rule file that uses the "failure_check" function for some rules, for example a rule that checks the return code metrics, much like the example in the documentation (https://microsoft.github.io/superbenchmark/docs/user-tutorial/data-diagnosis/); see the sketch after this list.
  3. Have a baseline file to use with data diagnosis.
  4. From your terminal, run SuperBench data diagnosis multiple times, using the "--output-all" flag so the status report covers all of your nodes. The log messages in your terminal will be inconsistent between runs even though you did not change any inputs; you are simply re-running the same command. Run it at least 10 consecutive times to make sure the inconsistent behavior shows up.
  5. In between runs, check the diagnosis report. The status of certain nodes will vary between accepted and failed states, due to the return code metrics.
  6. Sometimes the report is accurate and applies the return code rules properly, while other times it erroneously marks all nodes as bad on the return code metrics, even though the return code was correct (0) for some of the nodes.
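
For reference, here is a minimal sketch of such a rule file, loosely following the failure_check example in the linked documentation. The version string and metric names (kernel-launch/return_code, mem-bw/return_code) are placeholders; substitute the return code metrics of the benchmarks you actually run.

```yaml
# Sketch of a diagnosis rule file using failure_check (metric names are placeholders).
version: v0.4
superbench:
  rules:
    failure-rule:
      function: failure_check
      # Mark a node as failed if any listed return code is non-zero.
      criteria: 'lambda label: True if label >= 1 else False'
      categories: FailedTest
      metrics:
        - kernel-launch/return_code
        - mem-bw/return_code
```

And a sketch of step 4, assuming the standard `sb result diagnosis` flags and placeholder file paths; writing each run to its own output directory makes it easy to diff the reports afterwards:

```bash
# Re-run the diagnosis repeatedly with unchanged inputs and compare the reports.
for i in $(seq 1 10); do
  sb result diagnosis \
    --data-file outputs/results-summary.jsonl \
    --rule-file rule.yaml \
    --baseline-file baseline.json \
    --output-all \
    --output-dir "diagnosis-run-$i"
done
```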

Logs and snapshots:
When the return code metrics are applied correctly by the data diagnosis process, you will always see two log messages, as in the first screenshot: one from data_diagnosis.py line 265 and one from line 330.
[Screenshot: terminal logs showing both data_diagnosis.py messages]

In contrast, the second screenshot shows what the logs look like when SuperBench does not apply the return code diagnosis rules correctly: it simply marks all nodes as bad, using all of the return code metrics.
[Screenshot: terminal logs from a run that marks all nodes as bad]

Additional information:
SB version - 0.10
The current workaround for this behavior is to NOT use the "failure_check" function and to replace it with "value". However, users relying on "failure_check" may be unaware of this behavior. A sketch of the workaround rule follows.
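
For illustration, here is the failure rule above rewritten with the "value" function. This is a sketch under the same placeholder metric names; the criteria lambda flags any non-zero return code, which is what the failure_check rule was checking.

```yaml
# Sketch of the workaround: check return codes with "value" instead of "failure_check".
version: v0.4
superbench:
  rules:
    failure-rule:
      function: value
      # Flag the node if the return code is non-zero.
      criteria: 'lambda x: True if x > 0 else False'
      categories: FailedTest
      metrics:
        - kernel-launch/return_code
        - mem-bw/return_code
```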

@jorgeesg jorgeesg changed the title Data Diagnosis command is not idempotent when Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function May 31, 2024
@jorgeesg
Author

Note: upon further testing, this inconsistent behavior can also be triggered even when the diagnosis rule file does not use any failure_check rules.

@yukirora
Contributor

Hi, thanks for reaching out!
Could you please provide more information for us to reproduce the issue, including the raw data file, the rule file, the command, and the pandas version?

@cp5555
Contributor

cp5555 commented Aug 2, 2024

Hi @jorgeesg, do you still have this issue? If not, we will close it.

@jorgeesg
Author

jorgeesg commented Aug 2, 2024

Hello @cp5555 and @yukirora, I can re-test this, gather the relevant information, and post back about this issue early next week. Thanks.

@yukirora
Contributor

Hi Jorge, we have merged PR #638 to fix the issue, so this issue is going to be closed. Please let us know if you have any more questions.

@jorgeesg
Author

Thank you very much for the support and help :)
