Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function #626

Closed
jorgeesg opened this issue May 31, 2024 · 6 comments

@jorgeesg

What's the issue, what's expected?:
Given a baseline file and a diagnosis rule file, the generated diagnosis_summary report varies between executions.
The inconsistent diagnosis behavior occurs when using the "failure_check" function in the diagnosis rule file.

How to reproduce it?:

  1. Have a SuperBench results JSONL file, ideally with data from multiple nodes, to make the inconsistency easier to observe.
  2. Have a diagnosis rule file that uses the "failure_check" function for some rules, for example a rule that checks the return code metrics, much like the example in the documentation (https://microsoft.github.io/superbenchmark/docs/user-tutorial/data-diagnosis/); see the sketch after this list.
  3. Have a baseline file to use with data diagnosis.
  4. From your terminal, run SuperBench data diagnosis multiple times, using the "--output-all" flag so the status report covers all of your nodes. The log messages in your terminal will be inconsistent between runs even though you did not change any inputs; you are simply re-running the same command. Run it at least 10 consecutive times to make sure the inconsistent behavior shows up.
  5. In between runs, check the diagnosis report. The status of certain nodes will vary between accepted and failed states, due to the return code metrics.
  6. Sometimes the report is accurate and applies the return code rules properly, while other times it erroneously marks all nodes as bad on the return code metrics, even though the return code was correct (0) for some of the nodes.
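
For reference, here is a minimal sketch of such a rule file, loosely following the failure_check example in the linked documentation. The version string and metric names (kernel-launch/return_code, mem-bw/return_code) are placeholders; substitute the return code metrics of the benchmarks you actually run.

```yaml
# Sketch of a diagnosis rule file using failure_check (metric names are placeholders).
version: v0.4
superbench:
  rules:
    failure-rule:
      function: failure_check
      # Mark a node as failed if any listed return code is non-zero.
      criteria: 'lambda label: True if label >= 1 else False'
      categories: FailedTest
      metrics:
        - kernel-launch/return_code
        - mem-bw/return_code
```

And a sketch of step 4, assuming the standard `sb result diagnosis` flags and placeholder file paths; writing each run to its own output directory makes it easy to diff the reports afterwards:

```bash
# Re-run the diagnosis repeatedly with unchanged inputs and compare the reports.
for i in $(seq 1 10); do
  sb result diagnosis \
    --data-file outputs/results-summary.jsonl \
    --rule-file rule.yaml \
    --baseline-file baseline.json \
    --output-all \
    --output-dir "diagnosis-run-$i"
done
```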

Logs and snapshots:
When the return code metrics are applied correctly by the data diagnosis process, you will always see two log messages, as in the first screenshot: one from data_diagnosis.py line 265 and one from line 330.
[Screenshot: terminal logs showing both data_diagnosis.py messages]

In contrast, the second screenshot shows what the logs look like when SuperBench does not apply the return code diagnosis rules correctly: it simply marks all nodes as bad, using all of the return code metrics.
[Screenshot: terminal logs from a run that marks all nodes as bad]

Additional information:
SB version - 0.10
The current workaround for this behavior is to NOT use the "failure_check" function and to replace it with "value". However, users relying on "failure_check" may be unaware of this behavior. A sketch of the workaround rule follows.
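
For illustration, here is the failure rule above rewritten with the "value" function. This is a sketch under the same placeholder metric names; the criteria lambda flags any non-zero return code, which is what the failure_check rule was checking.

```yaml
# Sketch of the workaround: check return codes with "value" instead of "failure_check".
version: v0.4
superbench:
  rules:
    failure-rule:
      function: value
      # Flag the node if the return code is non-zero.
      criteria: 'lambda x: True if x > 0 else False'
      categories: FailedTest
      metrics:
        - kernel-launch/return_code
        - mem-bw/return_code
```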

@jorgeesg jorgeesg changed the title Data Diagnosis command is not idempotent when Data Diagnosis command is not idempotent when diagnosis rule file uses failure_check function May 31, 2024
@jorgeesg
Author

Note: upon further testing, this inconsistent behavior can also be triggered even when the diagnosis rule file does not use any failure_check rules.

@yukirora
Contributor

Hi, thanks for reaching out!
Could you please provide more information for us to reproduce the issue, including the raw data file, the rule file, the command, and the pandas version?

@cp5555
Contributor

cp5555 commented Aug 2, 2024

Hi @jorgeesg, do you still have this issue? If not, we will close it.

@jorgeesg
Author

jorgeesg commented Aug 2, 2024

Hello @cp5555 and @yukirora, I can re-test this, gather the relevant information, and post back about this issue early next week. Thanks.

@yukirora
Contributor

Hi Jorge, we have merged PR #638 to fix the issue, so this issue is going to be closed. Please let us know if you have any more questions.

@jorgeesg
Author

Thank you very much for the support and help :)
