Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib.sh: perf_analyze(): drop --skip-to-first-trace #1150

Merged
merged 1 commit into from
Jan 18, 2024

Conversation

btian1
Copy link
Contributor

@btian1 btian1 commented Jan 10, 2024

(very different from the initial description)

The --skip-to-first-trace option was added as a precaution to exclude
log pollution from before the start of a test, normally the residue traces are
related to trigger stop IPC, dma put and component reset (as you can see).
There could be 10 ~ 20 lines.

mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace
depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"),
mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots,
not on when booting from IMR. Booting from IMR has now been enabled by default
in most configurations. That combination breaks --skip-to-first-trace.

The only practical short-term solution is to turn off --skip-to-first-trace and take
the risk of old logs polluting the test results, which is what this commit does.

If results pollution proves to be a problem, we could in the longer term try to disable
IMR boot just before running this test using sof_debug as discussed in #1153
This would also measure from a possibly cleaner DSP state. New, DSP "debug ops"
being proposed in thesofproject/linux#4726 might also help.

@btian1 btian1 requested a review from a team as a code owner January 10, 2024 08:58
trace_prev = None
try:
if args.skip_to_first_trace:
trace_prev = skip_to_first_trace(trace_item_gen)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not remove the function?

Copy link

@aiChaoSONG aiChaoSONG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be precise, there is no new logging system introduced.

With new logging system introduced, SOF logging timestamp
are aligned with kernel, so SOF logging timestamp can't be zero
for following case.

Based on this, skip to first trace are not needed anymore,
hence remove it from current script.

Signed-off-by: Baofeng Tian <[email protected]>
The firmware timestamp is in sync with the host side now,
and becomes much more deterministic. The condition to use
--skip-to-first-trace option in CI is eliminated (see the help
info for the reason it is used in CI test).

Based on this, skip-to-first-trace are not needed anymore,
hence remove it from the script.

Link: https://github.com/thesofproject/sof/pull/8480

Signed-off-by: Baofeng Tian <[email protected]>

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip to first trace are not needed anymore, hence remove it from current script.

This is NOT the real reason why you're submitting this PR.

You're submitting this PR because a test started FAILING following some logging change.

Describe the test failure and the issue first.

So far it's not obvious why a minor looking timestamp change broke this and why it requires removing code. sof-test must stay usable on older product versions.

Submit a new GitHub issue if the analysis of the issue is too long for a commit message.

@btian1 btian1 force-pushed the remove_skip_first_trace branch 2 times, most recently from 045f844 to 7daa1be Compare January 10, 2024 10:49
@btian1 btian1 requested review from marc-hb and aiChaoSONG January 10, 2024 10:52
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared test logs before/after the thesofproject/sof#8480 change (merged on Jan. 8th). Did you?

While the return to zero is gone, the logs from the previous test are STILL THERE and they can still pollute the measurement. That has not changed. Only their timestamp has changed but the content has not changed so they must still be skipped.

Please discuss with @RanderWang to understand the new timestamp behavior and design a solution to skip logs from the previous test that is compatible with both old and new versions.

(BTW you're removing the --skip-to-first-trace command line option while leaving the function behind. Even if removing the whole thing were right, leaving dead code behind would be sloppy).

@btian1 btian1 force-pushed the remove_skip_first_trace branch from 7daa1be to 3947c34 Compare January 11, 2024 02:25
@marc-hb
Copy link
Collaborator

marc-hb commented Jan 11, 2024

While the performance logs look different, here are some sample logs in another test:
https://sof-ci.01.org/sofpr/PR8571/build1764/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=check-suspend-resume-with-playback

[ 5496.541551] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5496.542568] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5497.534570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2463, max 8105, overruns 0
[ 5498.558570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2444, max 2511, overruns 0
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7d000 initialized
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7c000 initialized
[    0.000000] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-1966-g73327103315f zephyr:zephyr-v3.5.0-3682-gd7af6f371034 src hash 0xdceb0e30 (ref hash 0xdceb0e30)
*** Booting Zephyr OS build zephyr-v3.5.0-3682-gd7af6f371034 ***
[    0.000000] <inf> main: sof_app_main: SOF on intel_adsp_cavs25
[    0.000000] <inf> main: sof_app_main: SOF initialized
[    0.000000] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[ 5504.678680] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[ 5505.391256] <inf> ipc: ipc_cmd: rx	: 0x11000007|0x0
[ 5505.391278] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
[ 5505.392053] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15

The timestamp still starts at zero on DSP boot.

@btian1
Copy link
Contributor Author

btian1 commented Jan 11, 2024

While the performance logs look different, here are some sample logs in another test: https://sof-ci.01.org/sofpr/PR8571/build1764/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=check-suspend-resume-with-playback

[ 5496.541551] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5496.542568] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5497.534570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2463, max 8105, overruns 0
[ 5498.558570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2444, max 2511, overruns 0
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7d000 initialized
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7c000 initialized
[    0.000000] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-1966-g73327103315f zephyr:zephyr-v3.5.0-3682-gd7af6f371034 src hash 0xdceb0e30 (ref hash 0xdceb0e30)
*** Booting Zephyr OS build zephyr-v3.5.0-3682-gd7af6f371034 ***
[    0.000000] <inf> main: sof_app_main: SOF on intel_adsp_cavs25
[    0.000000] <inf> main: sof_app_main: SOF initialized
[    0.000000] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[ 5504.678680] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[ 5505.391256] <inf> ipc: ipc_cmd: rx	: 0x11000007|0x0
[ 5505.391278] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
[ 5505.392053] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15

The timestamp still starts at zero on DSP boot.

this is not a performance logging, when test some features, DSP will be restarted, and then 0 timestamp will happen.
That's to say, 0 will only happend on DSP power on.

I am changing the patch to only remove this option from lib.sh, and still keep the functionality in python script

@btian1 btian1 force-pushed the remove_skip_first_trace branch from 3947c34 to 78c0eeb Compare January 11, 2024 05:10
@RanderWang
Copy link
Contributor

While the performance logs look different, here are some sample logs in another test: https://sof-ci.01.org/sofpr/PR8571/build1764/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=check-suspend-resume-with-playback

[ 5496.541551] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5496.542568] <inf> host_comp: host_get_copy_bytes_normal: comp:0 0x4 no bytes to copy, available samples: 0, free_samples: 192
[ 5497.534570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2463, max 8105, overruns 0
[ 5498.558570] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2444, max 2511, overruns 0
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7d000 initialized
[    0.000000] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7c000 initialized
[    0.000000] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-1966-g73327103315f zephyr:zephyr-v3.5.0-3682-gd7af6f371034 src hash 0xdceb0e30 (ref hash 0xdceb0e30)
*** Booting Zephyr OS build zephyr-v3.5.0-3682-gd7af6f371034 ***
[    0.000000] <inf> main: sof_app_main: SOF on intel_adsp_cavs25
[    0.000000] <inf> main: sof_app_main: SOF initialized
[    0.000000] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[ 5504.678680] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[ 5505.391256] <inf> ipc: ipc_cmd: rx	: 0x11000007|0x0
[ 5505.391278] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
[ 5505.392053] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15

The timestamp still starts at zero on DSP boot.

Before DSP boot complete, DSP uses its default timestamp. After it sends boot_ready msg to host, host will send ipc msg to sync time stamp and then DSP will use the same time stamp, in this case the timestamp jumps to 5504.678680
[ 0.000000] ipc: ipc_cmd: rx : 0x44000000|0x31400008 // ipc sync msg

@btian1 btian1 requested a review from marc-hb January 11, 2024 08:19
@marc-hb
Copy link
Collaborator

marc-hb commented Jan 12, 2024

The timestamp still starts at zero on DSP boot.

That's to say, 0 will only happend on DSP power on.

"boot" and "power on" mean the same thing in this context.

Important questions:

  • Why was a timestamp zero found before Rander's change? There was no DSP boot either. There has been been no DSP boot change.
  • Why is there now a timestamp jump backwards after Rander's change?
  • Why has a sample of the new logs still not been shared? This PR is a fix for an issue in logs that barely anyone has seen yet! Best to file a new issue for sharing this.
  • Why is it OK now to accept pollution from the previous test? It wasn't acceptable before.

I am changing the patch to only remove this option from lib.sh, and still keep the functionality in python script

That's a bit better but it's still not good enough because we still don't have a complete and clear explanation of what changed and how. As usual, let's not fix a problem that we don't really understand yet (i.e. "Programming by coincidence"). Please analyze, understand and share a complete explanation of the problem first. Only then we can look at a fix.

@btian1
Copy link
Contributor Author

btian1 commented Jan 12, 2024

The timestamp still starts at zero on DSP boot.

That's to say, 0 will only happend on DSP power on.

"boot" and "power on" mean the same thing in this context.

Important questions:

  • Why was a timestamp zero found before Rander's change? There was no DSP boot either. There has been been no DSP boot change.

because before Rander's change, each test case will re-start timestamp with 0, after Rander's change, only DSP restart can make timestamp to 0.

  • Why is there now a timestamp jump backwards after Rander's change?

check this link, we also observed this before Rander's change, it maybe a SOF bug.
https://github.com/thesofproject/sof-test/blob/main/tools/sof_perf_analyzer.py#L173
Anyway, this is not related with this patch.

  • Why has a sample of the new logs still not been shared? This PR is a fix for an issue in logs that barely anyone has seen yet! Best to file a new issue for sharing this.

Keqiao did not file bug, just informed Chao and me to fix, the reason I think in commit message already explained and this PR review comments also explained.

  • Why is it OK now to accept pollution from the previous test? It wasn't acceptable before.
    I am not aware what exactly you are asking, you can ping me for questions.

I am changing the patch to only remove this option from lib.sh, and still keep the functionality in python script

That's a bit better but it's still not good enough because we still don't have a complete and clear explanation of what changed and how. As usual, let's not fix a problem that we don't really understand yet (i.e. "Programming by coincidence"). Please analyze, understand and share a complete explanation of the problem first. Only then we can look at a fix.

@marc-hb ,this is the right way to go, performance test is failing and if we can't provide a fix on time, a lot of people will feel upset for the reason, especially for SQE, if this failing blocks milestone by SQE, I would feel more un-comfortable.

if you still have questions, you can offline ping me, you can also went through the scirpt, since you are script expert.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 16, 2024

because before Rander's change, each test case will re-start timestamp with 0, after Rander's change, only DSP restart can make timestamp to 0.

I think I found it's a bit more complicated, see my latest comment #1153

check this link, we also observed this before Rander's change, it maybe a SOF bug.
https://github.com/thesofproject/sof-test/blob/main/tools/sof_perf_analyzer.py#L173
Anyway, this is not related with this patch.

You're submitting a change in the way timestamps are handled, so of course we need to understand what is happening with timestamps.

Keqiao did not file bug, just informed Chao and me to fix,

Anyone can file a bug. If only one person could file but then the project would obviously not scale.

the reason I think in commit message already explained and this PR review comments also explained.

I can see you tried but a lot of questions had no answer or the wrong answer. But I think we're making some progress.

this is the right way to go,

Maybe it is but if we don't know why it is then we don't know if it is. Stop "programming by coincidence".

a lot of people will feel upset for the reason, especially for SQE, if this failing blocks milestone by SQE, I would feel more un-comfortable.

Maybe you can just remove the testing and just print "PASS"? This could make everyone "happy and comfortable". We've been quite "successful" with this approach: https://github.com/thesofproject/sof-test/issues?q=+label%3A%22False+Pass+%2F+green+failure%22+

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

I've spent time looking at logs, source code, changes and discussing with @aiChaoSONG (thanks!) in #1153

Please fix the commit message with a summary similar to this and we should be good. Feel free to rephrase in your own words.

lib.sh: perf_analyze(): drop --skip-to-first-trace

The --skip-to-first-trace option was added as a precaution to exclude log pollution from before the start of a test but how much pollution it excludes was never assessed; we just don't know. Hopefully not much? Replace, see below.

mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"), mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots, not on when booting from IMR. Booting from IMR has now been enabled by default in most configurations. That combination breaks --skip-to-first-trace.

The only practical short-term solution is to turn off --skip-to-first-trace and take the risk of old logs polluting the test results, which is what this commit does.

If results pollution proves to be a problem, we could in the longer term try to disable IMR boot just before running this test using sof_debug as discussed in #1153 This would also measure from a possibly cleaner DSP state. New, DSP "debug ops" being proposed in thesofproject/linux#4726 might also help.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

but how much pollution it excludes was never assessed; we just don't know. Hopefully not much?

Actually, we know a little bit. Please replace this with words from @aiChaoSONG in #1153 (comment)

@btian1 btian1 force-pushed the remove_skip_first_trace branch from 78c0eeb to b928a2a Compare January 17, 2024 06:55
@btian1 btian1 changed the title tools: sof_perf_analyzer: remove skip to first trace option lib.sh: perf_analyze(): drop --skip-to-first-trace Jan 17, 2024
@btian1 btian1 force-pushed the remove_skip_first_trace branch from b928a2a to 1ab4988 Compare January 17, 2024 12:45
The --skip-to-first-trace option was added as a precaution to exclude
log pollution from before the start of a test, normally the residue traces are
related to trigger stop IPC, dma put and component reset (as you can see).
There could be 10 ~ 20 lines.

mtrace timestamps used to return to zero on every DSP boot, which --skip-to-first-trace
depends on. After SOF commit 76b0979c6453 ("sync time between host and fw for logging"),
mtrace timestamps still return to zero (briefly) but only on "cold" DSP boots,
not on when booting from IMR. Booting from IMR has now been enabled by default
in most configurations. That combination breaks --skip-to-first-trace.

The only practical short-term solution is to turn off --skip-to-first-trace and take
the risk of old logs polluting the test results, which is what this commit does.

If results pollution proves to be a problem, we could in the longer term try to disable
IMR boot just before running this test using sof_debug as discussed in thesofproject#1153
This would also measure from a possibly cleaner DSP state. New, DSP "debug ops"
being proposed in thesofproject/linux#4726 might also help.

Fixes: thesofproject#1153

Signed-off-by: Baofeng Tian <[email protected]>
Signed-off-by: Chao Song <[email protected]>
Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 18, 2024

I checked all the failures, none is related.

@marc-hb marc-hb merged commit 14c049b into thesofproject:main Jan 18, 2024
2 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants