
Update Verilator to version 5.006 #158

Merged: 10 commits into main from update-verilator on Aug 1, 2024
Conversation

@colluca (Collaborator) commented Jul 8, 2024

General

Details

  • Add --timing to support timing constructs, e.g. as will be present in the iDMA tracer.
  • Add the VLT_JOBS option to control the number of parallel Verilation jobs, useful to keep memory usage under control (see the usage sketch after this list).
  • Verilator 5.006 seems to require more memory than 4.110, specifically more than 16 GB even with a single job, causing the GitHub CI to fail. We attempted hierarchical Verilation on the Snitch core complex, but this fails due to missing support for type parameters (see Support 'parameter type' in hierarchical blocks verilator/verilator#5309). To solve this issue we provide the github-ci.hjson config with a reduced core count (4 cores) for use in the GitHub CI.
  • The out-of-memory failure occurred only with the default HW config default.hjson, and not with fdiv.hjson. In any case it's best to align all configurations to the default, so this is included as part of the PR.
  • Generalize all tests to work with an arbitrary number of cores (>4).
  • Temporarily disable the ATAX test, to be fixed in a separate commit.
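
A rough sketch of how these knobs might be used, assuming the repo's make-based flow (the bin/snitch_cluster.vlt target is the one named later in this thread; CFG_OVERRIDE and the config path are assumptions for illustration, not verbatim from this PR):

```sh
# Limit parallel compile jobs during Verilation to keep peak memory
# below the CI runner's limit (VLT_JOBS is the option added here):
make VLT_JOBS=1 bin/snitch_cluster.vlt

# Build against the reduced 4-core configuration for the GitHub CI
# (variable name and path assumed, exact spelling may differ):
make CFG_OVERRIDE=cfg/github-ci.hjson bin/snitch_cluster.vlt
```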

@zero9178 (Contributor) commented Jul 9, 2024

I tested this PR in our downstream Docker image and test suite (https://github.com/opencompl/Quidditch/blob/main/runtime/toolchain/Dockerfile), in part to evaluate whether this would then also close #148.

A few things sadly happened as soon as I used the newer SHA with the newer Verilator release:

  • The verilated model seems to crash immediately if compiled with Clang, even on the newest Clang versions (tested 14 and 18.1). I suspected a stack overflow initially, but it also crashes with an unlimited stack, so I am not sure.
    It would be a shame to lose Clang, as using Clang previously cut our simulation time by 1/5. Applying PGO to the Verilator build reduced simulation time by a further 50% (though this could also be done with GCC).
    I see the same behaviour in Verilator 5.026.
  • Even when building with GCC (we are running with GCC 12.1, a tad newer than the GCC 11.1 in Ubuntu 22.04), half the tests in snitch_cluster are failing, without yet using VLT_NUM_THREADS. I haven't yet looked further into why. I might also be doing something wrong, though, as it seems to work on our executables. I also noticed that the log files have different names now (trace_hart_0000x.dasm), so maybe this is also the cause.
  • Compiling with 9 threads causes a bit of a slowdown on single-threaded executables (not unexpected). I am now running tests that are mostly multi-threaded workloads (as in: using all cores of the cluster).

@zero9178 (Contributor) commented Jul 9, 2024

So the good news is that the new Verilator model is a lot faster running a whole NN that leverages all 9 cores of a Snitch cluster. Previously it took 5076 seconds; this has now been reduced to 2200 seconds, i.e. a reduction of more than 50%.

I used VLT_NUM_THREADS=9, thinking that this matches the inherent parallelism of the Verilator model, but a more optimal value may exist (our machines have 16 threads to play with). For GitHub Actions we would most likely scale this back to 4 at most.
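
For reference, a minimal sketch of the invocation meant here (the bin/snitch_cluster.vlt target is the one named later in this thread; this is illustrative, not verbatim from the PR):

```sh
# Build the verilated model with 9 simulation threads; the default
# stays at 1 thread for backwards compatibility.
make VLT_NUM_THREADS=9 bin/snitch_cluster.vlt
```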

The correctness issue in #148 seems to have been fixed as well.

@colluca (Collaborator, Author) commented Jul 11, 2024

@zero9178 thanks for the detailed testing and report!

The last results look promising, so I would close #148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?

Regarding the log names, this should be due to #58, but anyway the traces shouldn't affect the correctness of any of the software tests.

@colluca colluca mentioned this pull request Jul 11, 2024
@colluca colluca force-pushed the update-verilator branch from 720c1d7 to e5f7611 Compare July 11, 2024 13:47
@zero9178 (Contributor) commented:

> @zero9178 thanks for the detailed testing and report!
>
> The last results look promising, so I would close #148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?

I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible.
(Hopefully by next week when we're in Zurich 🙂)

@colluca (Collaborator, Author) commented Jul 15, 2024

> I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible. (Hopefully by next week when we're in Zurich 🙂)

Looking forward to meeting you; we can surely discuss this in that setting as well :)

@fischeti fischeti mentioned this pull request Jul 19, 2024
2 tasks
@colluca colluca force-pushed the update-verilator branch from e5f7611 to 3ca593a Compare July 22, 2024 07:06
@zero9178 (Contributor) commented:

I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seems to have been a stack overflow after all. The culprits are the files/functions that Verilator designates as slow and therefore compiles with -O0. Since neither GCC nor Clang performed any optimizations, code in these files had excessive stack usage. Adding OPT_SLOW="-O1" to my make invocation when building bin/snitch_cluster.vlt fixed it and made the Clang-built model pass every test in the repo as well.
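
As a minimal sketch of this workaround (OPT_SLOW is Verilator's standard makefile variable for the optimization flags applied to its "slow", non-performance-critical generated files; the target name is the one from the comment above):

```sh
# Compile the "slow" generated files with -O1 instead of -O0 so the
# compiler trims their excessive stack frames and the model no longer
# overflows the stack at startup.
make OPT_SLOW="-O1" bin/snitch_cluster.vlt
```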

@colluca (Collaborator, Author) commented Jul 23, 2024

> I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seems to have been a stack overflow after all. The culprits are the files/functions that Verilator designates as slow and therefore compiles with -O0. Since neither GCC nor Clang performed any optimizations, code in these files had excessive stack usage. Adding OPT_SLOW="-O1" to my make invocation when building bin/snitch_cluster.vlt fixed it and made the Clang-built model pass every test in the repo as well.

Great news! 🎉 I will proceed with the merge then :)

@colluca colluca force-pushed the update-verilator branch from 3ca593a to 40cff28 Compare July 23, 2024 14:04
@colluca colluca marked this pull request as ready for review July 23, 2024 14:26
@colluca colluca force-pushed the update-verilator branch from 40cff28 to 257229e Compare July 26, 2024 16:05
@colluca colluca requested a review from lucabertaccini as a code owner July 26, 2024 23:04
@colluca colluca force-pushed the update-verilator branch 2 times, most recently from 8df70ab to 54aeece Compare July 27, 2024 14:44
@colluca colluca requested a review from viv-eth as a code owner July 27, 2024 15:35
@colluca colluca force-pushed the update-verilator branch from 2187057 to 2362ebb Compare July 28, 2024 10:01
@fischeti (Contributor) left a comment

Very nice! Thanks a lot. The CI should pass again with #172.

colluca and others added 3 commits August 1, 2024 10:15
Using more threads in the simulation may reduce simulation time, especially when the program running in the simulator makes use of multiple cores.
This PR therefore adds the `VLT_NUM_THREADS` option to `make` to allow users to set the number of threads used in the simulator. For backwards compatibility, and since the optimal number of threads is environment- and code-dependent, the default value of 1 was kept.
@colluca colluca merged commit 51c0300 into main Aug 1, 2024
23 of 25 checks passed
@colluca colluca deleted the update-verilator branch August 1, 2024 10:35