Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#4547: Refactor vmm builder code to simplify logic that creates the microVM to boot #4910

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

tommady
Copy link
Contributor

@tommady tommady commented Nov 13, 2024

fix(4547): Refactor vmm builder code to simplify logic

Changes

Refactoring Vmm builder code

Reason

close issue #4547

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

pass vm_config to eliminate two extra arguments derived from it

Signed-off-by: tommady <[email protected]>
remove cfg_attr and extract create_vcpus from create_vmm_and_vcpus

Signed-off-by: tommady <[email protected]>
@tommady tommady force-pushed the fix-4547 branch 2 times, most recently from fce4afd to 38e9537 Compare November 15, 2024 21:38
extract codes into two architecture specific modes

Signed-off-by: tommady <[email protected]>
eliminate the unnecessary usage of the event_manager argument

Signed-off-by: tommady <[email protected]>
@tommady tommady marked this pull request as ready for review November 15, 2024 21:56
Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tommady, thx for doing the PR. Overall changes seems solid, but I left some small notes here and there. Also can you change names of the commits to refactor(builder): .... This is usually how we treat refactor commits.

src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
src/vmm/src/builder.rs Outdated Show resolved Hide resolved
pb8o and others added 6 commits November 22, 2024 03:14
Add the SVE CPU template as a valid template in 5.10 since it works.

Signed-off-by: Pablo Barbáchano <[email protected]>
Update release policy to v1.10.1 patch

Signed-off-by: Jack Thomson <[email protected]>
`TcpIPv4Handler` for MMDS network stack preallocates several buffers
whose sizes are saved into a snapshot as `max_connections` and
`max_pending_resets` in `MmdsNetworkStackState`. But they are always the
same constant hardcoded values (`DEFAULT_MAX_CONNECTIONS` and
`DEFAULT_MAX_PENDING_RESETS`) as of today, which means there is no need
to save them into a snapshot. Even if we change the hardcoded sizes
across Firecracker versions, that should not be a problem. This is
because the snapshot feature does not support migration of network
connections and those buffers are initialized with empty on snapshot
restoration. When migrating from a Firecracker version with larger
buffers to another version with smaller ones, guest workloads that
worked previously might start to fail due to the less buffer spaces.
However, the issue is not a problem of the snapshot feature and it
should also occur even on a purely booted microVM (not restored from a
snapshot). Thus, it is fine to remove those fields from a snapshot.

Since this is a breaking change of the snapshot format, bumps the major
version.

Signed-off-by: Takahiro Itazuri <[email protected]>
There is no need to use MmdsNetworkStack::new() instead of
MmdsNetworkStack::new_with_defaults() in tests that pass the same
default values.

Signed-off-by: Takahiro Itazuri <[email protected]>
We bumped the snapshot version up twice recently, requiring users to
regenerate their snapshot, but the user action isn't clearly stated.

Signed-off-by: Takahiro Itazuri <[email protected]>
eliminate the unnecessary usage of the event_manager argument and
fix up aarch64 attach_legacy_devices_aarch64 fn

Signed-off-by: tommady <[email protected]>
remove the aarch64 suffix from the attach_legacy_devices_aarch64
function and ensure that aarch64 smt is always set to false int the
configure_system_for_boot function

Signed-off-by: tommady <[email protected]>
@tommady
Copy link
Contributor Author

tommady commented Nov 22, 2024

Hi @ShadowCurse

I’ve addressed your comments, with one note regarding the set_stdout_nonblocking function. Please review and provide your guidance when you have a moment.

Thank you!

Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you need to make sure the changes you made pass our CI. You can look at the buildkite/firecracker-pr line at the bottom of the PR and click details to look at the CI status. Start with making sure changes do compile. I can see that on x86_86 only errors are absence of doc comments for public functions. For aarch64, well, it does not even compile. If you don't have access to arm system, you can try to use cross to cross compile.
After everything compiles, please move all changes to corresponding commits.
Also please rebase the PR so it only includes your commits.

zulinx86 and others added 3 commits November 23, 2024 06:34
`TcpIPv4Handler` for MMDS network stack preallocates several buffers
whose sizes are saved into a snapshot as `max_connections` and
`max_pending_resets` in `MmdsNetworkStackState`. But they are always the
same constant hardcoded values (`DEFAULT_MAX_CONNECTIONS` and
`DEFAULT_MAX_PENDING_RESETS`) as of today, which means there is no need
to save them into a snapshot. Even if we change the hardcoded sizes
across Firecracker versions, that should not be a problem. This is
because the snapshot feature does not support migration of network
connections and those buffers are initialized with empty on snapshot
restoration. When migrating from a Firecracker version with larger
buffers to another version with smaller ones, guest workloads that
worked previously might start to fail due to the less buffer spaces.
However, the issue is not a problem of the snapshot feature and it
should also occur even on a purely booted microVM (not restored from a
snapshot). Thus, it is fine to remove those fields from a snapshot.

Since this is a breaking change of the snapshot format, bumps the major
version.

Signed-off-by: Takahiro Itazuri <[email protected]>
We bumped the snapshot version up twice recently, requiring users to
regenerate their snapshot, but the user action isn't clearly stated.

Signed-off-by: Takahiro Itazuri <[email protected]>
let the x86_64 and aarch64 architectures code can compile and without
warnning

Signed-off-by: tommady <[email protected]>
@tommady
Copy link
Contributor Author

tommady commented Nov 23, 2024

Hi @ShadowCurse

Thank you for your patience and thorough review—it has been an incredible learning experience for me!
I’ve revisited your feedback and ensured that the entire architecture compiles cleanly, with no documentation warnings.
Please take a look whenever you have the time. 😊

Copy link
Contributor

@ShadowCurse ShadowCurse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that it at least compiles, you need to make sure it also does function as expected. If you look at the Buildkite pipelines output: https://buildkite.com/firecracker/firecracker-pr/builds/11856#0193640b-6e7c-4f49-bbcf-cb60c1c946ef/62-392 you will see that there are some unit tests that fail in the same place src/vmm/src/builder.rs:519:51. Quick look around and it seems the issue is that when pio_device_manager registers itself, it registers interrupts for a VM, but the interrupt controller is not yet created at that point. So you need to either move creation of interrupt_controller before pio_device_manager creation, or move this .register_device(..) call.

You can reproduce all of this errors if you run CI locally. You can find exact commands CI is running looking at the very beginning of the logs like this: https://buildkite.com/firecracker/firecracker-pr/builds/11856#0193640b-6e7c-4f49-bbcf-cb60c1c946ef/62-63

Start by running unit tests. There are a couple of ways to do this:

  • ./tools/devtool test -- integration_tests/build/test_unittests.py - this will use the docker container to compile and run tests
  • Compile test with cargo test -p vmm --no-run and then run with sudo ./build/cargo_target/debug/deps/vmm-.... (there will be a full path at the end of cargo test) With this method some network related unit tests will fail most likely, but you can skip those.

Also you can make sure your changes are sound by to simply trying to launch a VM with them. We have a doc for this: https://github.com/firecracker-microvm/firecracker/blob/main/docs/getting-started.md (in the section about getting a rootfs you can skip the ssh key creation if you don't want to use/test network)

make the test all passed

Signed-off-by: tommady <[email protected]>
@tommady
Copy link
Contributor Author

tommady commented Nov 27, 2024

Hi @ShadowCurse,
Thank you once again for your patience and for crafting such a detailed guide! 🙏
I’ve fixed the unit test as per your suggestions and ran the command:

./tools/devtool test -- integration_tests/build/test_unittests.py
...
integration_tests/build/test_unittests.py::test_unittests PASSED                                                                                         [ 50%]
integration_tests/build/test_unittests.py::test_benchmarks_compile PASSED                                                                                [100%]

All tests passed successfully! 🎉

When you have a moment, please review it again.

Copy link

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 66.40625% with 86 lines in your changes missing coverage. Please review.

Project coverage is 84.76%. Comparing base (4c33853) to head (34e993b).

Current head 34e993b differs from pull request most recent head 56a59a2

Please upload reports for the commit 56a59a2 to get more accurate results.

Files with missing lines Patch % Lines
src/vmm/src/builder.rs 66.00% 86 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4910      +/-   ##
==========================================
+ Coverage   84.08%   84.76%   +0.67%     
==========================================
  Files         251      223      -28     
  Lines       28058    25666    -2392     
==========================================
- Hits        23592    21755    -1837     
+ Misses       4466     3911     -555     
Flag Coverage Δ
5.10-c5n.metal 84.57% <66.40%> (-0.09%) ⬇️
5.10-m5n.metal 84.55% <66.40%> (-0.09%) ⬇️
5.10-m6a.metal 83.86% <66.40%> (-0.09%) ⬇️
5.10-m6g.metal ?
5.10-m6i.metal 84.55% <66.40%> (-0.09%) ⬇️
5.10-m7g.metal ?
6.1-c5n.metal 84.57% <66.40%> (-0.08%) ⬇️
6.1-m5n.metal 84.55% <66.40%> (-0.09%) ⬇️
6.1-m6a.metal 83.85% <66.40%> (-0.09%) ⬇️
6.1-m6g.metal ?
6.1-m6i.metal 84.55% <66.40%> (-0.09%) ⬇️
6.1-m7g.metal ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ShadowCurse
Copy link
Contributor

With x86_64 version done, now you need to make sure aarhc64 also compiles/works. If you look at the BK run again: https://buildkite.com/firecracker/firecracker-pr/builds/11886#01936ce9-9b52-4459-ac0a-f8cfbe7a3690/62-624, https://buildkite.com/firecracker/firecracker-pr/builds/11886#01936ce9-9b62-4713-b1eb-33926eacc8b0/62-882 you will see some compilation errors in unit tests and CI test for serial port in a restored VM failing. As I mentioned before, you can use cross to cross compile FC to aarch64. It might even allow some subset of unit tests to be run locally. Or you can use qemu to create an emulated aarch64 VM and test your code there.

@tommady
Copy link
Contributor Author

tommady commented Nov 29, 2024

Oh, I see… I assumed the devtool test would handle all platform testing in one sweep. Let me double-check again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants