-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
verify-unique-fs-label: Run blkid
with a clean cache to avoid stale devices
#1288
Conversation
This reverts commit e650006. The PR to fix the issue is coreos/coreos-installer#1288
40f09df
to
8826ad5
Compare
There's something important here to note, which I believe this is a regression from Now...I am not quite sure I understand how this would fix the race. As I understand it the problem is basically that because the above commit started bypassing the cache, if the device is removed at any point in this function we can start failing. Is it possible you just got lucky in testing so far? How about we skip probing any devices that are removable. (see e.g. BTW, I suspect a reliable and faster way to test this is to write a small systemd unit (bash script e.g.) that runs in the initramfs that creates and removes loopback devices that are mounted as |
src/blockdev.rs
Outdated
// Refresh the cache to avoid collecting old devices which no longer | ||
// exist. | ||
runcmd!("blkid", "-s", "none")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we just change the blkid -o device
call below to blkid --cache-file /dev/null -o device
and add a comment?
This would be if we didn't end up changing the strategy like Colin mentions in #1288 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I thought if we refresh the cache, then it would be better for later services which call blkid (to read from new catch), but agree it might have risk, I am OK with the change. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hold on and will do more testing to confirm if the issue is fixed totally.
We did previously use |
Ah fun, right. Well, one approach is to strip the leading |
This reverts commit e650006. The PR to fix the issue is coreos/coreos-installer#1288 Sorry for the confusion, and should do more testing before megerd.
This reverts commit e650006. The PR to fix the issue is coreos/coreos-installer#1288 Sorry for the confusion, and should do more testing before merged.
8826ad5
to
b651863
Compare
blkid
with a clean cache to avoid collecting old devices which no longer exist
b651863
to
d276746
Compare
d276746
to
8569ed5
Compare
Yes, seems this can fix the issue, but needs more testing from OCP to provide more confidence.
SGTM, will look more about this. |
Do you mean to skip the device that has
|
What's the difference between |
Testing result with this patch, installed OCP 3 times:
Seems this can not fix the issue totally, will hit the issue occasionally with low rate. But I still think this PR is not useless, as this can prevent to collect stale devices from cache. WDYT? Have a discussion with Azure QE, got info that the Do testing with
There is slightly race condition for any services to get |
While openshift/os#1350 LGTM, this PR seems reasonable as well so we should consider doing both |
According to blkid man
They both can get new block device attributes as bypass the cache, for the difference, I think
Seems we are using Let me know if I misunderstood your meaning. |
avoid stale devices Boot VM failed with following logs, the reproduce rate is 2/4. ``` systemd[1]: Starting CoreOS Ensure Unique Boot Filesystem... rdcore[1252]: blkid: error: /dev/sr0: No such file or directory rdcore[1252]: Error: "blkid" "-p" "/dev/sdb1" "/dev/sr0" "/dev/sda4" "/dev/sda2" "/dev/sda3" "/dev/sda1" failed with exit status: 2 systemd[1]: coreos-ignition-unique-boot.service: Main process exited, code=exited, status=1/FAILURE coreos-ignition-unique-boot.service: Failed with result 'exit-code'. Failed to start CoreOS Ensure Unique Boot Filesystem. systemd[1]: coreos-ignition-unique-boot.service: Triggering OnFailure= dependencies. ``` Seems `blkid` collecting old devices from cache which no longer exist, should run with a clean cache before gathering the list of devices. Fixes https://issues.redhat.com/browse/OCPBUGS-17643
8569ed5
to
e5abddb
Compare
blkid
with a clean cache to avoid collecting old devices which no longer existblkid
with a clean cache to avoid stale devices
I think this still makes sense to do at some point, but it seems not urgent as the systemd ordering is a stronger fix. |
Boot VM failed with following logs, the reproduce rate is 2/4.
Seems
blkid
collecting old devices from cache which no longerexist, should run with a clean cache before gathering the list of
devices.
Fixes https://issues.redhat.com/browse/OCPBUGS-17643
Do testing with this PR, hit the issue occasionally with low rate,
seems can not fix the issue totally.
Also need openshift/os#1350