verify-unique-fs-label: Run blkid with a clean cache to avoid stale devices #1288

Merged: 1 commit into coreos:main on Sep 5, 2023

Conversation

HuijingHei
Member

@HuijingHei commented Aug 31, 2023

Booting the VM failed with the following logs; the reproduction rate is 2/4.

systemd[1]: Starting CoreOS Ensure Unique Boot Filesystem...
rdcore[1252]: blkid: error: /dev/sr0: No such file or directory
rdcore[1252]: Error: "blkid" "-p" "/dev/sdb1" "/dev/sr0"
"/dev/sda4" "/dev/sda2" "/dev/sda3" "/dev/sda1" failed with exit
status: 2
systemd[1]: coreos-ignition-unique-boot.service: Main process
exited, code=exited, status=1/FAILURE
coreos-ignition-unique-boot.service: Failed with result 'exit-code'.
Failed to start CoreOS Ensure Unique Boot Filesystem.
systemd[1]: coreos-ignition-unique-boot.service: Triggering
OnFailure= dependencies.

It seems blkid is collecting old devices from the cache which no longer
exist; it should run with a clean cache before gathering the list of
devices.

Fixes https://issues.redhat.com/browse/OCPBUGS-17643


Testing with this PR still hits the issue occasionally, at a low rate,
so it does not fix the issue completely on its own.
Also need openshift/os#1350

HuijingHei added a commit to HuijingHei/os that referenced this pull request Aug 31, 2023
@HuijingHei force-pushed the blkid-refresh-cache branch from 40f09df to 8826ad5 on August 31, 2023 12:26
@cgwalters
Member

There's something important to note here: I believe this is a regression from 50e9034.
That's why we're only seeing it in relatively recent builds.

Now...I am not quite sure I understand how this would fix the race. As I understand it the problem is basically that because the above commit started bypassing the cache, if the device is removed at any point in this function we can start failing.

Is it possible you just got lucky in testing so far?

How about we skip probing any devices that are removable? (See e.g. lsblk -o NAME,RM for the human-readable output; this is the rm property in JSON.)

BTW, I suspect a reliable and faster way to test this would be to write a small systemd unit (e.g. a bash script) that runs in the initramfs and creates and removes loopback devices mounted as isofs in a tight loop; then it could be trivially tested in qemu too.

src/blockdev.rs Outdated
Comment on lines 888 to 890
// Refresh the cache to avoid collecting old devices which no longer
// exist.
runcmd!("blkid", "-s", "none")?;
Member


How about we just change the `blkid -o device` call below to `blkid --cache-file /dev/null -o device` and add a comment?

This would be the approach if we didn't end up changing the strategy like Colin mentions in #1288 (comment).
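A minimal sketch of what that suggested change might amount to. This uses std::process::Command directly rather than the project's runcmd!-style helpers, and the function name is hypothetical:

```rust
use std::process::Command;

/// Hypothetical sketch: gather the block device list the way the service does,
/// but with a clean blkid cache so devices that have since disappeared
/// (e.g. /dev/sr0) are no longer reported.
fn list_block_devices() -> std::io::Result<Vec<String>> {
    // `--cache-file /dev/null` (short form `-c /dev/null`) starts from an
    // empty cache instead of reusing the cache file populated earlier in boot.
    let out = Command::new("blkid")
        .args(["--cache-file", "/dev/null", "-o", "device"])
        .output()?;
    Ok(String::from_utf8_lossy(&out.stdout)
        .lines()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect())
}
```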

Member Author


That makes sense. I thought that if we refreshed the cache it would be better for later services which call blkid (so they read from a fresh cache), but I agree it might carry risk, so I am OK with the change. Thanks!

Member Author


Hold on; I will do more testing to confirm whether the issue is fixed completely.

@dustymabe
Member

How about we skip probing any devices that are removable. (see e.g. lsblk -o NAME,RM for the human-readable output, and this is the rm property in JSON).

We did previously use lsblk for some things here but we moved away from it. Not sure if the reason we moved away still applies or not: #813

@cgwalters
Member

We did previously use lsblk for some things here but we moved away from it. Not sure if the reason we moved away still applies or not: #813

Ah fun, right. Well, one approach is to strip the leading /dev, and then just directly look for /sys/class/block/$device/removable. Or actually, I think we could stop using blkid here and look at /sys/class/block directly...

HuijingHei added a commit to HuijingHei/os that referenced this pull request Sep 1, 2023
This reverts commit e650006.
The PR to fix the issue is coreos/coreos-installer#1288

Sorry for the confusion; I should do more testing before merging.
@HuijingHei changed the title from "rdcore: refresh the cache to avoid collecting old devices which no longer exist" to "rdcore: run blkid with a clean cache to avoid collecting old devices which no longer exist" on Sep 1, 2023
src/blockdev.rs Outdated (resolved)
@HuijingHei
Member Author

Is it possible you just got lucky in testing so far?

Yes, it seems this can fix the issue, but it needs more testing from OCP to provide more confidence.
Did simple testing with the new change and the revert PR: booted 10 VMs with a Standard_DC16ads_v5 CVM, and all were able to boot.
Without the patch, the reproduction rate is 1/5.

BTW, I suspect a reliable and faster way to test this is to write a small systemd unit (bash script e.g.) that runs in the initramfs that creates and removes loopback devices that are mounted as isofs in a tight loop, then it can be trivially tested in qemu too.

SGTM, will look into this more.

@HuijingHei
Member Author

We did previously use lsblk for some things here but we moved away from it. Not sure if the reason we moved away still applies or not: #813

Ah fun, right. Well, one approach is to strip the leading /dev, and then just directly look for /sys/class/block/$device/removable. Or actually, I think we could stop using blkid here and look at /sys/class/block directly...

Do you mean to skip any device that has removable with value 1?
It seems disk partitions do not have removable; only whole disks like vda/vdb have the removable file, and its value there is 0.

$ ls /sys/class/block/
sr0  vda  vda1	vda2  vda3  vda4

$ ls /sys/class/block/*/removable
/sys/class/block/sr0/removable	/sys/class/block/vda/removable

[cloud-user@citest-1 ~]$ cat /sys/class/block/vda/removable
0
[cloud-user@citest-1 ~]$ cat /sys/class/block/sr0/removable
1
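For reference, a minimal sketch of that sysfs check (the helper name is hypothetical): strip the leading /dev and read the removable attribute; partitions have no such file, so they simply fall through to "not removable".

```rust
/// Hypothetical helper: true if the kernel flags the device as removable.
/// `dev` is a path such as "/dev/sr0"; partitions like "/dev/vda1" have no
/// `removable` attribute, so the read fails and we fall back to `false`.
fn is_removable(dev: &str) -> bool {
    let name = dev.trim_start_matches("/dev/");
    std::fs::read_to_string(format!("/sys/class/block/{name}/removable"))
        .map(|s| s.trim() == "1")
        .unwrap_or(false)
}
```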

@travier
Member

travier commented Sep 4, 2023

What's the difference between -p and -c /dev/null? Why not use -p everywhere? (or the reverse)

@HuijingHei
Member Author

Testing results with this patch, installing OCP 3 times:

  • 1st time failed: 2 master nodes hit the same issue.
  • the other 2 times, all nodes managed to boot.

It seems this cannot fix the issue completely; it will still be hit occasionally at a low rate.

But I still think this PR is not useless, as it prevents collecting stale devices from the cache. WDYT?


Had a discussion with Azure QE and got the info that /dev/sr0 will be removed by the platform once provisioning succeeds.

Did testing with Before=ignition-fetch.service in rhcos-afterburn-checkin.service, and got the expected error that Ignition failed to find /dev/sr0 about 1 second after check-in finished. (Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1980679)

[    9.881886] systemd[1]: Finished Afterburn (Check In - from the initramfs).
...
[   10.732530] ignition[989]: failed to open config device: open /dev/sr0: no such file or directory

There is a slight race condition for any service that needs /dev/sr0 and runs after check-in (though it does not hit 100% of the time), so we need to order coreos-ignition-unique-boot before the check-in service, instead of after.
See openshift/os#1350
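For illustration only, the ordering being described would look roughly like this hypothetical drop-in; the actual unit changes live in openshift/os#1350, not in this repository:

```
# hypothetical drop-in for coreos-ignition-unique-boot.service:
# run the unique-boot check before the Azure check-in, while /dev/sr0
# is still attached to the VM
[Unit]
Before=rhcos-afterburn-checkin.service
```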

@travier
Member

travier commented Sep 4, 2023

While openshift/os#1350 LGTM, this PR seems reasonable as well, so we should consider doing both.

@HuijingHei
Member Author

What's the difference between -p and -c /dev/null?

According to the blkid man page:

  • -c /dev/null means: start with a clean cache
  • -p <device> means: switch to low-level superblock probing mode (bypassing the cache)

Both get fresh block device attributes because they bypass the cache; the difference is that -p needs devices as parameters, which we do not know up front.

Why not use -p everywhere? (or the reverse)

It seems we are already using -p to avoid the cache; see 50e9034 and coreos/fedora-coreos-config#2184.


Let me know if I misunderstood your meaning.
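To make that difference concrete, here is a rough sketch of the two call shapes (again using std::process::Command directly, with hypothetical names): the clean-cache form needs no device arguments, while -p only works once you already have concrete devices to hand it.

```rust
use std::process::Command;

/// Hypothetical sketch contrasting the two blkid invocations discussed above.
fn probe_all_devices() -> std::io::Result<std::process::Output> {
    // Step 1: list devices with a clean cache; no device arguments are needed.
    let list = Command::new("blkid")
        .args(["-c", "/dev/null", "-o", "device"])
        .output()?;
    let devices: Vec<String> = String::from_utf8_lossy(&list.stdout)
        .lines()
        .map(str::to_owned)
        .collect();
    // Step 2: low-level probing (-p) bypasses the cache entirely, but it needs
    // the concrete device paths we just gathered.
    Command::new("blkid").arg("-p").args(&devices).output()
}
```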

@HuijingHei marked this pull request as ready for review September 5, 2023 07:43
verify-unique-fs-label: Run blkid with a clean cache to avoid stale devices

Booting the VM failed with the following logs; the reproduction rate is 2/4.
```
systemd[1]: Starting CoreOS Ensure Unique Boot Filesystem...
rdcore[1252]: blkid: error: /dev/sr0: No such file or directory
rdcore[1252]: Error: "blkid" "-p" "/dev/sdb1" "/dev/sr0"
"/dev/sda4" "/dev/sda2" "/dev/sda3" "/dev/sda1" failed with exit
status: 2
systemd[1]: coreos-ignition-unique-boot.service: Main process
exited, code=exited, status=1/FAILURE
coreos-ignition-unique-boot.service: Failed with result 'exit-code'.
Failed to start CoreOS Ensure Unique Boot Filesystem.
systemd[1]: coreos-ignition-unique-boot.service: Triggering
OnFailure= dependencies.
```
It seems `blkid` is collecting old devices from the cache which no longer
exist; it should run with a clean cache before gathering the list of
devices.

Fixes https://issues.redhat.com/browse/OCPBUGS-17643
@HuijingHei changed the title from "rdcore: run blkid with a clean cache to avoid collecting old devices which no longer exist" to "verify-unique-fs-label: Run blkid with a clean cache to avoid stale devices" on Sep 5, 2023
@cgwalters
Member

I think this still makes sense to do at some point, but it seems not urgent as the systemd ordering is a stronger fix.

@HuijingHei merged commit e0690cc into coreos:main on Sep 5, 2023
13 checks passed
@HuijingHei deleted the blkid-refresh-cache branch September 5, 2023 12:51