[6.11,6.12] Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size #799

Open
elmystico opened this issue Dec 11, 2024 · 12 comments

Comments


elmystico commented Dec 11, 2024

I/O ATE MY FLASH after two weeks or so. Fortunately those were not expensive at all, just some old pieces.

Setup: two 256 GiB NVMe partitions and two 34 GiB HDD partitions, four partitions in total.
(I'm not using this configuration anymore, but I've tried it a few times from scratch and it was reproducible each time.)
Kernel v6.11 (Debian testing).

```sh
bcachefs format --fs_label=data --replicas=2 --block_size=4k --background_compression=lz4:1 \
    --label=dhdd.tosh4310 /dev/sda3 --label=dhdd.tosh21F0 /dev/sdb3 \
    --discard \
    --label=dnvme.970evo /dev/nvme0n1p4 \
    --label=dnvme.960evo /dev/nvme1n1p4 \
    --foreground_target=dnvme --background_target=dhdd
```
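For reference, a filesystem formatted like this is mounted by listing all member devices colon-separated (the mount point below is illustrative):

```sh
# Mount all four members as a single filesystem (mount point is illustrative)
sudo mount -t bcachefs /dev/nvme0n1p4:/dev/nvme1n1p4:/dev/sda3:/dev/sdb3 /mnt/data
```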

After filling it with some data and putting live processes on it, `bcachefs fs usage -h` shows:

```
Size: 534 GiB
Used: 124 GiB
Online reserved: 1.96 MiB

Data type Required/total Durability Devices
reserved: 1/2 [] 151 MiB
btree: 1/2 2 [nvme0n1p4 nvme1n1p4] 4.51 GiB
user: 1/2 2 [sda3 sdb3] 63.5 GiB
user: 1/2 2 [sda3 nvme0n1p4] 977 MiB
user: 1/2 2 [sda3 nvme1n1p4] 961 MiB
user: 1/2 2 [sdb3 nvme0n1p4] 968 MiB
user: 1/2 2 [sdb3 nvme1n1p4] 985 MiB
user: 1/2 2 [nvme0n1p4 nvme1n1p4] 52.3 GiB
cached: 1/1 1 [sda3] 440 KiB
cached: 1/1 1 [sdb3] 384 KiB
cached: 1/1 1 [nvme0n1p4] 14.1 GiB
cached: 1/1 1 [nvme1n1p4] 14.1 GiB

Compression:
type compressed uncompressed average extent size
lz4 51.8 GiB 197 GiB 70.5 KiB
incompressible 147 GiB 147 GiB 70.2 KiB

Btree usage:
extents: 1.19 GiB
inodes: 305 MiB
dirents: 107 MiB
xattrs: 389 MiB
alloc: 677 MiB
reflink: 137 MiB
subvolumes: 512 KiB
snapshots: 512 KiB
lru: 22.5 MiB
freespace: 5.00 MiB
need_discard: 1.00 MiB
backpointers: 1.52 GiB
bucket_gens: 11.0 MiB
snapshot_trees: 512 KiB
deleted_inodes: 512 KiB
logged_ops: 1.00 MiB
rebalance_work: 117 MiB
subvolume_children: 512 KiB
accounting: 69.5 MiB

Pending rebalance work:
54.3 GiB

dhdd.tosh21F0 (device 1): sdb3 rw
data buckets fragmented
free: 1.06 GiB 4339
sb: 3.00 MiB 13 252 KiB
journal: 272 MiB 1088
btree: 0 B 0
user: 32.7 GiB 133824 100 KiB
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 0 B 0
unstriped: 0 B 0
capacity: 34.0 GiB 139264

dhdd.tosh4310 (device 0): sda3 rw
data buckets fragmented
free: 1.07 GiB 4381
sb: 3.00 MiB 13 252 KiB
journal: 272 MiB 1088
btree: 0 B 0
user: 32.7 GiB 133782 12.0 KiB
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 0 B 0
unstriped: 0 B 0
capacity: 34.0 GiB 139264

dnvme.960evo (device 3): nvme1n1p4 rw
data buckets fragmented
free: 196 GiB 802284
sb: 3.00 MiB 13 252 KiB
journal: 2.00 GiB 8192
btree: 2.25 GiB 9237
user: 27.1 GiB 111187 360 KiB
cached: 14.1 GiB 117422 14.5 GiB
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 60.3 MiB 241
unstriped: 0 B 0
capacity: 256 GiB 1048576

dnvme.970evo (device 2): nvme0n1p4 rw
data buckets fragmented
free: 197 GiB 808055
sb: 3.00 MiB 13 252 KiB
journal: 2.00 GiB 8192
btree: 2.25 GiB 9237
user: 27.1 GiB 111186 92.0 KiB
cached: 14.1 GiB 110583 12.9 GiB
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 328 MiB 1310
unstriped: 0 B 0
capacity: 256 GiB 1048576
```

Look at the pending rebalance amount. `bcachefs show-super` output:

```
Device: (unknown device)
External UUID: e9807c87-b09b-4cde-8065-4a475de5e2cb
Internal UUID: 16fc0099-7df6-4ea3-9f4e-49cfc10034c9
Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index: 1
Label: data
Version: 1.12: rebalance_work_acct_fix
Version upgrade complete: 1.12: rebalance_work_acct_fix
Oldest version on disk: 1.12: rebalance_work_acct_fix
Created: Fri Nov 15 17:10:58 2024
Sequence number: 75
Time of last write: Sun Dec 1 00:31:50 2024
Superblock size: 5.38 KiB/1.00 MiB
Clean: 0
Devices: 4
Sections: members_v1,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features: lz4,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
block_size: 4.00 KiB
btree_node_size: 256 KiB
errors: continue [fix_safe] panic ro
metadata_replicas: 2
data_replicas: 2
metadata_replicas_required: 1
data_replicas_required: 1
encoded_extent_max: 64.0 KiB
metadata_checksum: none [crc32c] crc64 xxhash
data_checksum: none [crc32c] crc64 xxhash
compression: none
background_compression: lz4:1
str_hash: crc32c crc64 [siphash]
metadata_target: none
foreground_target: dnvme
background_target: dhdd
promote_target: none
erasure_code: 0
inodes_32bit: 1
shard_inode_numbers: 1
inodes_use_key_cache: 1
gc_reserve_percent: 8
gc_reserve_bytes: 0 B
root_reserve_percent: 0
wide_macs: 0
promote_whole_extents: 1
acl: 1
usrquota: 0
grpquota: 0
prjquota: 0
journal_flush_delay: 1000
journal_flush_disabled: 0
journal_reclaim_delay: 100
journal_transaction_names: 1
allocator_stuck_timeout: 30
version_upgrade: [compatible] incompatible none
nocow: 0

members_v2 (size 592):
Device: 0
Label: tosh4310 (1)
UUID: a04ae694-690c-49fa-999d-c35db9e55b9f
Size: 34.0 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 139264
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,user,cached
Btree allocated bitmap blocksize: 1.00 B
Btree allocated bitmap: 0000000000000000000000000000000000000000000000000000000000000000
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 1
Label: tosh21F0 (2)
UUID: 08632210-3ddf-4290-971d-17bb26f979e4
Size: 34.0 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 139264
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,user,cached
Btree allocated bitmap blocksize: 1.00 B
Btree allocated bitmap: 0000000000000000000000000000000000000000000000000000000000000000
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 2
Label: 970evo (4)
UUID: b94e6dd2-e553-4e03-b6d6-7e39c799267b
Size: 256 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 1048576
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user,cached
Btree allocated bitmap blocksize: 8.00 MiB
Btree allocated bitmap: 0000000010000001100000000000000000000000000000001110010100000101
Durability: 1
Discard: 1
Freespace initialized: 1
Device: 3
Label: 960evo (5)
UUID: b03e2746-b6f5-4474-b692-f5fb70ac0662
Size: 256 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 1048576
Last mount: Sun Dec 1 00:30:28 2024
Last superblock write: 75
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user,cached
Btree allocated bitmap blocksize: 8.00 MiB
Btree allocated bitmap: 0000000000000000100000000000000000000000000000000110010100000101
Durability: 1
Discard: 1
Freespace initialized: 1

errors (size 24):
accounting_mismatch 20 Sun Dec 1 00:30:49 2024
```

@nitinkmr333

Duplicate of #795

@nitinkmr333

@elmystico It shouldn't eat into the TBW rating of your SSD, since only reads are affected (at least in my testing).


elmystico commented Dec 16, 2024

Fair enough @nitinkmr333 - I've made a VM just for this: 2x32 GiB plus 2x16 GiB, with the background target on the smaller pair, and I see constant I/O with writes for no reason, while the "pending rebalance amount" doesn't change whatsoever.
Please test with parameters similar to mine. Fill the filesystem so there is too much data to fit into the background target, with 2 copies. When rebalance fills the background disks it somehow doesn't stop, and there is constant r/w I/O.
Having reported this, it doesn't seem to be a duplicate of #795! Or if it is, #795 would show r/w I/O as well. This can go unnoticed because the bcachefs kernel thread doesn't show writes, but when you look at overall system writes it sure does them.
Also have a look at the I/O measurements inside and outside the VM (high r/w!). As soon as I umount the fs, I/O drops to zero.
[screenshot: Zrzut ekranu (182)]

[screenshot: Screenshot_disker_2024-12-16_133618]
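
For reproduction, a format command mirroring that VM layout would look roughly like this (device names and labels are illustrative; the options are carried over from the original format command above):

```sh
# Two 32 GiB foreground disks and two 16 GiB background disks inside the VM
# (device names and labels are illustrative)
sudo bcachefs format --replicas=2 --block_size=4k --background_compression=lz4:1 \
    --label=vssd.a /dev/vda --label=vssd.b /dev/vdb \
    --label=vhdd.a /dev/vdc --label=vhdd.b /dev/vdd \
    --foreground_target=vssd --background_target=vhdd
sudo mount -t bcachefs /dev/vda:/dev/vdb:/dev/vdc:/dev/vdd /mnt
```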

@elmystico

Hm, after upgrading the kernel v6.11 -> v6.12 there is no more I/O with writes, but still full I/O saturation, so perhaps this is a duplicate of #795 (as you mentioned, @nitinkmr333).

[screenshot: Zrzut ekranu (219)]

@elmystico

Hm, I can see that you've been using v6.11 as well, @nitinkmr333 - perhaps it was just the reboot that made the writes stop? Anyway, until you reboot it perhaps stays stuck with r/w I/O, not read-only.

@elmystico elmystico changed the title Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size [6.11,6.12] Constant I/O (rebalance) when foreground 2x nvme + background 2x HDD when nvme size >> HDD size Dec 16, 2024
@nitinkmr333

@elmystico I tested it by creating loopback devices.

On kernel 6.11, I noticed that bcachefs was doing heavy reads but not writes. However, my underlying btrfs filesystem (on which I created the loopback devices) was doing the same amount of writes (perhaps btrfs is rewriting some data because of the loopback devices?).
[screenshot]
I believe it is the same case as yours. In the image you shared (using iotop), I can see bcachefs is doing reads but writes are probably done by your underlying filesystem (where qcow2 images are created).

After upgrading to kernel 6.12.2, I noticed that the underlying btrfs filesystem is no longer doing those writes on the same setup. There are only reads now (by bcachefs):
[screenshot: Screenshot_20241217_095246_crop]

I also checked real hardware (an SD card and a hard drive) by creating 2 partitions - a foreground and a background target - on each of them. There were reads but no writes on the filesystem (even on kernel 6.11) after filling the background target partitions.

Rebooting or remounting these bcachefs drives does not make any difference in my case.

I will try your VM setup.


elmystico commented Dec 17, 2024

> I believe it is the same case as yours. In the image you shared (using iotop), I can see bcachefs is doing reads but writes are probably done by your underlying filesystem (where qcow2 images are created).

Yeah, I've seen that with bare-metal bcachefs as well. It looks like the bch-rebalance kthread hides its write I/O inside a different thread or something like that, because it can't be seen directly, but you can see the write I/O happening at the drive level.
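
One way to check this (a sketch; the device names are the ones from the original setup) is to compare the per-thread I/O accounting with the block-device counters:

```sh
# What the bcachefs kernel threads report in per-task I/O accounting
sudo pidstat -dt 5 | grep -i bch        # bch-rebalance shows reads, little or no writes

# What the block devices actually see
iostat -x 5 sda sdb nvme0n1 nvme1n1     # write throughput shows up here if it is real
```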

Anyway, I think it's fine now to wait for Kent's or another dev's reaction; we don't know whether more info is needed for a fix, and if so, what.


nitinkmr333 commented Dec 22, 2024

I tried bcachefs inside a CachyOS VM (kernel 6.12, KVM/QEMU), similar to your setup (2 x SSD, 2 x HDD, with the background_target filled completely), and I can only see constant heavy reads on both the host (NixOS, kernel 6.12) and inside the CachyOS VM, but there are no writes (or only minimal writes from other services).

I guess the heavy "writes" issue has been solved with the 6.12 kernel and might only be present in 6.11.


nitinkmr333 commented Dec 29, 2024

Adding my comment from the duplicate issue #795:
#795 (comment)

It looks like the issue is related to Pending rebalance work and not just background_target itself. We face this bug if there is Pending rebalance work that needs to be done, but cannot be completed for some reason (maybe we are constantly rescanning the pending rebalance, resulting in I/O?).

For example, we can create a filesystem with 2 disks (foreground_target=ssd, background_target=hdd, replicas=1), and write some data with data_replicas=2-

❯ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0  1000M  0 loop /mnt
loop1         7:1    0  1000M  0 loop
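
The two loop devices above can be set up roughly like this (backing-file paths are illustrative; sizes, labels, and options match the lsblk output above and the show-super output below):

```sh
# Two 1000 MiB backing files attached as loop devices (paths are illustrative)
truncate -s 1000M /var/tmp/ssd.img /var/tmp/hdd.img
sudo losetup -f --show /var/tmp/ssd.img     # -> /dev/loop0
sudo losetup -f --show /var/tmp/hdd.img     # -> /dev/loop1

# replicas=1 overall, foreground on the "ssd", background on the "hdd"
sudo bcachefs format --replicas=1 \
    --label=ssd /dev/loop0 \
    --label=hdd /dev/loop1 \
    --foreground_target=ssd --background_target=hdd
sudo mount -t bcachefs /dev/loop0:/dev/loop1 /mnt
```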

show-super-

❯ sudo bcachefs show-super /dev/loop0
Device:                                     (unknown device)
External UUID:                             a7f2bff7-29ee-4e49-9e01-0cbe16c7332a
Internal UUID:                             63201fc7-0bff-4422-baa4-d243bc81a483
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.13: inode_has_child_snapshots
Version upgrade complete:                  1.13: inode_has_child_snapshots
Oldest version on disk:                    1.13: inode_has_child_snapshots
Created:                                   Sun Dec 29 14:58:13 2024
Sequence number:                           20
Time of last write:                        Sun Dec 29 15:01:46 2024
Superblock size:                           4.67 KiB/1.00 MiB
Clean:                                     0
Devices:                                   2
Sections:                                  members_v1,replicas_v0,disk_groups,clean,journal_v2,counters,members_v2,errors,ext,downgrade
Features:                                  new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              512 B
  btree_node_size:                         128 KiB
  errors:                                  continue [fix_safe] panic ro 
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash 
  data_checksum:                           none [crc32c] crc64 xxhash 
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash] 
  metadata_target:                         none
  foreground_target:                       ssd
  background_target:                       hdd
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none 
  nocow:                                   0

members_v2 (size 304):
Device:                                    0
  Label:                                   ssd (0)
  UUID:                                    72944387-5f76-407e-8152-6e25d95d8cc3
  Size:                                    1000 MiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             128 KiB
  First bucket:                            0
  Buckets:                                 8000
  Last mount:                              Sun Dec 29 15:01:46 2024
  Last superblock write:                   20
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        4.00 KiB
  Btree allocated bitmap:                  0000010000000000000000000000000000000000000000000000000001100000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1
Device:                                    1
  Label:                                   hdd (1)
  UUID:                                    75c87134-d657-4ae5-91aa-4fff722d2a11
  Size:                                    1000 MiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             128 KiB
  First bucket:                            0
  Buckets:                                 8000
  Last mount:                              Sun Dec 29 15:01:46 2024
  Last superblock write:                   20
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                user
  Btree allocated bitmap blocksize:        1.00 B
  Btree allocated bitmap:                  0000000000000000000000000000000000000000000000000000000000000000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1

Now, write some data to a folder having data_replicas=2 (using xattr)-

cd /mnt
sudo mkdir data_xattr
sudo bcachefs set-file-option --data_replicas=2 data_xattr
sudo dd if=/dev/zero of=data_xattr/file bs=200M count=1 status=progress

We have enough free space in the background_target but can only store 1 replica, hence there is Pending rebalance work-

❯ sudo bcachefs fs usage -h /mnt
Filesystem: a7f2bff7-29ee-4e49-9e01-0cbe16c7332a
Size:                       1.80 GiB
Used:                        403 MiB
Online reserved:                 0 B

Data type       Required/total  Durability    Devices
btree:          1/1             1             [loop0]             3.13 MiB
user:           1/2             2             [loop0 loop1]        400 MiB

Btree usage:
extents:             512 KiB
inodes:              128 KiB
dirents:             128 KiB
alloc:               640 KiB
subvolumes:          128 KiB
snapshots:           128 KiB
lru:                 128 KiB
freespace:           128 KiB
need_discard:        128 KiB
backpointers:        640 KiB
bucket_gens:         128 KiB
snapshot_trees:      128 KiB
rebalance_work:      128 KiB
accounting:          128 KiB

Pending rebalance work:
200 MiB

hdd (device 1):                loop1              rw
                                data         buckets    fragmented
  free:                      789 MiB            6313
  sb:                       3.00 MiB              25       124 KiB
  journal:                  7.75 MiB              62
  btree:                         0 B               0
  user:                      200 MiB            1600
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 1000 MiB            8000

ssd (device 0):                loop0              rw
                                data         buckets    fragmented
  free:                      786 MiB            6288
  sb:                       3.00 MiB              25       124 KiB
  journal:                  7.75 MiB              62
  btree:                    3.13 MiB              25
  user:                      200 MiB            1600
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 1000 MiB            8000

This causes heavy reads on the filesystem.
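
In this state the constant read load can be watched with the same tools used earlier in the thread (a sketch; the mount point matches the example above):

```sh
# bch-rebalance shows up doing continuous reads while the filesystem is otherwise idle
sudo iotop -o -d 5

# The pending amount never shrinks, no matter how long rebalance keeps churning
sudo watch -n 5 "bcachefs fs usage -h /mnt | grep -A1 'Pending rebalance work'"
```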


aviallon commented Jan 7, 2025

I experience a similar issue, where I have 3 x 1 GiB foreground/promote/metadata targets + 1 x 4 TiB background_target, with data_replicas=2.
As soon as I try to write more than 3 GiB of data, bch-rebalance goes mad trying to read the foreground targets to death, while Pending rebalance work stays stuck at slightly less than 3 GiB.
I was able to stop the madness without rebooting thanks to bcachefs set-file-option --data_replicas=1.
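
For reference, a sketch of that workaround (the path is illustrative):

```sh
# Relax the replica requirement on the affected subtree so rebalance no longer
# has extents it cannot place (path is illustrative)
sudo bcachefs set-file-option --data_replicas=1 /mnt/data

# Afterwards the constant reads stop; the pending amount can be re-checked with:
sudo bcachefs fs usage -h /mnt | grep -A1 'Pending rebalance work'
```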

@alexminder

@aviallon your background_target is only 4 TB, but if you write 3 TB with 2 replicas you need a minimum of 6 TB for this target, plus ~8% reserved for rebalance. You should:

  • add more space to the background target; or
  • reduce replicas; or
  • disable the background target.

I experience the same behaviour, and IMHO bcachefs should handle it more gracefully.

I believe it is a separate issue.

@aviallon

@alexminder I was writing 3 Gigabytes, not Terabytes.
And in my honest opinion, the filesystem should return ENOSPC instead of just stalling.
