[Proposal] Change zstd compression levels mapping #643

LRFLEW · 2024-01-28T06:05:53Z

I noticed that bcachefs is mapping its compression levels (1-15) to zstd's compression levels (1-22) using floor(level * 3 / 2). This PR proposes to change this mapping ~~to be direct, and use only zstd compression levels 1-15~~ (see later comments about the new mapping). My reasoning for this change is as follows:

This mapping, requiring integer truncation, results in an uneven mapping of compression levels. Increasing the bcachefs compression level by one will increase the underlying compression level by either one or two depending on the starting level. Using a direct mapping of compression levels fixes this.
bcachefs uses a default compression level of 3, which seems to map to zstd's default compression level of 3. However, the remapping of compression levels means that a bcachefs compression level of 3 is actually a zstd compression level of 4, which is more than the default for zstd. Using a direct mapping will bring the default compression levels of bcachefs and zstd in line with each other.
Referring to zstd's compression levels as being "from 1 to 22" is a bit disingenuous, as it ignores the seven "negative" compression levels (accessed in the command line tool with --fast=N). Having bcachefs ignore some of the higher compression levels is not a significant deal, considering that it's already ignoring some of the lower compression levels.
Similarly, with the command-line tool zstd, the compression levels 20-22 are not accessible without an additional command-line argument (--ultra) to tell it you really want to use compression levels that high. Counting those higher compression levels as being equivalent to the others can be misleading, and excluding them shouldn't be a significant change.
btrfs's zstd compression also has 15 compression levels, but they map their compression levels to zstd's compression levels 1-15 directly. Changing bcachefs's compression level mapping to match btrfs means that existing benchmarks for btrfs compression levels will be more accurate for bcachefs.

If keeping access to these higher compression levels is a priority, then perhaps a non-linear mapping would be better suited. One option would be to map the lower levels directly to the zstd levels, and then have the higher levels skip every other compression level (clamped to <=22). This would provide a compromise between providing more low compression levels and maintaining access to the max compression level, all while addressing the above issues. Let me know if you'd rather this solution, and I can update the PR with it.

boomshroom · 2024-01-28T21:55:32Z

Since background_compression happens, well, in the background, it's not as important for it to be fast and it makes sense to let the system use spare background cycles getting the maximum possible compression for very infrequently accessed files. As such, being able to access level 22 for this purpose would be nice and I'd appreciate not losing it.

If we wanted a piecewise-linear mapping with a slope of 1 at (0,0) and a slope of 2 at (15,22), then they would meet at exactly (8,8), giving one extra point to the lower range.

LRFLEW · 2024-01-29T01:56:19Z

I noticed that piecewise-linear mapping while experimenting with it, and I do think that's a good option. The only issue I have with it is that levels 20-22 might want to be considered separate.

Levels >= 20, labeled --ultra, should be used with caution, as they require more memory. (source)

Because of that, I think level 19, the maximum non-ultra level, should be included in the mapping. As such, I would propose using the mapping points (0,0) and (14,19), giving an intersection point of (9,9). Then the top level 15 would be mapped to the top ultra level 22. The only possible issue I see is that it adds additional complexity to the mapping.

This prevents going emergency read only when the user has specified replicas_required > replicas. Signed-off-by: Kent Overstreet <[email protected]>

Signed-off-by: Kent Overstreet <[email protected]>

We don't want journal write completions to be blocked behind btree transactions - io_complete_wq is used for btree updates after data and metadata writes. Signed-off-by: Kent Overstreet <[email protected]>

Signed-off-by: Kent Overstreet <[email protected]>

Previously, any time we failed to get a journal reservation we'd retry, with the journal lock held; but this isn't necessary given wait_event()/wake_up() ordering. This avoids performance cliffs when the journal starts to get backed up and lock contention shoots up. Signed-off-by: Kent Overstreet <[email protected]>

Signed-off-by: Kent Overstreet <[email protected]>

For DT_SUBVOL, we now print both parent and child subvol IDs. Signed-off-by: Kent Overstreet <[email protected]>

Most bcachefs workqueues are used for completions, and should be WQ_HIGHPRI - this helps reduce queuing delays, we want to complete quickly once we can no longer signal backpressure by blocking. Signed-off-by: Kent Overstreet <[email protected]>

Minor renaming for clarity, bit of refactoring. Signed-off-by: Kent Overstreet <[email protected]>

Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __ from the name - this is a normal interface. Signed-off-by: Kent Overstreet <[email protected]>

Eliminates some error paths - no longer have a hardcoded BCH_REPLICAS_MAX limit. Signed-off-by: Kent Overstreet <[email protected]>

Signed-off-by: Kent Overstreet <[email protected]>

This gives us a way to record the date and time every journal entry was written - useful for debugging. Signed-off-by: Kent Overstreet <[email protected]>

Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <[email protected]>

Recently a severe performance regression was discovered, which bisected to a6548c8 bcachefs: Avoid flushing the journal in the discard path It turns out the old behaviour, which issued excessive journal flushes, worked around a performance issue where queueing delays would cause the journal to not be able to write quickly enough and stall. The journal flushes masked the issue because they periodically flushed the device write cache, reducing write latency for non flushes. This patch reworks the journalling code to allow more than one (non-flush) write to be in flight at a time. With this patch, doing 4k random writes and an iodepth of 128, we are now able to hit 560k iops to a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range. Signed-off-by: Kent Overstreet <[email protected]>

we now always have a btree_trans when using a btree_and_journal_iter; prep work for adding prefetching to btree_and_journal_iter Signed-off-by: Kent Overstreet <[email protected]>

btree_and_journal_iter is old code that we want to get rid of, but we're not ready to yet. lack of btree node prefetching is, it turns out, a real performance issue for fsck on spinning rust, so - add it. Signed-off-by: Kent Overstreet <[email protected]>

Files within a subvolume cannot be renamed into another subvolume, but subvolumes themselves were intended to be. This implements subvolume renaming - we need to ensure that there's only a single dirent that points to a subvolume key (not multiple versions in different snapshots), and we need to ensure that dirent.d_parent_subol and inode.bi_parent_subvol are updated. Signed-off-by: Kent Overstreet <[email protected]>

switch the statfs code from something horrible and open coded to the more standard uuid_to_fsid() Signed-off-by: Kent Overstreet <[email protected]>

koverstreet force-pushed the master branch 6 times, most recently from d0860cf to 253d7fb Compare February 10, 2024 22:41

Kent Overstreet added 22 commits February 11, 2024 00:58

bcachefs: Clamp replicas_required to replicas

67a8e7d

This prevents going emergency read only when the user has specified replicas_required > replicas. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Kill unnecessary wakeups in journal reclaim

b360f29

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Split out journal workqueue

8cc173c

We don't want journal write completions to be blocked behind btree transactions - io_complete_wq is used for btree updates after data and metadata writes. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Avoid setting j->write_work unnecessarily

02e47aa

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Journal writes should be REQ_SYNC|REQ_META

17dbd06

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: fixup for building in userspace

b9ce48e

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Improve bch2_dirent_to_text()

d7b8ef8

For DT_SUBVOL, we now print both parent and child subvol IDs. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()

e06a68f

Minor renaming for clarity, bit of refactoring. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Cleanup bch2_dirent_lookup_trans()

c7e84c2

Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __ from the name - this is a normal interface. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: convert journal replay ptrs to darray

9835aa3

Eliminates some error paths - no longer have a hardcoded BCH_REPLICAS_MAX limit. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: fix split brain message

f7a189a

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: improve journal entry read fsck error messages

ff5631e

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: jset_entry_datetime

9a5bcbf

This gives us a way to record the date and time every journal entry was written - useful for debugging. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: bio per journal buf

eba16ff

Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: closure per journal buf

01d9103

Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: btree_and_journal_iter.trans

35bbaef

we now always have a btree_trans when using a btree_and_journal_iter; prep work for adding prefetching to btree_and_journal_iter Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Switch to uuid_to_fsid()

77c1a2f

switch the statfs code from something horrible and open coded to the more standard uuid_to_fsid() Signed-off-by: Kent Overstreet <[email protected]>

koverstreet force-pushed the master branch 3 times, most recently from e4b827e to b3a7824 Compare November 4, 2024 07:06

koverstreet force-pushed the master branch 3 times, most recently from 48c34ec to c7e3dd8 Compare November 15, 2024 01:20

koverstreet force-pushed the master branch 7 times, most recently from 8903f5f to bc01863 Compare November 30, 2024 02:27

koverstreet force-pushed the master branch 8 times, most recently from f1d736e to f22a971 Compare December 8, 2024 05:20

koverstreet force-pushed the master branch 6 times, most recently from 79c1539 to f8ed207 Compare December 12, 2024 07:29

koverstreet force-pushed the master branch 3 times, most recently from 379189f to c684780 Compare December 25, 2024 20:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Proposal] Change zstd compression levels mapping #643

[Proposal] Change zstd compression levels mapping #643

LRFLEW commented Jan 28, 2024 •

edited

Loading

boomshroom commented Jan 28, 2024

LRFLEW commented Jan 29, 2024

[Proposal] Change zstd compression levels mapping #643

Are you sure you want to change the base?

[Proposal] Change zstd compression levels mapping #643

Conversation

LRFLEW commented Jan 28, 2024 • edited Loading

boomshroom commented Jan 28, 2024

LRFLEW commented Jan 29, 2024

LRFLEW commented Jan 28, 2024 •

edited

Loading