Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAS-132432 / 24.10.1 / Cherry-pick important zfs commits from upstream to stable/electriceel for 24.10.1 freeze #259

Merged
merged 25 commits into from
Nov 14, 2024

Conversation

ixhamza
Copy link

@ixhamza ixhamza commented Nov 14, 2024

Motivation and Context

Since the 24.10.0 freeze, several important fixes and improvements have been made in upstream ZFS. Only the most relevant changes have been cherry-picked after discussion with @amotin.
Jira Ticket: NAS-132432.

Description

The following commits have been cherry-picked from upstream ZFS:

  1. Fix an uninitialized data access: Fixes possible kernel data leak.
  2. Remove extra newline from spa_set_allocator(): Fixes zfs dbgmsg formatting.
  3. Avoid fault diagnosis if multiple vdevs have errors: Improves handling for controller errors.
  4. arcstat: add structural, types, states breakdown: Add more data to arcstat for performance analysis.
  5. zio_compress: introduce max size threshold: Earlier abort compression on incompressible data.
  6. ZLE compression: don't use BPE_PAYLOAD_SIZE: Fix to the previous commit.
  7. arc_hdr_authenticate: make explicit error: Better data re-compression error handing for crypto authentication.
  8. Evicting too many bytes from MFU metadata: Improve ARC MRU/MFU balance.
  9. Properly release key in spa_keystore_dsl_key_hold_dd(): Fixes assertion related to encryption key handling.
  10. Restrict raidz faulted vdev count: Excludes a replacing vdev child from DTL fault assessments.
  11. ARC: Cache arc_c value during arc_evict(): Further improve ARC MRU/MFU balance.
  12. Fix generation of kernel uevents for snapshot rename on linux: Resolves name generation in /dev tree when snapshot is renamed.
  13. zpool/zfs: allow --json wherever -j is allowed: CLI improvement.
  14. Pack dmu_buf_impl_t by 16 bytes: Reduces ARC memory consumption in small block workloads.
  15. On the first vdev open ignore impossible ashift hints: Prevents use of ashift values impossible for the device.
  16. vdev_disk: try harder to ensure IO alignment rules: Enforces strict I/O alignment for compatibility with dm-crypt.
  17. vdev_disk: move abd return and free off the interrupt handler: Avoid freeing abd buffer in the interrupt handler.
  18. Added output to zpool online and offline: Adds console output if zpool online/zpool offline fail.
  19. Verify parent_dev before calling udev_device_get_sysattr_value: Prevents potential assertions in zed.
  20. Use simple folio migration function: Resolves a bug in Linux page migration introduced from kernel 6.3.0.
  21. JSON: fix user properties output for zfs list: See NAS-132143.
  22. JSON: fix user properties output for zpool list: See NAS-132349.
  23. Fix user properties output for zpool list: See NAS-132349.
  24. L2ARC: Move different stats updates earlier: Fixes possible assertions in L2ARC space accounting.
  25. zvol_os.c: Increase optimal IO size: See NAS-132497.

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

asomers and others added 25 commits November 11, 2024 22:29
zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack
zfs_acl_chmod only partially initializes it.  KMSAN reported that this
memory remained uninitialized at the point when it was read by
lzjb_compress, which suggests a possible kernel memory disclosure bug.

The full KMSAN warning may be found in the PR.
openzfs#16511

Signed-off-by:	Alan Somers <[email protected]>
Sponsored by:	Axcient
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
zfs_dbgmsg() does not need newline at the end of the message.

While there, slightly update/sync FreeBSD __dprintf().

Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16536
When multiple drives are throwing errors, it is likely not
a drive failing but rather a failure above the drives, like
a controller.  The active cases context of the drive's peers
is now considered when making a diagnosis.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#16531
Add ARC structural breakdown, ARC types breakdown, ARC states
breakdown similar to arc_summary.  Additional cleanups included.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Theera K. <[email protected]>
Closes openzfs#16509
Now default compression is lz4, which can stop
compression process by itself on incompressible data.
If there are additional size checks -
we will only make our compressratio worse.

New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.

Old 12.5% threshold is left to minimize affect
on existing user expectations of CPU utilization.

If data wasn't compressed - it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything -
we can just compress it with zero threshold.
So, we don't need a new feature flag here!

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes openzfs#9416
ZLE compressor needs additional bytes to process
d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it
before we rework zle compressor somehow.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes openzfs#9416
On compression we could be more explicit here for cases
where we can not recompress the data.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Co-authored-by: Alexander Motin <[email protected]>
Signed-off-by: George Melikov <[email protected]>
Closes openzfs#9416
Without updating 'm' we evict from MFU metadata all that we wanted
to evict from all metadata, including already evicted MRU metadata
('m' is the total amount of metadata we had at the beginning,
and 'w' is the total amount of metadata we want to have). 

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Theera K. <[email protected]>
Closes openzfs#16521
Closes openzfs#16546
Since dsl_crypto_key_open() references the key, 0d23f5e should
have called dsl_crypto_key_rele() to drop it first instead of
calling dsl_crypto_key_free() directly.  The final result should
actually be the same, but without triggering dck_holds assertion.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16567
Specifically, a child in a replacing vdev won't count when assessing
the dtl during a vdev_fault()

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#16569
Since arc_evict() run can take some time, arc_c change during it
may result in undesired shift in ARC states balance. Primarily in
case of arc_c reduction it may cause eviction from MFU data state
despite its being below the target already.  Instead we should
evict as originally planned and if needed do another round after.

Reviewed-by: Theera K. <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16576
Closes openzfs#16605
`zvol_rename_minors()` needs to be given the full path not just the
snapshot name.  Use code removed in a0bd735 as a guide
to providing the necessary values.

Add ZTS check for /dev changes after snapshot rename.  After
renaming a snapshot with 'snapdev=visible' ensure that the /dev
entries are updated to reflect the rename.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: James Dingwall <[email protected]>
Closes openzfs#14223 
Closes openzfs#16600
Mostly so that with the JSON formatting options are also used, they all
look the same. To my eye, `-j --json-flat-vdevs` suggests that they are
different or unrelated, while `--json --json-flat-vdevs` invites no
further questions.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Umer Saleem <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16632
On 64bit FreeBSD this reduces one from 296 to 280 bytes.  On small
block workloads dbufs may consume gigabytes of ARC, and this saves
5% of it.

Reviewed-by: Tino Reichardt <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16684
If on the first open device's logical ashift is bigger than set
by pool's ashift property, ignore the last as unusable instead of
creating vdev that will fail most of I/Os due to misalignment.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by:  Alexander Motin <[email protected]>
Sponsored by:   iXsystems, Inc.
Closes openzfs#16690
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
Freeing an ABD can take sleeping locks to update various stats. We
aren't allowed to sleep on an interrupt handler. So, move the free off
to the io_done callback.

We should never have been freeing things in the interrupt handler, but
we got away with it because we were usually freeing a linear ABD, which
at most is returning two objects to a cache and never sleeping. Scatter
ABDs can be used now, and those have more complex locking.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16687
I was surprised to discover today that `zpool online` and
`zpool offline` don't print any information about why they failed in
many cases, they just return 1 with no information about why.

Let's improve that where we can without changing the library function.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes openzfs#16244
Not all udev devices have parent devices.
Calling udev_device_get_ functions yield an assertion error
if called with a NULL pointer.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Sietse <[email protected]>
Co-authored-by: Sietse <[email protected]>
Closes openzfs#16705 
Closes openzfs#16717
Avoids using fallback_migrate_folio, which starts unnecessary writeback
(leading to BUG in migrate_folio_extra).

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: tstabrawa <[email protected]>
Closes openzfs#16568
Closes openzfs#16723
This commit fixes JSON output for zfs list when user properties are
requested with -o flag. This case needed to be handled specifically
since zfs_prop_to_name does not return property name for user
properties, instead it is stored in pl->pl_user_prop.

Reviewed-by: Ameer Hamza <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes openzfs#16732
This commit fixes JSON output for zpool list when user properties are
requested with -o flag. This case needed to be handled specifically
since zpool_prop_to_name does not return property name for user
properties, instead it is stored in pl->pl_user_prop.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes openzfs#16734
In zpool_get_user_prop, when called from zpool_expand_proplist and
collect_pool, we often have zpool_props present in zpool_handle_t equal
to NULL. This mostly happens when only one user property is requested
using zpool list -o <user_property>. Checking for this case and
correctly initializing the zpool_props field in zpool_handle_t fixes
this issue.

Interestingly, this issue does not occur if we query any other property
like name or guid along with a user property with -o flag because while
accessing properties like guid, zpool_prop_get_int is called which
checks for this case specifically and calls zpool_get_all_props.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Umer Saleem <[email protected]>
Closes openzfs#16734
..., before we make the header or the log block visible to others.
It should fix assertion on allocated space going negative if the
header is freed once the lock is dropped, while the write is still
going.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#16040
Closes openzfs#16743
Since zvol read and write can process up to (DMU_MAX_ACCESS / 2) bytes
in a single operation, the current optimal I/O size is too low. SCST
directly reports this value as the optimal transfer length for the
target SCSI device. Increasing it from the previous volblocksize results
in performance improvement for large block parallel I/O workloads.

Signed-off-by: Ameer Hamza <[email protected]>
@bugclerk bugclerk changed the title Cherry-pick important zfs commits from upstream to stable/electriceel for 24.10.1 freeze NAS-132432 / 24.10.1 / Cherry-pick important zfs commits from upstream to stable/electriceel for 24.10.1 freeze Nov 14, 2024
Copy link
Collaborator

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've edited the commit notes in the PR.

@amotin amotin merged commit 51254f1 into stable/electriceel Nov 14, 2024
15 of 16 checks passed
@amotin amotin deleted the NAS-132432 branch November 14, 2024 14:35
@bugclerk
Copy link

This PR has been merged and conversations have been locked.
If you would like to discuss more about this issue please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Nov 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.