Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix alignment issue of TensorData bytes #2416

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

WorldSEnder
Copy link
Contributor

@WorldSEnder WorldSEnder commented Oct 24, 2024

Checklist

  • Confirmed that run-checks all script has been executed.
    I have some issues with flaky tests, in particular quantize_dynamic_int8 and downsample_interpolation are off-by-1 in a few cases. I've ignored them locally, they seem to be fine in CI.
    Further, I've stumbled over a test failure seemingly related to my AMD GPU driver

ACO ERROR:
In file ../mesa-24.1.4/src/amd/compiler/aco_assembler.cpp:1056
Unsupported opcode: v4: %370:v[8-11] = v_wmma_f32_16x16x16_f16 (latekill)%369:v[36-43].xx, (latekill)%359:v[48-55].xx, %368:v[8-11].xx

  • Made sure the book is up-to-date with this PR. Not applicable, changes to internals only.

Related Issues/PRs

Fixes #2375.

Changes

Add a new opaque structure Bytes which holds on to tensor data. As explained in the linked issue, using a Vec<u8> is incorrect, as this will deallocate the memory with an incorrect alignment/memory layout given to the allocator, which violates the allocator contract. Instead, the structure remembers the alignment of the data and uses that when deallocating.

For serialization, only the bytes are written and no alignment is taken into account. Instead, when deserializing, the data is allocated with over-alignment (currently align_of::<u128>) which should be sufficient to interpret a slice of the data as a slice of some larger type, such as f32 etc.

Note that this means that serialization is not strictly a round-trip, which can show in the difference between how slices and Vec works:

  • to re-interpret a [u8] as a [E], the slice needs to be aligned at least as strictly as the alignment of E.
  • to re-interpet a Vec<A> as a Vec<B>, the alignments must be exactly equal. This means that a deserialized instance of Bytes possibly can not be converted via Bytes::try_into_vec. In case this is needed in the future, one would either need to get access to the needed alignment during deserialization or copy the data to a correctly aligned buffer. Such an API is possible but not included in this PR.

Testing

Some limited tests of TensorData has been run through miri manually to confirm that it doesn't complain after the patch. In case you want tests in the repo, I would need more guidance how to set that up, since miri should probably only run a limited number of tests for CI performance reasons.

introduce max alignment (which depends on platform anyway) and dont serialize that part
fixes Clone, Debug, and Eq impls to work on the bytes, not the pointers.
Copy link

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 84.31373% with 64 lines in your changes missing coverage. Please review.

Project coverage is 82.91%. Comparing base (fb05a55) to head (94724b6).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
crates/burn-tensor/src/tensor/bytes.rs 81.68% 63 Missing ⚠️
crates/burn-core/src/record/serde/data.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2416      +/-   ##
==========================================
- Coverage   82.94%   82.91%   -0.04%     
==========================================
  Files         814      816       +2     
  Lines      105306   105735     +429     
==========================================
+ Hits        87350    87674     +324     
- Misses      17956    18061     +105     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@WorldSEnder
Copy link
Contributor Author

Re Codecov report:

The missed lines are for the most part the Debug impl of Bytes. I think it makes sense to invest a bit more time into nice formatting there, any comments on what the output should be would be appreciated.

@laggui
Copy link
Member

laggui commented Nov 4, 2024

Sorry for the delayed response! We discussed this offline but forgot to get back to you.

First, thanks for identifying the potential issue.

While these changes to handle alignment concerns more generically could certainly provide better safety, it also adds complexity and might not be immediately necessary. That's why we're a bit hesitant on making this change for now. As you noticed this is not an issue for most allocators.

Of course that doesn't mean it is not an issue, just that it does not affect the majority of use cases. So we should still keep this in mind, and we might circle back in the future thanks to your investigation.

@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 4, 2024

While I understand the added complexity makes it difficult to review, I am a bit disappointed. There is no way around the fact that the currently implementation is misusing unsafe. miri rightfully complains about it. Yes, on the major platforms I have on my hand (linux and windows) the allocator plays along, yet burn-tensor is no_std. I do not have a ras-pi on my hands to test this, but the burn book mentions the embedded-alloc crate and that already seems much more fickle about the alloc requirements. After all, this is rust but I understand that you don't have the time to properly review this, with the ongoing work on cubecl and all. Don't read this as me trying to rush this in, I would also not be happy to see this merged without a proper review.

I only noticed this when I implemented the deserialization logic, but please note that the current impl will also happily allocate a Vec<u8> aligned to "just" a u8 even for element types of f32 or whatever you element type one has that will surprisingly and inconveniently fail to cast to a slice of its own element type for example here.

In any case, I suppose there is no problem with just leaving this open for the time being.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

@laggui
Copy link
Member

laggui commented Nov 4, 2024

Don't read this as me trying to rush this in, I would also not be happy to see this merged without a proper review.

No worries, we're on the same page here 🙂

I think our main concerns were more in terms of trade-offs, mostly time and possible ramifications to this breaking change (though I think my position has changed on that.. as you'll see below).

burn-tensor is no_std. I do not have a ras-pi on my hands to test this, but the burn book mentions the embedded-alloc crate and that already seems much more fickle about the alloc requirements

I'm glad you brought this up! Honestly I think I kind of forgot to take that into consideration.. this should be especially important for Burn as we aim for portability. I think our previous position was that the current implementation was not an issue for most modern processors and platforms, but that's not good enough.

I should really get a ras-pi myself and start messing around with Burn on no_std so we can improve this.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

I'm not too wary about this tbh.

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was opened to the change but did not think it was required for the short term. But now I think we should make sure that this is addressed (ideally, without performance impacts to de/serialization - so we should make sure to benchmark that for some practical use cases).

@WorldSEnder
Copy link
Contributor Author

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was opened to the change but did not think it was required for the short term.

No worries, I too think we are on the same page :) While strictly speaking violating the allocator contract like atm could lead to memory unsafety, I don't think any allocator actually does use the violated guarantees in a meaningful way (but I'm open to surprises). On the other hand, misaligned allocations would be a real concern but at least as far as I can tell mostly lead to panics, not undefined behaviour or violations of memory safety, so a relaxed approach to fixing this is fine 👍 Sorry if I came off as a bit panicky

@laggui
Copy link
Member

laggui commented Nov 8, 2024

@WorldSEnder I was looking at your PR yesterday and I think we'd like to stay away from a new opaque type for now.

I also benchmarked some solutions to preserve alignment and none of them come close to the current pointer re-interpret sadly.

I think we can keep both approaches instead based on some target architectures for now, since we know that heap allocations are overaligned on mainstream systems. And for others with stricter alignment rules, we fallback to something similar to what you implemented in the Bytes opaque type. We also add some documentation for anyone that might come across the code in the future 🙂

    /// Initializes a new tensor data structure from the provided values.
    fn init<E: Element, S: Into<Vec<usize>>>(
        mut value: Vec<E>,
        shape: S,
        dtype: DType,
    ) -> Self {
        // Ensure shape is valid
        let shape = shape.into();
        let shape_numel = Self::numel(&shape);
        value.truncate(shape_numel);
        let numel = value.len();
        assert_eq!(
            shape_numel, numel,
            "Shape {:?} is invalid for input of size {:?}",
            shape, numel,
        );

        let bytes = Self::to_bytes(value);

        Self {
            bytes,
            shape,
            dtype,
        }
    }

    fn to_bytes<E: Element>(value: Vec<E>) -> Vec<u8> {
        // Fast path for architectures with heap allocation overalignment.
        #[cfg(all(
            not(miri),
            any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")
        ))]
        {
            let mut value = value;
            let factor = std::mem::size_of::<E>() / std::mem::size_of::<u8>();
            let len = value.len() * factor;
            let capacity = value.capacity() * factor;
            let ptr = value.as_mut_ptr();

            std::mem::forget(value);

            // SAFETY: Practically, mainstream systems align heap allocations beyond
            // the minimum required alignment for the element type (e.g., 8 bytes on 64-bit
            // systems).
            //
            // This is not guaranteed by Rust's memory model, which follows strict alignment
            // requirements as specified by the type's layout.
            //
            // But we leverage the overalignment and efficient unaligned access to interpret
            // the same memory layout as individual bytes instead of `E` elements.
            // This simple trick makes this pointer cast acceptable (and requires no copy).
            unsafe { Vec::from_raw_parts(ptr as *mut u8, len, capacity) }
        }

        // Fallback for architectures with stricter alignment rules.
        #[cfg(any(
            miri,
            not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))
        ))]
        {
            let num_bytes = value.len() * std::mem::size_of::<E>();
            let mut bytes = vec![0u8; num_bytes];
            unsafe {
                std::ptr::copy_nonoverlapping(
                    value.as_ptr() as *const u8,
                    bytes.as_mut_ptr(),
                    num_bytes,
                );
            }
            bytes
        }
    }

I also added miri to the conditions to make sure it doesn't complain 😅 let me know how you feel about this.

@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 8, 2024

@laggui I think in the long run, an opaque wrapper should happen at some point (note the wrapper in this PR does no mem-copy so should not have the overhead you mention). A publicly exposed field of Vec<u8> lets the user mem::swap in any Vec<u8> that they may have constructed themselves. I'd want to double check if that entails that the storage has to be aligned if the system allocator always aligns it, my current assumption would be that it entails more weirdness that an opaque wrapper struct would cover.

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used, and the check is of course only detecting the target_*, not whether System is being used as the #[global_allocator] and as far as I know there is no way to detect which allocator ends up being used in the end as the forwarding Global does not expose this.
Note for future reference: It seems the choice of default system allocator is made in the std in here if we want to go the route of just using system information as a proxy.
Another comment would be to just let miri complain. This would still be technically violate the allocator contract and since there is no way to check which allocator is used, I would still see it as unsound. So let miri panic and don't paint over the wound.

What exactly is the concern with an opaque struct? It seems there would be a simple almost-zero-cost conversion from Vec<u8> to TensorData, a Deref<Target=[u8]> impl and reasonably efficient conversion back to Vec<u8> (though that might entail a mem-copy or the downside of possibly-undefined-behaviour-maybe-just-panicking.

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

@laggui
Copy link
Member

laggui commented Nov 11, 2024

A publicly exposed field of Vec lets the user mem::swap in any Vec that they may have constructed themselves. I'd want to double check if that entails that the storage has to be aligned if the system allocator always aligns it, my current assumption would be that it entails more weirdness that an opaque wrapper struct would cover.

Yeah I think we could (should) remove the public field and users could only get the slice with the current available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used, and the check is of course only detecting the target_*, not whether System is being used as the #[global_allocator] and as far as I know there is no way to detect which allocator ends up being used in the end as the forwarding Global does not expose this.

That's what I found as well. Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%. I wasn't sure if a new type was warranted at this time but I now realize that it might be required to do this cleanly.

Another comment would be to just let miri complain. [...] So let miri panic and don't paint over the wound.

Oh I agree! This was mostly for the current tests with miri 😅

What exactly is the concern with an opaque struct?

An opaque struct like Bytes is a nice to encapsulate and control the alignment details, but it introduces a breaking change with potential overhead (and complexity, though in its current form it is fairly simple and contained). I think my main concern is/was with potential redundancy since the alignment provided by Vec<u8> is sufficient on common platforms with Rust's default allocator. I thought we might be able keep away from the new opaque struct to avoid the breaking change and deal with the alignment requirements when necessary, though I'm not sure if this is feasible for 100% coverage (and I am not entirely informed on the different allocators). I'm bouncing back and forth between the two positions a bit (as with any trade-off, thanks for the constructive feedback / discussion in this thread btw!)

Also tiny note on the current Bytes implementation: what about sending between threads? Might need to be more careful (as opposed to Vec<u8>).

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

Ran some benchmarks for TensorData creation from existing floating point data (32 x 512 x 1024) and deserialization (12 linear layers with shape [2048, 2048]) on both the current state and with your changes.

The results seem to be on par with the current, which is great 🙏 I think I recalled incorrectly that there was a copy in from_elems without looking back at the code, so I was expecting a slowdown. But that's not the case (it's in another path).

Current

Benchmark Feature Backend Device Median
from_data ndarray ndarray Cpu 95.000µs
load_record_lazy ndarray ndarray Cpu 105.000µs
load_record_manual ndarray ndarray Cpu 0.000ns
load_record_sync ndarray ndarray Cpu 253.399ms

This PR

Benchmark Feature Backend Device Median
from_data ndarray ndarray Cpu 96.000µs
load_record_lazy ndarray ndarray Cpu 112.000µs
load_record_manual ndarray ndarray Cpu 0.000ns
load_record_sync ndarray ndarray Cpu 269.023ms

@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 11, 2024

Yeah I think we could (should) remove the public field and users could only get the slice with the current available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

I was wondering if it was intentional that you could push further bytes after construction, cause that seems a bit weird since you could push e.g. a single u8 even when the TensorData contains u16 etc. Good to hear that this was never intended, thought I missed something :D

Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%.

After more investigation, and a bit longer look than I had wished, I found that the default allocator on windows actually does use the passed align in a non-trivial way. Specifically, data below MIN_ALIGN is just allocated as-is, but anything beyond that will allocate additional space and contain a header with a pointer to the actual allocation, pictogram:

x ... unused bytes
u ... user data

 v --- aligned to MIN_ALIGN, this pointer is allocated from HeapAlloc
 | x x x x Header(ptr) | u u u u u u u ..... |
 ^----------------^-^
                       ^  aligned to requested alignment, this pointer is returned to user-code

The reason the current code gets away with it - supposedly - is because on 64-bit platforms MIN_ALIGN is 16, which is larger than the data types that currently get put into a tensor. But it's awefully brittle and it just feels like corner-case after corner-case to find these kind of "exceptions" to the allocator playing nicely along to wrong alignment.

it introduces a breaking change

Yeah, the timing close to the release of 0.15 is a bit unfortunate. Not sure about your planned release cycles, I wouldn't mind holding it until the breaking 0.16 comes along naturally.

Also tiny note on the current Bytes implementation: what about sending between threads?

Thread-safety is indeed a good point. The current Bytes should be Send + Sync without a problem since Element: Send + Sync and &Bytes offering only read access, but that could be encoded in some static asserts and tests.

The results seem to be on par with the current, which is great 🙏

There seems to be a slight slowdown in deserialization, which might perform an extra copy to an aligned buffer. I think in that case we can get away with actually checking the runtime alignment instead (still storing and de-allocating with the alignment of the actual buffer) and not copying/only copying when someone wants to convert it an actual Vec<E>. I will take another look 👍

this already improves readability a bit by separating out alloc/dealloc logic
and adding a bunch of safety comments and better error messages
borrowing from the deserializer will not save a copy, and is moreover
inefficient when we could take ownership of an existing byte buffer
both changes only target miri's borrowing semantics, oprationally
the pointers are the same, but they obey different borrow-stack rules.
@WorldSEnder
Copy link
Contributor Author

Alright, have added both suggestions (Sync and Send + avoiding a copy if the deserialized vector ends up being aligned). I have added further tests and these can be run under miri. In case you want to do that yourself, try it with

cargo +nightly miri test --package burn-tensor --lib -- tensor::bytes::tests --show-output

@WorldSEnder
Copy link
Contributor Author

@laggui I wanted to merge the changes from quantization and in the process came up with a few questions where I'm not sure what's intended and what is not, again about serialization.

There should be no difference in this PR and the state on main, but (de)serialization seems to assume that data in storage is in system byte order. I don't see any conversion from/to network byte order. So serializing a tensor on a little endian (LE) system such as x86, then deserializing on a BE system (well mostly SPARC; but fairly uncommon I think) would shuffle the bytes and not recover the data as intended. Further network order is big endian (which is lame, as most systems have to perform a byte shuffle for what are merely historic reasons). So the question is this: Since serialization is performed as a byte sequence where endianess matters, is the current state just assuming little endian everywhere and not correcting for it correct?

If the answer to the above is yes, and there is no external constraints on the byte order, I don't see why the byte packing (which does seem to pad with 0s, then reinterpret as a Vec of big endian u32) could not just use the system endianess and pad then reinterpret without endianess conversion and do so in place. The current implementation seems to create 2 extra Vecs. One in pack_i8s_to_u32s, then another in to_vec().

In any case, I will implement a fn extend_from_slice for Bytes to accomodate storing the quantization strategy there.

A lot of methods seem to ignore the difference between values_as_bytes and as_bytes and try to make use of the additional quantization data, too (such as iter, to_vec). Is that intentional?

@laggui
Copy link
Member

laggui commented Nov 16, 2024

@laggui I wanted to merge the changes from quantization and in the process came up with a few questions where I'm not sure what's intended and what is not, again about serialization.

There should be no difference in this PR and the state on main, but (de)serialization seems to assume that data in storage is in system byte order. I don't see any conversion from/to network byte order. So serializing a tensor on a little endian (LE) system such as x86, then deserializing on a BE system (well mostly SPARC; but fairly uncommon I think) would shuffle the bytes and not recover the data as intended. Further network order is big endian (which is lame, as most systems have to perform a byte shuffle for what are merely historic reasons). So the question is this: Since serialization is performed as a byte sequence where endianess matters, is the current state just assuming little endian everywhere and not correcting for it correct?

Yep, we don't check for endianness atm so we assume everything is fine for now. Might have to check in the future.

If the answer to the above is yes, and there is no external constraints on the byte order, I don't see why the byte packing (which does seem to pad with 0s, then reinterpret as a Vec of big endian u32) could not just use the system endianess and pad then reinterpret without endianess conversion and do so in place. The current implementation seems to create 2 extra Vecs. One in pack_i8s_to_u32s, then another in to_vec().

Funny you bring this up, I realized the other day while making some changes that the "packing" order might not be as optimal. Quantization is still WIP but we were just about to refactor the quantization kernels in jit along with the representation. Learning some things as we go :)

In any case, I will implement a fn extend_from_slice for Bytes to accomodate storing the quantization strategy there.

A lot of methods seem to ignore the difference between values_as_bytes and as_bytes and try to make use of the additional quantization data, too (such as iter, to_vec). Is that intentional?

In iter we chose not to dequantize the values so it is intentional for the current behavior. Doesn't make as much sense in to_vec, we should change that.

@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 16, 2024

Yep, we don't check for endianness atm so we assume everything is fine for now. Might have to check in the future.

Ah okay! If you do in the future, that could be a good spot to move the assumption of a MAX_ALIGN to something more dtype dependent, as I would assume that converting byte order also needs to know about the underlying datatype, which could in one swoop also fix the runtime alignment (try_enforce_runtime_align in this PR).

@laggui
Copy link
Member

laggui commented Nov 18, 2024

I was wondering if it was intentional that you could push further bytes after construction, cause that seems a bit weird since you could push e.g. a single u8 even when the TensorData contains u16 etc.

I think there might be some use cases for that actually.. I discussed with @nathanielsimard last week and iirc this was one of the reasons for keeping it as Vec<u8> at the moment. Correct me if I'm wrong, but this was for some work in burn-remote? Can't remember exactly what you mentioned so maybe you could chime in here with your points.

@nathanielsimard
Copy link
Member

I don't recall a reason to push u8 into the vector.

@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 19, 2024

@syl20bnr I believe https://github.com/tracel-ai/github-actions/blob/b87057a7f68360d9d01a8da21de874c38e7acbab/setup-rust/action.yml#L23-L26 is accidentally run on windows machines as well, leading to tests failing there.

@syl20bnr
Copy link
Member

Argh, thank you, fixing it.

@syl20bnr
Copy link
Member

@WorldSEnder should be fixed, I restarted the job.

The new Allocation struct keeps the raw allocation and its layout,
the Bytes struct wraps an Allocation and asserts that len bytes of it are initialized
@WorldSEnder
Copy link
Contributor Author

WorldSEnder commented Nov 19, 2024

Okay, implemented Bytes::extend_from_byte_slice to support appending quantization data. This means TensorData::quantized should use at least one less vec copy in most cases. Note that pack_i8s_to_u32s could probably be done in place, but that's left for another PR.

I realized that most methods that use the bytes, such as the aforementioned as[_mut]_slice and to_vec/into_vec refuse on the grounds of mismatching datatypes ("normal" vs quantized) before trying to touch the unrelated quantization data. Perhaps this could be a deliberate choice telling users to unquantize before being allowed to use these methods anyway.

As a side-effect of working Bytes into quantized, this now has a length check (unified with new), I suppose this was meant to be there in the first place. Let me know if I missed anything :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong usage of unsafe Vec::from_raw_parts in TensorData
4 participants