Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What about: volatile, concurrency, and interaction with untrusted threads #152

Open
hsivonen opened this issue Jun 25, 2019 · 127 comments
Open
Labels
A-memory Topic: Related to memory accesses C-open-question Category: An open question that we should revisit Triaged Visited during a backlog bonanza meeting

Comments

@hsivonen
Copy link
Member

hsivonen commented Jun 25, 2019

Designate outside-of-program changes to memory accessed by volatile as non-UB

Context: An internals thread.

Use case: Rust is used to write privileged code (host services provided by a runtime environment to a JITed language, OS kernel providing syscall services to userland code, or a hypervisor providing emulated devices to a guest system) that needs to access memory that also unprivileged code can access and the unprivileged code can have multiple threads such that while unprivileged thread A has requested a service from the host such that the host service is running logically on A's thread of execution, a separate unprivileged thread of execution B could, if it is behaving badly, concurrently access the same memory from another CPU core. The unprivileged thread of execution must not be allowed to cause the privileged code written in Rust to experience UB. (It's fine for the unprivileged code to cause itself to experience UB within the bounds of its sandbox.)

The memory model itself is a whole-program model, so it doesn't apply, since in order to provide the guarantees it pledges to our thread, we must pledge the absence of data races from other threads of execution, which we can't do in this case. Hence, we need a way to access memory that is outside the memory model in the sense that there could exist an adversarial additional thread of execution that doesn't adhere to the DRF requirement. We're not trying to communicate with that thread of execution. The issue is just not letting it cause security bugs on us.

The C++ paper P1152R0 "Deprecating volatile" gives this use case as the very first item on its list of legitimate uses of volatile in C and C++. This makes sense, since if volatile works when external changes are caused by memory-mapped IO (the use case documented for std::ptr::read_volatile and the original use case motivating the existence of volatile in C), given the codegen for volatile and codegen for relaxed atomics on architectures presently supported by Rust, it makes sense for it to also work also when external changes are caused by a rogue thread of execution.

Yet, the documentation says: "a race between a read_volatile and any write operation to the same location is undefined behavior". I believe it's unnecessary and harmful to designate this as UB and it would be sufficient to merely say that the values returned by read_volatile are unpredictable in that case. This makes sense in the light of an IO-like view of volatile: You need to be prepared to receive any byte from an IO stream, so not knowing at compile time what you are going to get does not have to be program-destroying UB if you are prepared to receive value not predicted at compile time.

I suggest that a) the documentation be changed not to designate concurrent external modification of memory locations that a Rust program only accesses as volatile to be UB and b) to state in the Unsafe Code Guidelines that it's legitimate to use volatile accesses to access memory that a thread of execution external to the Rust program might change concurrently. That is, while you may read garbage, the optimizer won't assume that two volatile reads from the same location yield the same value and won't invent reads from memory locations written to using volatile writes (i.e. the memory locations are considered shared and, therefore, ineligible to be used as spill space by the compiler).

Replies to the thread linked to above indicate that this should already be the case despite the documentation suggesting otherwise.

Also see #152 (comment) which tries to summarize the discussion-until-then a bit.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 25, 2019

I don't think we need to distinguish between internal and external threads of execution. A thread of execution is just that, and the concurrent semantics of volatile should be specified for those in general.

Volatile loads and stores are not guaranteed to be atomic (and synchronize in any particular way), that is, using them to concurrently read/write to some memory could introduce a data-race.

Suppose you have an [u16] in shared memory, and two threads of execution modify it by writing either 0x00 or 0xff to it. If you use properly synchronized memory accesses, you are guaranteed to always read either 0x00 or 0xff from the array in any of the processes. If you use volatile loads / stores, one thread of execution could read 0xf0 or 0x0f. This can easily result in mis-optimizations even if the two threads of execution are run in different processes, e.g. if the program uses a type that properly expresses these semantics (e.g. a NonZero-like type), hint::assume annotations, etc.

Now, when people use volatile loads / stores to access IO registers, they exploit the platform-specific knowledge that the volatile load / store is atomic for that particular register or memory access size on that particular architecture.

For example, when reading a temperature from a sensor by using a volatile load on a 32-bit register, the hardware often guarantees atomicity for memory loads, making it impossible to observe a partially modified result. Here you can get away with a volatile load because the toolchain and the hardware conspire together to avoid a data-race.

We could document that whether volatile loads and stores are atomic and synchronize in any particular way for a particular memory addresses, memory access sizes, target, target features, target cpu, etc. is unspecified. That is, implementations are not required to document this behavior, and the volatile loads and stores are not guaranteed to be atomic. If misuse introduces a data-race, the behavior is undefined.

AFAICT, what we cannot do is guarantee that what you are trying to accomplish always works (for all Ts, on all targets, etc.).

@HadrienG2
Copy link

HadrienG2 commented Jun 25, 2019

I think one additional thing which @hsivonen needs is for data races to produce sane effects ("garbage data" and that's it) when the race occurs on a "bag of bits" data type which has neither padding nor invalid bit patterns, such as u8, u16...

It's a topic that comes up frequently in unsafe code discussions though, so perhaps there's already an UCG topic on this kind of matter that we can study for prior art.

@HadrienG2
Copy link

HadrienG2 commented Jun 25, 2019

A related issue, which is mostly of theoretical interest in multi-process scenarios as current implementations don't really have another option than to do the right thing, is that an adversarial thread writing invalid data into the shared memory space (e.g. mem::uninitialized::<u8>()) should not trigger UB in a privileged thread reading from it with proper precautions, given again a restriction to data types for which all bit patterns are valid.

@comex
Copy link

comex commented Jun 25, 2019

I think volatile should be documented and specified along these general lines:

  • Unlike almost everything else in the language, volatile defers to hardware semantics. volatile_load means "emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

  • Accordingly, any properties involving atomicity, memory ordering, or handling of uninitialized data are guaranteed iff the underlying architecture guarantees them. Rust does not require its supported architectures to make any such guarantees; an architecture could, e.g., make racing volatile accesses undefined behavior, although no current Rust architectures do so. On the other hand, when compiling for an architecture that does provide guarantees, the compiler will not perform optimizations that break code that relies on those guarantees.

[What about miri?]

@comex
Copy link

comex commented Jun 25, 2019

(Sorry for double post -)

But I also think we should have a better answer for shared memory than volatile.

In particular, as discussed in the internals thread, we may want to guarantee that the UB caused by races between atomic and non-atomic accesses, if the accesses are in different processes, only affects the process performing the non-atomic access. In other words, you can safely use atomic accesses on shared memory even if your communication partner might be malicious – at least with regards to that particular source of UB. That seems like a reasonable guarantee.

On the other hand, there are other, more plausible ways that an architecture could hypothetically break this sort of IPC. For example, it could give each byte of memory an "initializedness" status, such that if process A writes uninitialized data to memory and process B reads it, process B gets an uninitialized value and traps if it tries to use it. (Note that Itanium does not do this; it tracks initializedness for registers, but not for memory.)

@HadrienG2
Copy link

HadrienG2 commented Jun 25, 2019

In an ideal world, there would be a way for the untrusting thread to state "I know that I might be reading uninitialized or otherwise fishy memory, please let me do so and return arbitrary bytes on incorrect usage".

Kind of like the ptr::freeze that was proposed there, but with transactional semantics to account for the racy nature of the situation, i.e. fn(*mut T) -> T.

However, I have no idea how to make that actually interact correctly with emulations of hardware that can track use of uninitialized memory but provide no way for software to announce voluntary access to uninitialized memory, such as valgrind.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 25, 2019

"emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

I find that using "emit exactly one load instruction" to denote potentially many actual hardware load instructions confusing.

@comex
Copy link

comex commented Jun 25, 2019

Feel free to improve the wording :)

I mean of course that there is one load instruction per volatile_load call.

Though even that is slightly imprecise. The compiler can still inline or otherwise duplicate code, in which case a given binary could contain multiple locations corresponding to a single call to volatile_load in the source code, each of which would have its own load instruction.

Perhaps it's better to say: Any given execution of volatile_load will be performed by executing a single load instruction.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 25, 2019

@comex

I mean of course that there is one load instruction per volatile_load call.

This:

#[no_mangle] pub unsafe fn foo(x: *mut [u8; 4096]) -> [u8; 4096] { x.read_volatile() }

generates ~16000 instructions on godbolt. I don't know of any architecture in which this could actually be lowered to exactly one load instruction.

@petrochenkov
Copy link

[u8; 4096]

I assume @comex would expect this to be written as a loop.
If this model is followed, it would actually be nice to report a warning/error from the backend if the read_volatile/write_volatile cannot be lowered into a single load/store with the given size and alignment.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 25, 2019

@petrochenkov How would this be implemented?

No matter how I look at this, I see a lot of issues.

If we guarantee that (*mut T).read_volatile() either lowers to a single instruction or compilation fails, unsafe code will rely on that for safety, so this would need to be a guaranteed compilation error.

One issue is that we can only emit this error at monomorphization time. I'm not sure how we could fix that.

Another issue is that this would be a breaking change, but I suppose that we could either add new APIs and deprecate the old ones, or somehow emit this error only in newer editions (these operations are almost intrinsics).

I wonder how the implementation would look like. AFAIK only the LLVM target backend can know to how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic. I suppose we could generates "tests" during compilation, in which we count the instructions, but that seems brittle.

Then I wonder how this could work on Wasm SharedArrayBuffer. Even if we lower a volatile load to a single WASM instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

@comex
Copy link

comex commented Jun 25, 2019

@gnzlbg

Good point – volatile accesses of types that don't correspond to machine register types are somewhat ill-defined, AFAIK. But, e.g.

#[no_mangle] pub unsafe fn foo(x: *mut u64) -> u64 { x.read_volatile() }

should definitely be guaranteed to be a single load instruction on x86-64; same goes for smaller integer types.

Of course there's limited room to make decisions here since we're largely at the mercy of LLVM, but I'd say the rule is roughly "if the 'obvious' way to translate this load is with a single instruction, it has to be a single instruction".

Personally, I'd prefer if volatile produced a compile error in other cases, but that ship has sailed.

The rule gets less clear when SIMD gets involved. x86-64 can perform 128-bit loads via SIMD, and currently, a volatile_load of packed_simd::f32x4 does indeed expand to such an instruction:

	movaps	(%rdi), %xmm0

On the other hand, a volatile_load of u128 expands to two 64-bit regular loads:

	movq	(%rdi), %rax
	movq	8(%rdi), %rdx

I'd say this is defensible, because even if the architecture has some way to perform a 128-bit load, that's not the same as there being an 'obvious' way to load a 128-bit integer. On the other hand, with f32x4 we are explicitly requesting SIMD, so it's arguably reasonable to guarantee that it uses an appropriate SIMD instruction.

But in any case, it doesn't really matter whether a SIMD load is guaranteed, since movaps is not guaranteed to be atomic at an architectural level, so the difference between the two is not really observable.

@comex
Copy link

comex commented Jun 25, 2019

Then I wonder how this could work on Wasm SharedArrayBuffer. Even if we lower a volatile load to a single WASM instruction, the machine code generator might lower that into multiple loads depending on the host, and if that's the case, there is no way for it to report an error.

I'd say that the 'architecture' here is Wasm, not whatever it's ultimately compiled into. Just as x86-64 has 128-bit load instructions that aren't guaranteed to be atomic, Wasm apparently doesn't guarantee that loads of any size are atomic, unless you use the special atomic instructions. But that's fine; it's already established that the set of guarantees provided depends on the architecture.

However, Wasm is arguably a motivating use case for exposing additional intrinsics for loads and stores marked both atomic and volatile (which LLVM supports).

@petrochenkov
Copy link

petrochenkov commented Jun 25, 2019

AFAIK only the LLVM target backend can know to how many instructions the load lowers to, and requiring this kind of cooperation from all LLVM backends (and Cranelift) seems unrealistic.

That's exactly what I meant by "from the backend" - a target-specific LLVM backend, before that point no one knows about instruction specifics.
I don't know whether LLVM has necessary infrastructure or not, I'm just saying that having it would be useful.
(This certainly shouldn't be any kind of language-level guarantee, only a lint-like diagnostic, even if it's highly reliable.)

@comex
Copy link

comex commented Jun 26, 2019

I'm a little confused what the argument is about, but to be clear – for MMIO to work correctly, volatile loads and stores of basic integer types (specifically, ones that can fit into a register) must be done using a single load/store instruction of the correct width. So the behavior with those types needs to be a language-level guarantee. That should probably also apply to #[repr(transparent)] wrappers around those types.

The behavior of volatile with other types, as I said, is relatively ill-defined. In most cases it does seem fairly likely to be a mistake – e.g. if the [u8; 4096] example were in real code, it would be unlikely that the author actually meant to generate that giant function that copies the bytes one-by-one. Or if, say, someone used a struct to represent a block of MMIO registers, they might accidentally write my_struct.volatile_load().field instead of (&raw my_struct.field).volatile_load(). So it might make sense to produce some sort of diagnostic. On the other hand, I could imagine volatile accesses with large types being done intentionally for the shared-memory use case.

@Lokathor
Copy link
Contributor

(Speaking as one of the two "rust on the GBA" devs) Integer types: yes. Transparent types: absolutely must also be yes. For MMIO to be an approachable issue you have to be able to newtype all the integers involved.

@Centril
Copy link
Contributor

Centril commented Jun 26, 2019

  • Unlike almost everything else in the language, volatile defers to hardware semantics. volatile_load means "emit exactly one load instruction which cannot be removed or reordered, and don't make assumptions about the loaded value", but the meaning of a "load instruction" is implementation-defined (and in practice architecture-dependent).

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

@hsivonen
Copy link
Member Author

Volatile loads and stores are not guaranteed to be atomic (and synchronize in any particular way), that is, using them to concurrently read/write to some memory could introduce a data-race.

It's unclear to me if you are using "atomic" and "data-race" colloquially or as specific memory model terms. For my use case, I don't need colloquial atomic: that is, I don't need indivisibility. In particular, I want to do lockless unaligned SIMD loads/stores, and I don't care if they tear in the presence of an adversarial second thread. However, I need "atomic" in the memory model sense that colloquially there is a data race but it must not be a "data race" for the purpose of "data races are UB": Just like relaxed atomics race in practice but that race is defined not to constitute a "data race" for the purpose of "data races are UB".

Suppose you have an [u16] in shared memory, and two threads of execution modify it by writing either 0x00 or 0xff to it. If you use properly synchronized memory accesses, you are guaranteed to always read either 0x00 or 0xff from the array in any of the processes. If you use volatile loads / stores, one thread of execution could read 0xf0 or 0x0f.

This is fine for my use case. My use case needs to read or write sensible values only in the single-threaded scenario. If there's another thread, the other thread is an error on the part of the unprivileged code and adversarial from the point of view of the privileged code, at which point I'm fine with the unprivileged code UBing itself and getting garbage results from the host service. I don't want it to UB the privileged host service code: the privileged code may see garbage values that are even inconsistent garbage between two reads from the same memory location, but there must not be optimizations that would introduce security bugs that were not in the source code if source code had no security bugs if every load behaved like a call to a random number generator. (An example of a compiler-introduced security bug would be an elision bound checks on the assumption that two loads from the same memory location yield consistent values.)

This can easily result in mis-optimizations even if the two threads of execution are run in different processes, e.g. if the program uses a type that properly expresses these semantics (e.g. a NonZero-like type), hint::assume annotations, etc.

For clarity, I only intend to read types for which all bit patterns are valid values (and only on architectures that cannot track "uninitialized" bytes in RAM and yield some bit patterns for memory locations that are uninitialized for the purpose of high-level language semantics), and the thing I'm asking for is being able to turn off optimizations that could be dangerous in the presence of an adversarial other thread. AFAICT, this means that 1) the compiler must not use memory locations that the get written as spill space (i.e. must not invent reads that expect to read back previous writes intact) and 2) if the compiler generates two loads from the same memory location (either on its own initiative or because the source code showed two loads), the compiler must not assume that the two loads yield mutually-consistent values (i.e. the compiler must not optimize on the assumption that values read from the same memory location are mutually consistent).

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 26, 2019

@comex

I'm a little confused what the argument is about, but to be clear – for MMIO to work correctly, volatile loads and stores of basic integer types (specifically, ones that can fit into a register

Volatile reads and writes are generic over T so while we could guarantee more things for concrete Ts, the semantics being argued about in the OP are the generic ones that hold for all T.

Also, you just showed that even though x86 has 128-bit registers with atomic instructions, volatile reads and writes to u128 are not atomic.

So we can't say that "If T is an integer and it fits in a register in the target, volatile reads / writes are atomic".

In a 32-bit architecture, reads / writes to 64-bit integers might not be atomic, in a 16-bit architecture, reads / writes to 32-bit integers might not be atomic, etc. At best, because we only support platforms with CHAR_BITS == 8, we might be able to guarantee that 8-bit volatile reads/writes are atomic and relaxed everywhere.

@petrochenkov

(This certainly shouldn't be any kind of language-level guarantee, only a lint-like diagnostic, even if it's highly reliable.)

The unsafe code guidelines specify what guarantees is unsafe code allowed to rely on. That is, if you write generic unsafe code, can it rely on volatile reads and writes of T being atomic and relaxed ? What if T = u64 ? AFAIK the answer to both question is "No, unsafe code cannot rely on that", so I don't know how we could guarantee something that users are not allowed to rely on. I still don't know how we can write anything better than (this):

We could document that whether volatile loads and stores are atomic and synchronize in any particular way (for a particular memory addresses, memory access sizes, target, target features, target cpu, etc.) is unspecified. That is, implementations are not required to document this behavior, and the volatile loads and stores are not guaranteed to be atomic. If misuse introduces a data-race, the behavior is undefined.

This allows users that check what the backend does for a particular architecture to rely on that information, and if they mess up, the behavior is undefined.

@comex mentions that we could guarantee that this works for "integer that fit in a register", but they showed above that this is not true, e.g., on x86, where u128 fits on many x86 registers (up to 512-bit wide), yet volatile loads and stores to u128 are not atomic relaxed.

AFAICT, on a 32-bit arch 64-bit volatile loads/stores might not be atomic; on a16-bit arch, 32-bit loads/stores might not be atomic either, etc. Maybe at best we can guarantee that 8-bit wide volatile loads and stores are always atomic, by stating that we only will ever support platforms where this is the case, and if some platform does not satisfy this, we'll never support it.

Please do suggest specific text about the guarantees that unsafe code is allowed to always rely on when working with volatile loads / stores. Talking about a concrete snippet of wording is IMO easier than talking on the "abstract", because one can more easily show counter-examples that prove the wording incorrect (e.g. *mut [u8; 4096], u128, etc. ).

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 26, 2019

For my use case, I don't need colloquial atomic: that is, I don't need indivisibility. [...] This is fine for my use case [u16 example].

If we make the behavior unspecified, and you know that u8 volatile loads/stores are always "relaxed atomic" in the platforms you are targeting, you can probably just use those. AFAIK the unaligned SIMD vector load is not atomic on x86 (they generate a couple of uops), but you can use it to read [u8; 16] bytes from a *mut u8, where the platform guarantees you that it won't give you partially modified bytes.

@hsivonen
Copy link
Member Author

If we make the behavior unspecified, and you know that u8 volatile loads/stores are always "relaxed atomic" in the platforms you are targeting, you can probably just use those.

Is there a concrete need to make them all the way "unspecified" as opposed to "may return unpredictable values if the memory locations are concurrently written to"?

AFAIK the unaligned SIMD vector load is not atomic on x86 (they generate a couple of uops), but you can use it to read [u8; 16] bytes from a *mut u8, where the platform guarantees you that it won't give you partially modified bytes.

I'm OK with receiving partially modified bytes. I'm just not OK with optimizations that would introduce security bugs in that case. As soon as there's a second thread writing to the memory that I'm reading, I no longer care about what values I read and only care about not having a security bug.

@hsivonen
Copy link
Member Author

hsivonen commented Jun 26, 2019

Please do suggest specific text about the guarantees that unsafe code is allowed to always rely on when working with volatile loads / stores.

Would the following be true given what LLVM provides (and has to keep providing for real-world C use cases)?

My use case needs:

  • The compiler will not assume that it can read back a value from a location that is being written to by a volatile write (it will not use locations written to as volatile as spill space and will assume the memory may have been externally modified).
  • The compiler will not assume that two loads from a locations read by a volatile read yield the same value (it will assume the memory location may have been externally modified).
  • The compiler will not generate synchronization for volatile reads and writes (no implicit locks).

My use case doesn't need, but I gather the original purpose of volatile does:

  • For volatile reads and writes of aligned integer types of the size of usize or smaller, the load or store is indivisible (other cases may tear).
  • For volatile reads and writes of aligned integer types of the size of usize or smaller, each read or write visible in the source code results in exactly one load or store instruction generated without widening or narrowing (no compiler-invented extra loads or stores and no elimination of dead stores or repeated loads).
  • The compiler will not reorder volatile operations relative to each other, but will not prevent the CPU from performing the kind of reorderings that the CPU is permitted to perform for plain loads and stores.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 26, 2019

Is there a concrete need to make them all the way "unspecified" as opposed to "may return unpredictable values if the memory locations are concurrently written to"?

Unspecified just means that we don't specify what happens. How are "unpredictable values" any more specific than that? AFAICT "unpredictable" allows any value. Trying to be more specific here would probably require introducing a new atomic memory ordering weaker than relaxed (e.g. with support for word tearing due to concurrent writes, and specifying which values each word is allowed to take).

My use case needs:

AFAIK all of these are guaranteed by the current specification of read/write volatile in Rust.

My use case doesn't need,

AFAIK the first two are not guaranteed, don't know about the third one. I don't know if "usize or smaller" is something that we could guarantee (e.g. CHERI has 128-bit wide usize).

@RalfJung
Copy link
Member

(I don't have time to read this exploding thread fully now, hopefully I'll get to it tonight. But please keep the discussion here focused on the interaction of volatile accesses and concurrency. Things like tearing and specifying the semantics of volatile while avoiding low-level concepts such as "load instructions" already have a topic at #33, let's not duplicate that discussion.)

@comex
Copy link

comex commented Jun 26, 2019

@Centril

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

If volatile didn't exist already, I might agree with you. But read_volatile and write_volatile are stable. They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

It could perhaps be worded in terms of a single "memory access" rather than a single "instruction", but I don't see much difference; neither of those concepts can be defined without some reference to a machine model.

@gnzlbg

I still don't know how we can write anything better than this:
[..]
@comex mentions that we could guarantee that this works for "integer that fit in a register", but they showed above that this is not true, e.g., on x86, where u128 fits on many x86 registers (up to 512-bit wide), yet volatile loads and stores to u128 are not atomic relaxed.

I believe that the correct definition is inherently architecture-specific. Better than "size of a register" is "size of a general-purpose register": that works for most architectures, but not all, e.g. Wasm doesn't even have a concept of a register.

But it should be possible to establish rules like: "On x86_64, calls to volatile_load and volatile_store with integer types from 8 to 64 bits are guaranteed to perform a single access at the architectural level, of the correct size."

This certainly shouldn't be unspecified, and I don't think it should even be implementation-defined per se, in the sense that some alternate backend could decide to behave differently. It should be required for any Rust implementation targeting x86_64. (At least, unless their interpretation of "x86_64" is something so weird and nonstandard that the rule somehow wouldn't make sense for that implementation.)

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 26, 2019

They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

Does any backend guarantee this (LLVM, GCC, or Cranelift)?

This certainly shouldn't be unspecified, and I don't think it should even be implementation-defined per se, in the sense that some alternate backend could decide to behave differently. It should be required for any Rust implementation targeting x86_64. (At least, unless their interpretation of "x86_64" is something so weird and nonstandard that the rule somehow wouldn't make sense for that implementation.)

Allowing the behavior to change across implementations (targets, other backends, other toolchains) and being required to document the behavior reads like the definition of implementation defined.

"On x86_64, calls to volatile_load and volatile_store with integer types from 8 to 64 bits are guaranteed to perform a single access at the architectural level, of the correct size."

Is this true for all x86_64 hardware? EDIT: I think so, but note that this is true independently of whether the load is volatile or not.

@hsivonen
Copy link
Member Author

Unspecified just means that we don't specify what happens. How are "unpredictable values" any more specific than that? AFAICT "unpredictable" allows any value.

I'm trying to capture that the values may be weird but there is no UB. I'm not sure what the implications of "unspecified", which I believe to be a special term, are.

AFAIK the first two are not guaranteed, don't know about the third one.

They seem to be guaranteed by this C++ proposal, which I believe to try to capture the behavior of the LLVM intrinsics that Rust builds upon here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r0.pdf

My only issue with that paper is that its item "i" says that certain volatile accesses don't constitute data races, which implies there are others that might. I'd be happier if volatile accesses categorically didn't constitute data races. (Other types of accesses to the same memory locations could still constitute data races resulting in asymmetric UB, but asymmetric UB is exactly what I'm looking for here.)

@Lokathor
Copy link
Contributor

They are explicitly intended for interacting with MMIO registers, and existing code uses them for that purpose. Correctly interacting with MMIO requires the single-instruction guarantee for applicable integer types, so Rust must provide that guarantee, or something equivalent to it.

Does any backend guarantee this (LLVM, GCC, or Cranelift)?

Yes, LLVM does at least.

IR-level volatile loads and stores cannot safely be optimized into llvm.memcpy or llvm.memmove intrinsics even when those intrinsics are flagged volatile. Likewise, the backend should never split or merge target-legal volatile load/store instructions.

And then there's another paragraph that makes it extra clear that they intend you to be allowed to execute native width loads and stores as a single volatile instruction

Rationale
Platforms may rely on volatile loads and stores of natively supported data width to be executed as single instruction. For example, in C this holds for an l-value of volatile primitive type with native hardware support, but not necessarily for aggregate types. The frontend upholds these expectations, which are intentionally unspecified in the IR. The rules above ensure that IR transformations do not violate the frontend’s contract with the language.

@rpjohnst
Copy link

@Centril

It seems to me that if you want to provide functionality that is specified as deferring to hardware semantics, and which is inherently architecture-dependent, then it is better to provide these through an explicitly architecture-dependent mechanism instead of calling it "implementation defined". For this, we have the core::arch module. If a few architectures provide the same guarantee for the same family of functions, then an interface can be built over that in a third-party crate.

On top of @comex's point that this is a no-go for stability reasons, volatile isn't really arch-specific anyway. This sounds like a similar problem to guaranteed copy/move elision- we care about something that's maybe outside the abstract machine, that normally the optimizer could "as-if" away.

In this case I don't think it's too difficult to fix that, because the reason for volatile is that loads and stores can have arbitrary side effects. The implementation ("please actually emit this, don't reorder it, assume it can mutate shared state, etc") is the same thing you need for any other opaque side-effecting operation- calling a function through FFI or an unknown function pointer, etc.

There's certainly a lot of wiggle room there around which optimizations you want to enable vs which actual side effects volatile operations will perform, but it hardly belongs in core::arch.

@gnzlbg
Copy link
Contributor

gnzlbg commented Jun 26, 2019

@Lokathor the text itself does not guarantee that volatile load / stores lower to a single atomic instruction, and instead only says:

A volatile load or store may have additional target-specific semantics.

When the "Rationale" section says that "volatile load / stores of primitive types with native hardware support lower to a single instruction" it is providing a guarantee that the text above does not provide. It still does not say that this instruction needs to be atomic (e.g. cannot tear), and I asked in #llvm@freenode today and was told that LLVM does not allow frontends to query whether a type is a primitive type with native hardware support for a particular target.

@rpjohnst

There's certainly a lot of wiggle room there around which optimizations you want to enable vs which actual side effects volatile operations will perform, but it hardly belongs in core::arch.

With @Centril's solution, we could add to core::arch:

  • arch::wasm: no intrinsics
  • arch::x86: only volatile load/stores for u8/16/32
  • arch::x86_64 volatile load/store for u8/16/32/64 without tearing, and volatile load/store u128/256/512 with tearing (using different descriptive names),
  • etc.

These intrinsics could guarantee that only a single instruction is emitted, whether the load/stores are atomic or they can tear, etc. Unsafe code could then safely rely on these guarantees, and we could provide implementations of these that are independent from what LLVM lowers volatile load/stores to (e.g. using inline assembly).

The generic ptr::read_volatile<T>/write_volatile<T> intrinsics would first require generic guarantees that hold for all Ts, and then platform specific guarantees that only hold for some targets and some Ts, and users would at best need to read the docs to figure out these guarantees (and at worst, read the backend, inspect the generated code, etc.).

@asajeffrey
Copy link

Yes, by "it" I meant the SharedAddressRange. The two individual fields are each atomic, but they are not synchronized.

If process B truncates the file, there's nothing we can do about it, unless then OS gives some protection against shared memory file truncation. But otherwise, ti doesn't matter what values A reads from the SharedVec, there's no UB.

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 15, 2019

The two individual fields are each atomic, but they are not synchronized.

Is it ok for them to not be synchronized? For example, consider the initial shared memory content is (len0, addr0):

  • Process B needs a larger file, allocates it, and writes (len1, addr1)
  • Process A wants to read from the file, and it reads the inconsistent state (len1, addr0) because it happens in between Process B's atomic writes.
  • Process A access addr0 with len1

I'm also not really sure what happens if one process reads a consistent state, e.g., (len0, addr0), and creates a &[T], and after that, a different process reallocates to a different (len1, addr1). On Linux the (len0, addr0) file is only deallocated once the last process that has it opened closes it, so maybe on Linux 128-bit wide atomic accesses would be enough.

When a process reads a (len, addr) pair, does each process validate addr, and len? E.g. a non-atomic write to addr from a different process could be split into ~4 writes (once per struct field), so a process doing an atomic read of addr can end up reading "garbage".

@asajeffrey
Copy link

At the moment, SharedVec doesn't support a grow() method. If it did, the implementation would be something like update the addr then the len. since the accesses are SC, the reader will see the len update after the addr update, and the old len is valid for the new addr, so yay still no UB. This is quite brittle code, a it relies on the readers and writers using SC and doing the accesses in a particular order.

It's very important for correctness that shared memory files aren't closed or truncated while any process has an fd for it. All processes revalidate whenever they convert a shared address range into a reference.

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 15, 2019

I think that's sound, but you probably want to perform a 128-bit atomic write instead (e.g. using cmpxchg16b) if available. Consider initial state (len0, addr0):

  • Process A grows to len1, needs reallocation. Writes addr1, then len1
  • Process B grows to len2 < len1, needs reallocation. Writes addr2, then len2
  • Process C reads the vector, and gets (addr2, len1) but addr2 length is smaller than len1. When trying to get a slice, validation fails.

I think you can probably support resize and truncate as well by truncating to a new file. That is, you create a smaller file, copy there, and then replace addr and len. Processes holding the old addr and len can still access the old file, and when no processes have a reference to it, the file gets deallocated.

But yeah, as long as your OS supports creating the file in such a way that in-place truncation is not allowed (and that you can test that this is the case, e.g., in case you open a file created by another process), then the API looks sound and super useful. Let me know if you need a mach implementation, should be possible for OSX as well.

@asajeffrey
Copy link

If Rust gives me an AtomicU128 then I'll use it!

@comex
Copy link

comex commented Nov 15, 2019

Yikes. Why is that still unstable? ...Apparently because it's only supported on a few targets and requires a target_feature on x86. But other atomics were stabilized, despite not being supported on all platforms, because they're supported on most of them. Weird situation.

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 15, 2019 via email

@comex
Copy link

comex commented Nov 16, 2019

core::arch::x86_64::cmpxchg16b is unstable as well.

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 16, 2019

@comex Weird, I thought I saw a PR to stabilize that :/

@DemiMarie
Copy link

DemiMarie commented Nov 30, 2019

Pre-Pre-RFC: core::arch::{load, store}

This proposes a type Address of raw machine addresses, functions load and store for raw hardware loads and stores, and a concept of an always-valid type that can safely be cast to a byte array.

The functions proposed here have the same semantics as raw machine load and store instructions. The compiler is not permitted to assume that the values loaded or stored are initialized, or even that they point to valid memory. However, it is permitted to assume that loads via store do not violate Rust’s mutability rules.

In particular, it is valid to use these functions to manipulate memory that is being concurrently accessed from another thread. This is true no matter what method the other thread is using to do so. Therefore, they can be used to access memory that is shared with untrusted code. For example, a kernel could use them to access userspace memory, and a user-mode server could use them to access memory shared with a less-privileged user-mode process. It is also safe to use these functions to manipulate memory that is being concurrently accessed via DMA, or that corresponds to a memory-mapped hardware register. Therefore, the Rust compiler is never allowed to make assumptions about the memory accessed by these functions. It is even safe to access invalid memory with these functions, although this will cause a hardware fault.

These functions are intended to be primarily used in kernels, hypervisors, drivers, and other processes that need to access memory that can be concurrently modified by DMA or by potentially-malicious code.

type Address = usize;

The type of raw machine addresses. This is usually the same as usize, but may differ on obscure architectures.

unsafe fn load<T>(Address addr) -> T;

Load a T from a specified machine address.

For such a call to be valid, the following must hold:

  • T must have defined ABI.
  • T must not contain any padding.
  • All bit-patterns for T must be valid

Such types are called always-valid. All primitive integer types are always-valid. A struct is always-valid if and only if:

  • It has a defined ABI.
  • It does not have padding.
  • All of its fields are always-valid.

Other types may be always-valid, but this is not guaranteed. Using this function with a T that is not always-valid is a compile-time error. It is also a compile-time error if the compiler cannot guarantee that it will lower a call to a single machine instruction.

The behavior of this function is the same as that of a single load instruction on the underlying hardware. If the compiler cannot ensure this, it must emit an error. In particular, an error will always be emitted when targeting WASM.

Safety

This function never invokes UB on its own. However, it may trap, or cause side-effects via MMIO. Such side-effects may themselves cause UB.

unsafe fn store<T>(Address addr, T arg);

Stores a T to a specified machine address addr. T must be always-valid as described above.

The behavior of this function is the same as that of a single store instruction on the underlying hardware. If the compiler cannot ensure this, it must emit an error. In particular, an error will always be emitted when targeting WASM.

Safety

This function may corrupt memory, cause a hardware fault, or cause side-effects via MMIO.

@HadrienG2
Copy link

Here are some questions to help you refine this proposal:

  • Why do you need a new machine address type instead of reusing existing raw pointer types?
  • Why do you need new "defer to hardware semantics" load/store instructions when that is already what volatile is intended to do?
    • To be clear, we are all well aware that volatile isn't guaranteed to provide those semantics today, the question here is why we should introduce new semantics instead of fixing those of the volatile accesses that we already have.
  • Why can't the compiler assume that the target memory is initialized? What does the uninitialized memory UB buy you in this case, since you're reading from that memory anyway?
  • How does this integrate with the rest of the language? For example, what kinds of instruction reorderings, additions, duplications and elisions are permitted in code that contains those instructions?
  • How do we enforce these semantics at the backend level, when the closest that LLVM provides us with are volatile semantics, which are way fuzzier than what any of us would like?
  • How do you define "have defined ABI" in "T must have define ABI"?
  • Since the Rust Abstract Machine, which the language is largely specified in terms of, does not currently have a notion of traps, faults and MMIO, what do you gain by introducing those notions instead of reusing the UB semantics which we already have for such cases?

@RalfJung
Copy link
Member

RalfJung commented Nov 30, 2019

@DemiMarie could you move that pre-pre-RFC to a new thread (as an issue here or on IRLO), please? This thread here is already long enough without discussing a new RFC.

We should find a way to close this issue, but I am not sure how. Someone would have to properly summarize 120 comments... but, please don't add anything new to this issue! It is open just because we don't have the resources to write a proper summary.

@DemiMarie
Copy link

@RalfJung done: #321

@ghost
Copy link

ghost commented Sep 28, 2022

I'm working on a project where I'm running into this same question. I'm hoping I can summarize this thread and in the process check my understanding. Please let me know if any of the following is incomplete/incorrect!

A Rust program that wants to share memory with untrusted thread(s) external to the program (e.g. a kernel accessing userspace memory) needs both volatile semantics and atomic semantics.

Volatile semantics are necessary to ensure that memory accesses are not optimized away (i.e. interacting with externally shared memory needs to treated like any other IO). Atomic semantics are necessary to ensure that there are no data races. This is true even if we don't want/need to synchronize with the external thread(s).

However, neither volatile or atomic semantics on their own are sufficient. Volatile reads and writes are not exempt from data races. Using them in this context would cause undefined behavior (unless there is control of the thread scheduling to prohibit concurrency). Additionally, atomic operations in theory could be optimized away because the compiler doesn't know we're working with externally shared memory. (At least, that seems to be why C++ decided to keep around their volatile atomics when deprecating volatile.)

There is some evidence that using ptr::read_volatile/ptr::write_volatile works in practice for this, and some evidence that LLVM currently does not make optimizations that exploit data races resulting from volatile accesses.

Since we don't currently have a mechanism in the language or any library that has the right combination of semantics, it seems that the most practical recommendation right now is to use inline assembly. Assuming we understand the hardware semantics, we can avoid data races and impose the right memory ordering by using the right instructions, and we're guaranteed that the memory accesses in the inline assembly won't be optimized away.

In the future, we might adjust the semantics of ptr::read_volatile/ptr::write_volatile or atomics, or add something new to the language or standard library that better addresses this use case.

@Amanieu
Copy link
Member

Amanieu commented Sep 29, 2022

The issue with atomics isn't about unknown threads of execution. I believe the consensus from #215 is that Rust cannot assume that external threads mutating our memory don't exist, so any optimizations rely on that are invalid.

The issue is that the memory model says that if a data race occurs then the whole program is UB. It doesn't have a way to "scope" the UB to a specific thread when one thread is using atomic operations while another is using atomic operations. To interface with untrusted threads in shared memory, we would need an additional guarantee that if we only touch the shared memory with atomic operations then we can't get any UB on our side despite anything the untrusted thread does.

@RalfJung
Copy link
Member

@akiekintveld yes that sounds like a good summary!

Also see #321 for a proposal for new volatile operations.

The issue is that the memory model says that if a data race occurs then the whole program is UB. It doesn't have a way to "scope" the UB to a specific thread when one thread is using atomic operations while another is using atomic operations. To interface with untrusted threads in shared memory, we would need an additional guarantee that if we only touch the shared memory with atomic operations then we can't get any UB on our side despite anything the untrusted thread does.

That is also a problem. But it is still true that neither volatile nor atomics alone are sufficient. Even if there are other threads in the Rust AM that doesn't mean the compiler has to preserve atomic accesses in a way that is suitably observable outside the Rust AM. (Though I cannot quite imagine optimizations that might break this, I still think that communication with code outside the Rust AM is different from communication with other threads in the Rust AM.)

@Amanieu
Copy link
Member

Amanieu commented Sep 29, 2022

I suppose you could treat the untrusted thread as being outside the Rust AM (since it is compiled as separate program). At that point you can treat it as a series of assembly instructions for which all memory accesses are effectively atomic.

@VorpalBlade
Copy link

The last update on this was two years ago. Has there been any improvement to the situation or are we still relegated to unportable inline assembly?

What would be needed to push forward for a proper solution to this issue? I'm looking at the use case of shared memory IPC between a privileged process and a sandboxed process (both of which can have multiple threads).

@Amanieu
Copy link
Member

Amanieu commented Sep 30, 2024

rust-lang/rfcs#3301 should be a sufficient solution to this: the untrusted code can be considered at the assembly level, where all reads and writes by individual instructions can be treated as relaxed atomic operations. You can then just use normal atomics for synchronization and AtomicPerByte for data access in the IPC region.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-memory Topic: Related to memory accesses C-open-question Category: An open question that we should revisit Triaged Visited during a backlog bonanza meeting
Projects
None yet
Development

No branches or pull requests