kernel: Add _THREAD_SLEEPING thread state #81684

Open · peter-mitsis wants to merge 2 commits into main

Conversation

@peter-mitsis (Collaborator) commented Nov 21, 2024

As has been previously noted elsewhere, Zephyr's performance on the thread_metric preemptive benchmark had been observed to be significantly behind that of some other RTOSes such as ThreadX. One of the contributing factors has been the call to z_abort_thread_timeout() in k_thread_suspend(). Suspending a thread really should be orthogonal to timeouts and sleeps. This set of commits aims both to correct that and to improve Zephyr's preemptive benchmark performance. When applied atop the other performance-related PRs, this patch set gives us numbers that are about 9% better.

To decouple the two, a new thread state has been introduced: _THREAD_SLEEPING. As all the existing bits in the thread_state field are used, the size of this field must be increased beyond 8 bits. Increasing this field on its own would introduce padding gaps in the layout of the _thread_base structure. To counteract the padding, two other fields have also had their sizes modified:
  • user_options has been increased to 16 bits (it was getting close to being full).
  • cpu_mask has been made to always be 16 bits.
These changes can be expected to have an impact on third-party tools.

This decoupling also results in some behavior changes.

  1. A thread that has gone to sleep forever will no longer be resumed by k_thread_resume(). Such a thread is awakened with k_wakeup().
  2. Suspending a thread does not cancel any timeouts, thereby allowing a thread to be both sleeping and suspended, or pending with a timeout and suspended.
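To illustrate behavior change 1, here is a minimal sketch (the entry points and names are invented for the example); under this change, k_thread_resume() is a no-op on a thread that is merely sleeping:

#include <zephyr/kernel.h>

void waiter_entry(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

	k_sleep(K_FOREVER);   /* sleeping (forever), but not suspended */

	/* Execution continues here only after k_wakeup() below. */
}

void controller(k_tid_t waiter)
{
	k_thread_resume(waiter);   /* no effect: the thread is sleeping, not suspended */
	k_wakeup(waiter);          /* this is what ends the K_FOREVER sleep */
}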

Below are some performance numbers using the thread_metric preemptive benchmark with multiq on the disco_l475_iot1 (higher is better):

Main branch: 5731301
This commit atop main: 5854334
PR #81311 and #81677 together: 6501273
This commit atop both #81311 and #81677: 7034780

======================

EDIT:

The numbers reported above are for the previous version of this patch set. There have since been other changes to k_thread_suspend() that render the performance boost moot. Furthermore, the size and placement of the thread_state, cpu_mask and user_options fields are no longer changing with this patch set.

What this set of patches does is introduce the _THREAD_SLEEPING bit (in place of the now-gone _THREAD_PRESTART bit). This helps decouple sleeping states from suspended states, allowing a thread to be both sleeping and suspended. As part of this decoupling, threads are no longer created in the suspended state, but rather in the sleeping state. This seems a more intuitive state for threads that have a delay (whether finite or forever)--it is as if the first thing that they do is sleep.

One of the motivations for this stems from a model that k_thread_suspend() should not be messing with a thread's timeout.
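As a rough illustration of that model (the stack size, priority, delay, and names below are arbitrary examples, not taken from the patch):

#include <zephyr/kernel.h>

K_THREAD_STACK_DEFINE(worker_stack, 1024);
static struct k_thread worker;

static void worker_entry(void *p1, void *p2, void *p3)
{
	/* ... thread work ... */
}

void spawn_worker(void)
{
	/* With this series, the new thread spends its 50 ms start delay in
	 * the sleeping state rather than a suspended/prestart state, as if
	 * its first action were k_sleep(K_MSEC(50)).
	 */
	k_thread_create(&worker, worker_stack,
			K_THREAD_STACK_SIZEOF(worker_stack),
			worker_entry, NULL, NULL, NULL,
			K_PRIO_PREEMPT(5), 0, K_MSEC(50));
}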

@teburd (Collaborator) commented Nov 21, 2024

I believe the changes look great, and the performance gains are excellent. The commit messages, however, deserve a more detailed explanation. The why and what of the changes in each commit for this sort of PR really deserve more explanation in my opinion, particularly the last commit. Why is this netting us nearly a 10% bump in performance? How did you find this?

@peter-mitsis (Collaborator, Author):

  1. I have expanded upon the commit messages. I hope that they are better.
  2. This latest revision should fix the build errors from the previous CI run.

@teburd (Collaborator) commented Nov 21, 2024

Second commit has a small typo, otherwise the messages are very clear and helpful now. Thanks!

"At the present time, Zephyr does has overlap between sleeping and
suspending."

Likely meant

"At the present time, Zephyr does have overlap between sleeping and
suspending."

@andyross (Contributor) left a comment:

> Aside: As suspending a thread no longer aborts a thread's timeout

Isn't that a bug? What happens after this patch when you suspend a thread that is blocked in k_sleep()? What happens now (or should happen, I think, absent bugs) is that it will suspend indefinitely, and then wake up (potentially earlier than the sleep timeout) when you resume it.

But now it's going to take a timeout ISR at some point in the future and presumably un-suspend unexpectedly?

@andyross (Contributor):

Also the expansion of the thread state flags just bugs me aesthetically. We have too many of these already. Maybe we can separate the "true flag" ones (e.g. "ABORTING/SUSPENDING", "QUEUED") from the "enumerative" ones that are-or-at-least-should-be mutually exclusive (DEAD, PENDED, now SLEEPING/SUSPENDED), etc... With some work we could probably move obscure stuff like DUMMY and PRESTART into some other state and get it out of the mask byte. Likewise ABORTING and SUSPENDING are 98% the same state and could be discriminated in other ways than the flags.

@andyross (Contributor):

And finally: this is the third PR now that's come along chasing performance numbers in k_thread_suspend(), which really shouldn't need to be a performance path, IMHO. That's an obscure (and extremely race-prone!) API that real apps shouldn't be relying on. This is like chasing a bunch of Linux performance numbers by looking at the cycle times of kill().

Can someone point to the test in question?

@teburd (Collaborator) commented Nov 21, 2024

> And finally: this is the third PR now that's come along chasing performance numbers in k_thread_suspend(), which really shouldn't need to be a performance path, IMHO. That's an obscure (and extremely race-prone!) API that real apps shouldn't be relying on. This is like chasing a bunch of Linux performance numbers by looking at the cycle times of kill().
>
> Can someone point to the test in question?

https://github.com/zephyrproject-rtos/zephyr/tree/main/tests/benchmarks/thread_metric

Originally from embedded.com I believe in 2007, https://www.embedded.com/measure-your-rtoss-real-time-performance/

Stems from a report posted by Beningo (there's a pdf/slide set floating out there...) which you can see some of the results easily at https://www.embedded.com/how-do-you-test-rtos-performance/ showing Zephyr performing poorly

@andyross (Contributor):

Sigh, ew. Apologies in advance for the Torvaldsist rant, but it really has to be said. That is just embarrassingly pessimal. Basically the test has a bunch of threads it wants to run at different times, and is doing it with tm_thread_suspend/resume() calls. Which I guess the porter has mapped to k_thread_suspend/resume().

But again, those aren't Zephyr performance APIs, aren't used by actual apps in the wild, and really aren't something we should be introducing complexity to try to make fast.

We have fast APIs, it just isn't these. We should fix this to make "self-suspend thread N" be "k_sem_take(semaphore_N)", and "resume thread_N" be "k_sem_give(semaphore_N)". Semaphores are, have been, and likely always will be our go-to lightweight/fast/best synchronization primitive.

I'm not opposed in principle to making suspend/resume faster, but not at the cost of complexity and absolutely not if it turns out it's just because we were measuring the wrong thing.
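A sketch of that mapping, assuming one semaphore per benchmark thread (the shim names and the init details here are illustrative, not the actual thread_metric porting layer):

#include <zephyr/kernel.h>

#define TM_NUM_THREADS 10

/* Assumed to be k_sem_init()'d to (0, 1) during benchmark setup. */
static struct k_sem tm_sem[TM_NUM_THREADS];

/* "Suspend myself": block on my own semaphore. */
void tm_self_suspend(int thread_id)
{
	(void)k_sem_take(&tm_sem[thread_id], K_FOREVER);
}

/* "Resume thread N": give that thread's semaphore. */
void tm_resume(int thread_id)
{
	k_sem_give(&tm_sem[thread_id]);
}

This only covers the self-suspend/resume pattern the benchmark exercises; it is not a general replacement for k_thread_suspend().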

@teburd (Collaborator) commented Nov 21, 2024

> Sigh, ew. Apologies in advance for the Torvaldsist rant, but it really has to be said. That is just embarrassingly pessimal. Basically the test has a bunch of threads it wants to run at different times, and is doing it with tm_thread_suspend/resume() calls. Which I guess the porter has mapped to k_thread_suspend/resume().
>
> But again, those aren't Zephyr performance APIs, aren't used by actual apps in the wild, and really aren't something we should be introducing complexity to try to make fast.
>
> We have fast APIs, it just isn't these. We should fix this to make "self-suspend thread N" be "k_sem_take(semaphore_N)", and "resume thread_N" be "k_sem_give(semaphore_N)". Semaphores are, have been, and likely always will be our go-to lightweight/fast/best synchronization primitive.
>
> I'm not opposed in principle to making suspend/resume faster, but not at the cost of complexity and absolutely not if it turns out it's just because we were measuring the wrong thing.

This is exactly the same API style found in ThreadX (tx_thread_suspend/tx_thread_resume) and FreeRTOS (vTaskSuspend/vTaskResume). The obvious choice is to do the same for Zephyr. The non-obvious choice is I guess whatever was undocumented as the performance API version of these. Not to be too snarky here, really, but we can't expect people to do the non-obvious thing. Particularly if all the other options have like-named things.

@peter-mitsis (Collaborator, Author):

> But now it's going to take a timeout ISR at some point in the future and presumably un-suspend unexpectedly?

No. A suspended thread is blocked, but a blocked thread is not necessarily suspended, and sleeping != suspended. When we suspend a thread, we are telling it "You will not run for as long as you are suspended. You may get your resources if you are pended on an object, but you will not run if you are suspended. Your timeouts may expire, but you will not run if you are suspended. The only way to get not-suspended is to be resumed." The world passes you by when you are suspended.

Do I think that the thread_metric preemptive benchmark is great? No. But it is consistent and not unreasonable. Furthermore, it is out there, and Beningo's report and the perception of Zephyr need to be addressed.

@andyross (Contributor):

What I wrote is too complicated. Simple version: "abort timeout" should just be a quick check that a pointer field is null. If that's not what it's doing, let's fix that first before re-architecting sleep.

@andyross (Contributor):

Another suggestion: very likely what our competitors are doing in that code is detecting the case where we're suspending the current thread and then implementing an optimized handler[1]. Whereas our suspend implementation is a general thing intended to asynchronously stop a thread in any state (and across CPUs!). Maybe we could add a test for that at the very top and do even better still?

[1] Among other things, _current never has a timeout, heh.

@andyross (Contributor):

Yeah, I'm liking that better. So with as pretentious a flourish as I can muster, here's my mostly-but-not-entirely untested (it runs tests/kernel/common and tests/kernel/sched/schedule_api on qemu_cortex_m3) attempt at "Kick ThreadX's butt on a dumb benchmark" patch:

diff --git a/kernel/sched.c b/kernel/sched.c
index eda1a3e0908..cf5094639cc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -499,6 +499,21 @@ void z_impl_k_thread_suspend(k_tid_t thread)
 {
        SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_thread, suspend, thread);
 
+       /* There are benchmarks in the wild which use
+        * k_thread_suspend() to force a context switch, which is not
+        * really an appropriate use of the API (suspend is inherently
+        * unsynchronized and deadlocky!).  But make it fast just for
+        * numbers.
+        */
+       if (thread == _current && !IS_ENABLED(CONFIG_SMP)) {
+               K_SPINLOCK(&_sched_spinlock) {
+                       z_mark_thread_as_suspended(thread);
+                       dequeue_thread(thread);
+                       update_cache(1);
+               }
+               return;
+       }
+
        (void)z_abort_thread_timeout(thread);
 
        k_spinlock_key_t  key = k_spin_lock(&_sched_spinlock);

Basically this just detects the specific case the benchmark is hitting and extracts the minimal work required to suspend the current thread, doing nothing else. The SMP test is there just to make the dequeue_thread() call unconditional, you'd otherwise have to test is_queued(). And as we aren't obviously losing SMP benchmarks to UP-only toy systems, I didn't see the point.

@andyross (Contributor) commented Nov 21, 2024

> The SMP test is there just to make the dequeue_thread() call unconditional, you'd otherwise have to test is_queued().

Which is wrong, sorry. _current is never queued on SMP, so you'd just do this to support both:

// oops

Never mind. There's more stuff to do in SMP like handling async racing aborts and stuff and it'll never be this simple there. So the original as written is best.

@peter-mitsis (Collaborator, Author):

Your proposal still has suspend muck with the timeout, which it should not do.

A thread pending on a semaphore with a timeout should not lose the timeout if the thread is suspended. The following sequence should be valid ...

  1. @t(0) a thread waits on a semaphore for 60 seconds.
  2. @t(5 seconds), that thread is suspended.
  3. @t(20 seconds), that thread is resumed.
  4. @t(60 seconds) the semaphore times out and that thread wakes.

Presently, Zephyr behaves incorrectly--it aborts the timeout on suspend and turns it into a K_FOREVER. One of the aims of this patch set is to correct that. The manner in which it corrects it both introduces _THREAD_SLEEPING and more clearly separates suspend from sleeping. In the process, it also improves the benchmark performance.
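Roughly, that sequence as a shortened test sketch; the timings and names are invented, and the -EAGAIN expectation describes the intended behavior of this PR rather than current main:

#include <zephyr/kernel.h>
#include <errno.h>

K_SEM_DEFINE(test_sem, 0, 1);    /* never given: the wait should time out */

static void waiter_fn(void *p1, void *p2, void *p3)
{
	/* Should return -EAGAIN roughly 600 ms after the call, even though
	 * the thread spends part of that window suspended.
	 */
	if (k_sem_take(&test_sem, K_MSEC(600)) != -EAGAIN) {
		k_oops();   /* unexpected under this PR: the timeout should still fire */
	}
}

static void exercise(k_tid_t waiter)
{
	k_sleep(K_MSEC(50));
	k_thread_suspend(waiter);   /* with this PR, the 600 ms timeout keeps ticking */
	k_sleep(K_MSEC(150));
	k_thread_resume(waiter);    /* thread goes back to pending on the semaphore */
}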

@andyross (Contributor):

Huh. Looks like it actually works. Submitted in #81737

@andyross (Contributor):

> Presently, Zephyr behaves incorrectly

That's a reasonable argument. But it's behaving as documented, I believe, and it's done it for as long as the APIs have existed. Suspending a thread overrides any sleep it might be doing and guarantees the kernel will do no work on behalf of the thread in the future, until it's resumed (at which point it returns from k_sleep(), potentially earlier than intended -- again early exit from k_sleep() is a documented behavior).

I'm not saying no to redoing sleep (though I'd really like to see it done in a way that doesn't duplicate code), I'm just freaking out that we're trying it out of desire to chase a benchmark.

@peter-mitsis (Collaborator, Author):

Originally, it was explicitly documented that the timer would keep ticking. The documentation for k_thread_suspend() in kernel.h was changed back in 2019 (see SHA ID 50d0942) as part of #20947 (for issue #20033).

There is documentation that has been unchanged since it was first written back in 2015 under doc/kernel/services/threads/index.rst wherein there is a note clearly stating that k_sleep() is different from suspending a thread.

@andyross (Contributor):

Nice catch, I have no memory of that exchange. But again, doesn't this patch just re-open #20033? That user viewed the behavior in this PR as a fatal bug. The behavior I think the old documentation wanted to see, but that never actually existed, doesn't exist here either, no? If you suspend a thread after this merges, and it has a live timeout, that timeout will wake it up unexpectedly. Or am I misreading? Needs a test for sure.

@peter-mitsis (Collaborator, Author):

I can write a test and add it to the patch. :)

@andyross (Contributor) commented Nov 22, 2024

Came back up to check. Yeah, all but certain this PR needs more. Right now the timeout handler looks like this and calls directly into z_sched_wake_thread(). That's going to unconditionally[1] mark the thread as not suspended and then call into ready_thread(). So it's guaranteed to ignore whatever suspend might have done.

https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/sched.c#L614

[1] It does have a test for dead/aborting threads, which looks suspiciously needless to me as I'm pretty sure abort is now race free. It's possible this is some vestigial code trying to address misbehavior in abort that could now be removed, though I'd be moderately afraid of touching this. Digression, but bears investigation.

@peter-mitsis (Collaborator, Author):

> ... Right now the timeout handler looks like this and calls directly into z_sched_wake_thread(). That's going to unconditionally[1] mark the thread as not suspended and then call into ready_thread().

The 3rd commit in this set updates z_sched_wake_thread() to replace the unconditional not suspended with an unconditional not sleeping.
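In sketch form, the intended semantics look roughly like this; wake_on_timeout() and z_mark_thread_as_not_sleeping() are illustrative names rather than quotes from the commit, and the in-tree z_sched_wake_thread() carries more detail than shown:

/* Sketch: timeout expiry ends a sleep, but says nothing about suspension. */
static void wake_on_timeout(struct k_thread *thread)
{
	z_mark_thread_as_not_sleeping(thread);

	/* A thread that is also suspended stays parked until
	 * k_thread_resume(); only an unsuspended thread is readied.
	 */
	if (!z_is_thread_state_set(thread, _THREAD_SUSPENDED)) {
		ready_thread(thread);
	}
}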

@andyross (Contributor):

Swinging back. And indeed, thanks. I'd immediately looked for a change to the timeout handler and clearly missed that. Definitely still needs a test for the new behavior, but I'm convinced this is correct.

Still pretty ick'd about the extra byte in the flags though, as those are sort of a mess and I'd really like to see them shrink and not grow. Maybe we could start with some refactoring there to make room first before adding the byte and rearranging the thread struct?

One idea that occasionally occurs to me is that "PRESTART" is almost entirely useless. We should just initialize the thread in the SUSPENDED state (or now, I guess, SLEEPING if it has a timeout set at creation time). Though that requires more work to k_thread_start() to unify it with k_thread_resume().

Also states like DEAD and DUMMY don't really reflect "thread" states at all as their thread structs are degenerate (literally the only valid data in them is the flag bits!). We could detect that by, for example, unioning any other pointer in the struct (maybe the head pointers in the wait_q node) and aiming at a special "dead/dummy" record which can never be valid (maybe a backpointer to the thread struct itself). This might actually save code size as you could test it with a single load and no bit ops.

Likewise PENDING can be tested by looking at the link status of the wait_q pointers, it dates from the days before dlist_node_is_linked(). (Though the rbtree backend leaves the pointers garbage on unlink I think and would need to be updated?)
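For example, a hedged sketch of that derived check (the qnode_dlist field name is an assumption about _thread_base, and the rbtree caveat above still applies):

#include <zephyr/kernel.h>
#include <zephyr/sys/dlist.h>

/* Sketch: derive "pending" from wait_q linkage instead of a state bit. */
static inline bool thread_is_pending(struct k_thread *thread)
{
	return sys_dnode_is_linked(&thread->base.qnode_dlist);
}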

Likewise QUEUED should be synonymous with the run queue node being linked, plus maybe some knowledge of code state during scheduler operations. Though here I have a vague memory of having tried to fix this and failed, so it might be too subtle to be worth it.

Do all of those and we get to a point where the only thread states we have are ABORTING, SUSPENDING, and SUSPENDED (and now SLEEPING). All states that actually do apply as pretty obvious flags on live threads. We could keep going at that point and have a "RUNNING" state marked somewhere, because the first two only apply to runnable threads and the last two only to non-runnable ones, so they don't need separate spots. Though at that point I think no one is going to care.

But ideally we could make at least one bit of space without too much trouble (my guess would be PRESTART would be easiest) and merge it before this series. I'll see if I can find time over the weekend.

@nashif (Member) commented Nov 23, 2024

Yeah, I would start with PRESTART, DEAD and DUMMY. Zephyr seems to be the only RTOS out there with such specialized and rarely used thread states.

@andyross (Contributor):

See #81885 for a first cut at a thread state diet patch.

@zephyrbot added labels: area: Kernel, area: CMSIS API Layer, area: mcumgr, area: Portability, area: Benchmarks, platform: Infineon (Nov 28, 2024)
/* Thread started in sleeping state. Switch to suspended state */

k_thread_suspend(&test_thread[thread_id]);
k_wakeup(&test_thread[thread_id]);

Member:

I do not get this sequence here: it is sleeping, we suspend it, then we wake it up?

Contributor:

The point is to mark the thread "suspended" (which this test requires, and which brand new threads no longer qualify as) without letting it actually run, I suspect.

Collaborator (Author):

The thread_metric benchmarks assume that the thread was started in the suspended state. Transitioning from a sleeping to a suspended state allows us to maintain that model for the benchmark and keep tm_thread_resume() simple (which in turn helps with the performance numbers).

@andyross (Contributor) left a comment:

One big worry about a deadlock (my memory is fuzzy but I'm like 70% certain that operation was outside the lock for a reason), plus some nitpicking.

kernel/sched.c Outdated
@@ -1135,8 +1133,17 @@ int32_t z_impl_k_sleep(k_timeout_t timeout)

/* in case of K_FOREVER, we suspend */

Contributor:

This comment is a lie now

kernel/sched.c Outdated
@@ -1135,8 +1133,17 @@ int32_t z_impl_k_sleep(k_timeout_t timeout)

/* in case of K_FOREVER, we suspend */
if (K_TIMEOUT_EQ(timeout, K_FOREVER)) {
k_spinlock_key_t key = k_spin_lock(&_sched_spinlock);

Contributor:

Actually pretty sure this whole case can be removed now, no? Passing K_FOREVER to z_add_timeout() is a noop.

Collaborator (Author):

Will do.

kernel/sched.c Outdated

(void)z_swap(&_sched_spinlock, key);

__ASSERT(!z_is_thread_state_set(arch_current_thread(), _THREAD_SUSPENDED), "");
__ASSERT(!z_is_thread_state_set(arch_current_thread(), _THREAD_SLEEPING), "");

Contributor:

That's an ancient, sorta voodooish assert. Might be best to remove it unless there's genuine concern we've added a new state mistake.

}

k_spinlock_key_t key = k_spin_lock(&_sched_spinlock);
(void)z_abort_thread_timeout(thread);

Contributor:

Pretty sure this is introducing a deadlock. You hold the scheduler lock here and are calling into the timeout subsystem, which will take the timeout lock. If something can hold the timeout lock and call the scheduler, then we die.

Somewhere in the source is a comment describing the relationship between the two locks, but I can't find it right now. In any case, the idea is that thread timeouts are safe and idempotent if they happen on a runnable thread, so the approach here is to cancel the timeout probabilistically and just accept that the thread might race on the wakeup and end up with a garbage timeout after being awoken.

Collaborator (Author):

Great catch! I forgot to double check about that ... will follow up on that.

Collaborator (Author):

I've been pondering this more and doing more reading of the code trying to keep an eye open for races.

As far as I can tell, we do have cases where _sched_spinlock and timeout_lock are both held, and it seems OK as long as _sched_spinlock was locked first and unlocked last.

On SMP, I think it is better to put the call to z_abort_thread_timeout() in k_wakeup() within the section where _sched_spinlock is held. It may still be OK to have it outside this section, but I find the logic to prove it gets more complicated (and easy to make a mistake in).

If the call to z_abort_thread_timeout() precedes the locking of _sched_spinlock, consider the following ...

  1. CPU1 is executing thread1 and CPU2 is executing thread2.
  2. CPU1 calls k_wakeup(thread2) and, on CPU2, thread2 calls k_sleep().
  3. CPU1 aborts thread2's non-existent timeout.
  4. CPU1 and CPU2 start competing for _sched_spinlock, but CPU2 wins.
  5. CPU2 finishes k_sleep() for thread2 (creating a new timeout for it) and unlocks _sched_spinlock.
  6. CPU1 gets _sched_spinlock and finishes its call to k_wakeup(thread2); however, due to the order of things, thread2's timeout is not aborted.

This leads to the question "What happens when thread2's timeout expires and it was not sleeping?"

  • I think it's ok, but proving it gets complicated and error prone.
    That complexity can be bypassed by putting the call to z_abort_thread_timeout() within k_wakeup()'s _sched_spinlock locked section.
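A simplified sketch of that ordering, with the timeout abort inside the locked section; this is a rendering of the idea rather than the literal k_wakeup() implementation, and z_is_thread_sleeping()/z_mark_thread_as_not_sleeping() are names assumed from the series' naming convention:

void z_impl_k_wakeup(k_tid_t thread)
{
	k_spinlock_key_t key = k_spin_lock(&_sched_spinlock);

	/* Abort the timeout while _sched_spinlock is held so it cannot
	 * interleave with a concurrent k_sleep() on another CPU (see the
	 * sequence above).  _sched_spinlock is taken before timeout_lock,
	 * matching the existing lock ordering.
	 */
	(void)z_abort_thread_timeout(thread);

	if (z_is_thread_sleeping(thread)) {
		z_mark_thread_as_not_sleeping(thread);
		ready_thread(thread);
	}

	z_reschedule(&_sched_spinlock, key);
}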

At the present time, Zephyr does have overlap between sleeping and
suspending. Not only should sleeping and suspended be orthogonal
states, but we should ensure users always employ the correct API.
For example, to wake a sleeping thread, k_wakeup() should be used,
and to resume a suspended thread, k_thread_resume() should be used.
However, at the present time k_thread_resume() can be used on a
thread that called k_sleep(K_FOREVER). Sleeping should have nothing
to do with suspension.

This commit introduces the new _THREAD_SLEEPING thread state along
with some prep-work to facilitate the decoupling of the sleeping and
suspended thread states.

Signed-off-by: Peter Mitsis <[email protected]>
Sleeping and suspended are now orthogonal states. That is, a thread
may be both sleeping and suspended and the two do not interact. One
repercussion of this is that suspending a thread will no longer
abort its timeout.

Threads are now created in the 'sleeping' state instead of a
'suspended' state. This dovetails nicely with the start delay that
can be given to a newly created thread--it is as though the very
first operation performed by a thread with a start delay is a sleep.

Signed-off-by: Peter Mitsis <[email protected]>