Fix the slow performance on Windows #101
Conversation
I'm open to changes, so please tag me if you see this, @hosaka. I will respond as fast as possible, since my project on Windows needs this.
Hi @temeddix, thanks for the PR. I am OK with removing the unnecessary locking; this looks like something that was brought over from quamash at some point in the project. Will this not break multi-threaded Qt apps that use qasync (thread workers etc.)? At this point, isn't
It looks like this caused a number of Windows tests to get stuck.
Yeah, I'm aware of this problem. I'll try removing the
Running `pytest --full-trace` locally, the last point that causes the test to hang is:
I completely removed the Windows-specific code, which includes that

To me, it looked like there was Windows-specific code because of
I'm not sure why the tests are failing... It works properly on my PC. I will check the tests.
Is your application single-threaded? Try running pytest on a Windows machine with your changes (add `--full-trace` to see the call stack after Ctrl-C on a stuck test case). It will pass only occasionally, which indicates a deadlock somewhere; hence the presence of semaphores and mutexes in the first place, I guess.
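A minimal way to reproduce this locally (the `tests` path is an assumption; adjust it to the repository layout) is to drive pytest from Python with full tracebacks enabled:

```python
# Hypothetical invocation, not taken from the thread: -x stops at the first
# failing/stuck test, and --full-trace makes Ctrl-C dump the complete call
# stack of a hang instead of the abbreviated traceback.
import pytest

pytest.main(["tests", "-x", "--full-trace"])
```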
Yes, my app is indeed single-threaded, but isn't that what

P.S. I'm running
This reverts commit b0ff584.
Now I see why
For context: this is where the original issue was brought up in quamash: harvimt/quamash#55
What the quamash issue describes is also applicable to qasync: "Event polling is implemented using a separate thread." Not a simple change after all :) I am totally open to changes that can improve this.

Edit: on Windows, Python uses `IocpProactor` by default, as opposed to the selector-based event loop on Unix.
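For reference (a small illustrative snippet, not from the thread), you can check which event loop implementation asyncio picks on each platform; since Python 3.8 the default on Windows is the proactor loop backed by `IocpProactor`:

```python
import asyncio
import sys

# On Windows (Python 3.8+) the default policy creates a ProactorEventLoop
# (backed by IocpProactor); on Unix it creates a SelectorEventLoop.
loop = asyncio.new_event_loop()
print(sys.platform, type(loop).__name__)
loop.close()
```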
For another example of what (I believe) is a thread-safety issue with qasync around Windows IO, see m-labs/artiq#1216
This is odd, this time
I think it also affects the `test_reader_writer_echo` and `test_add_reader_replace` tests. I will take a closer look tomorrow. Thank you for digging further into this!
Thank you for your dedication :)
Hi @hosaka, could you approve the workflow, or enable the automatic workflow trigger for me?
Approved. I unfortunately don't have enough privs to modify workflow settings for the repo.
Would you mind fixing that test?

```python
def test_loop_callback_exceptions_bubble_up(loop):
    """Verify that test exceptions raised in event loop callbacks bubble up."""

    def raise_test_exception():
        with pytest.raises(ExceptionTester):
            raise ExceptionTester("Test Message")

    loop.call_soon(raise_test_exception)
    loop.run_until_complete(asyncio.sleep(0.1))
```
No problem, I added the pytest marker to

May I ask when the new version with these fixes would be available?
It's getting a bit late here. I'll merge this now and will tag a new release tomorrow; it will be pushed to PyPI automatically. Other flaky tests can be fixed at some point later.
This is an incorrect change. Pushing out a release would be a bad idea; it would be completely broken on Windows. I'm a bit surprised that this was merged without deliberating why those locks might be there in the first place, especially seeing as there was recently a bug triggered exactly by the absence of locks (#87).

What might already address the performance issues is to not hold the lock while in

In general, an alternative would be to replace the background polling thread with a tighter integration into the Qt event loop. This is entirely possible using IOCP, but would be considerably more complex, especially from Python. As long as polling is done on a background thread, though, the synchronisation between event submission and event completion will require careful attention.
Could you elaborate on what the bug was? Was it a race condition? I'm asking because the new version was working fine, very stably, in a quite complex Python project of mine:

Also, all the CI checks (except for the one disabled due to a platform limitation) were passing.

EDIT: PR > CI
Yes; in particular, the poller thread getting a completion event and attempting to
Without #87, https://github.com/m-labs/artiq (predictably) breaks in all sorts of wonderful ways, especially on Python 3.10+. I submit that "I tried, and it works for me" in general is not a good development methodology for multi-threaded code.
(You might want to try whether dnadlinger@d59585a is enough to fix your performance issues.)
Thanks, I was actually talking about mine and the CI workflows; I wasn't trying to just say "It worked fine for me." I will give the fix you mentioned a try soon :) But for now, I do believe we have to start fixing the bugs you mentioned from here. Wrapping each task in a mutex shouldn't be done, for the sake of performance.
Apologies, that perhaps came across differently than intended. What I was trying to say is that multi-threaded code is of course notoriously tricky to test exhaustively, so trying to establish correctness by running a single test case, however "complex", is not advisable.

It is a fact that this change completely breaks qasync on Windows (it introduces major correctness issues), as evidenced by e.g. ARTIQ. As mentioned, this is due to a race condition between the submission of the event (e.g.

To phrase it a different way: why else do you think the original authors of this went to the trouble of reaching into the

Whether your particular application appears to work or not is immaterial here, as it might simply happen to enforce the ordering the locks are required for in another fashion. For instance, if all the I/O operations in your particular program take some time to complete (e.g. involve network or disk requests), that delay might already be long enough to mask any potential issues.

As I mentioned above, barring a complete rewrite to remove the background polling thread, the fix is precisely to put back the locks that were removed in this PR. By the way, those locks don't actually "wrap each task in a mutex" in the sense that the lock would be held during the actual I/O until the operation is completed. Rather, they are only held during the submission of the event to IOCP (and then, on the other side, while the result is fetched and posted to user code as a Qt signal). I can well imagine, though, that even this led to bad latency characteristics in your program, as without something like dnadlinger@d59585a, the lock would mostly be held by the poller thread waiting up to 10 ms for a timeout in

Again, this change should quickly be reverted, and if a release has already been pushed out, it should quickly be followed up by a patch release as well. Otherwise, this is going to cause chaos, as e.g. a good part of the projects I'm involved in will start randomly crashing on Windows, which will reflect negatively not just on
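A minimal sketch of the locking pattern being described, assuming a stand-in proactor object (hypothetical illustration code, not qasync's actual implementation; `register`, `wait`, and `_dispatch` are placeholder names):

```python
import threading

# Hypothetical sketch: the lock makes submission (plus its bookkeeping) atomic
# with respect to completion dispatch, but is NOT held across the blocking
# wait itself, so submitters are not stalled for the full poll timeout.
class PollerSketch:
    def __init__(self, proactor):
        self._proactor = proactor          # stands in for the IOCP proactor
        self._lock = threading.Lock()

    def submit(self, op):
        # A completion can never be dispatched before registration finishes,
        # because dispatch also needs the lock.
        with self._lock:
            self._proactor.register(op)

    def poll_loop(self):
        while True:
            # Block for completions WITHOUT holding the lock.
            events = self._proactor.wait(timeout=0.01)
            with self._lock:
                for event in events:
                    self._dispatch(event)  # e.g. emit a Qt signal to user code

    def _dispatch(self, event):
        ...
```

Holding the lock only around submission and dispatch preserves the ordering guarantee while keeping the poller's blocking wait off the submitters' critical path.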
I understand your point. Would you mind adding a pytest case for those correctness issues? If not, could you point out the code in ARTIQ that might hit them, or provide a minimal reproducible example, if possible?
There's a serious performance loss on Windows. A lot of frames are being dropped, and many tasks are running seconds late. In contrast, performance on macOS is fine.
I kept wondering about this slowness on Windows until I dove into the source code of `qasync`. Then I found out that only `qasync` on Windows had those mutexes being acquired on every task switch.

I tested this PR code with my project (which involves many concurrent tasks) and checked that the performance drastically improves. With this PR, everything feels super smooth on Windows, just like on macOS.
I did see PR #87, but wrapping all the tasks (or any kind of resource) in a mutex is generally not a good idea in terms of performance. This holds in other languages as well.
Source code (`async-first` branch):

Before this change (Windows):

After this change (Windows):
Take a look at how the numbers changed:

The numbers tell us that the event loop is much faster in real-life situations without mutexes. The mutex and semaphore turned out to be very expensive when switching tasks.
The macOS version of `qasync` doesn't use a mutex, and neither does the original `asyncio`. So why does `qasync` on Windows?
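To make the task-switch cost concrete, here is a hypothetical micro-benchmark (not part of the PR; the numbers above came from my own project). It times a burst of no-op task switches under whatever event loop is installed, so it can be run once against the released qasync on Windows and once against this PR to compare:

```python
import asyncio
import time

# Hypothetical micro-benchmark: each `await asyncio.sleep(0)` yields control
# to the event loop, so the total time reflects the per-switch overhead of
# whichever loop implementation is installed.
async def time_switches(n: int = 10_000) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Runs under plain asyncio as a baseline; to measure qasync, install its
    # QEventLoop as the running loop and await time_switches() there instead.
    print(f"{asyncio.run(time_switches()):.3f} s for 10,000 switches")
```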