Fix the slow performance on Windows #101
Conversation
I'm open to changes, so please tag me if you see this, @hosaka. I will respond as fast as possible, since my project on Windows needs this.
Hi @temeddix, thanks for the PR. I am OK with removing the unnecessary locking; this looks like something that was brought over from quamash at some point in the project. Will this not break multi-threaded Qt apps that use qasync (thread workers etc.)? At this point, isn't
It looks like this caused a number of Windows tests to get stuck.
Yeah, I'm aware of this problem. I'll try removing the
Running `pytest --full-trace` locally, the last point that causes the test to hang is:
I completely removed the Windows-specific code, which includes that

To me, it looked like there was Windows-specific code because of
I'm not sure why the tests are failing... It works properly on my PC. I will check the tests.
Is your application single-threaded? Try running pytest on a Windows machine with your changes (add `--full-trace` to see the call stack after Ctrl-C on a stuck test case). It will pass only occasionally, which indicates a deadlock somewhere; hence the presence of semaphores and mutexes in the first place, I guess.
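A minimal way to reproduce this locally (the `tests` path is an assumption; adjust it to the repository layout) is to drive pytest from Python with full tracebacks enabled:

```python
# Hypothetical invocation, not taken from the thread: -x stops at the first
# failing/stuck test, and --full-trace makes Ctrl-C dump the complete call
# stack of a hang instead of the abbreviated traceback.
import pytest

pytest.main(["tests", "-x", "--full-trace"])
```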
Yes, my app is indeed single-threaded, but isn't that what

P.S. I'm running
This reverts commit b0ff584.
Now I see why
For context: this is where the original issue was brought up in quamash: harvimt/quamash#55
What the quamash issue describes is also applicable to qasync: "Event polling is implemented using a separate thread." Not a simple change after all :) I am totally open to changes that can improve this.

Edit: on Windows, Python uses `IocpProactor` by default, as opposed to the selector-based event loop on Unix.
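For reference (a small illustrative snippet, not from the thread), you can check which event loop implementation asyncio picks on each platform; since Python 3.8 the default on Windows is the proactor loop backed by `IocpProactor`:

```python
import asyncio
import sys

# On Windows (Python 3.8+) the default policy creates a ProactorEventLoop
# (backed by IocpProactor); on Unix it creates a SelectorEventLoop.
loop = asyncio.new_event_loop()
print(sys.platform, type(loop).__name__)
loop.close()
```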
For another example of what (I believe) is a thread-safety issue with qasync around Windows IO, see m-labs/artiq#1216
This is odd, this time
I think it also affects the `test_reader_writer_echo` and `test_add_reader_replace` tests. I will take a closer look tomorrow. Thank you for digging further into this!
Thank you for your dedication :)
Hi @hosaka, could you approve the workflow, or enable the automatic workflow trigger for me?
Approved. I unfortunately don't have enough privs to modify workflow settings for the repo.
Would you mind fixing that test?

```python
def test_loop_callback_exceptions_bubble_up(loop):
    """Verify that test exceptions raised in event loop callbacks bubble up."""

    def raise_test_exception():
        with pytest.raises(ExceptionTester):
            raise ExceptionTester("Test Message")

    loop.call_soon(raise_test_exception)
    loop.run_until_complete(asyncio.sleep(0.1))
```
No problem, I added the pytest marker to

May I ask when the new version with these fixes would be available?
It's getting a bit late here. I'll merge this now and will tag a new release tomorrow; it will be pushed to PyPI automatically. Other flaky tests can be fixed at some point later.
This is an incorrect change. Pushing out a release would be a bad idea; it would be completely broken on Windows. I'm a bit surprised that this was merged without deliberating why those locks might be there in the first place, especially seeing as there was recently a bug triggered exactly by the absence of locks (#87).

What might already address the performance issues is to not hold the lock while in

In general, an alternative would be to replace the background polling thread with a tighter integration into the Qt event loop. This is entirely possible using IOCP, but would be considerably more complex, especially from Python. As long as polling is done on a background thread, though, the synchronisation between event submission and event completion will require careful attention.
Could you elaborate on what the bug was? Was it a race condition? I'm asking because the new version was working fine, very stably, in a quite complex Python project of mine:

Also, all the CI checks (except for the one disabled due to a platform limitation) were passing.

EDIT: PR > CI
Yes; in particular, the poller thread getting a completion event and attempting to
Without #87, https://github.com/m-labs/artiq (predictably) breaks in all sorts of wonderful ways, especially on Python 3.10+. I submit that "I tried, and it works for me" in general is not a good development methodology for multi-threaded code.
(You might want to try whether dnadlinger@d59585a is enough to fix your performance issues.)
Thanks, I was actually talking about mine and the CI workflows; I wasn't trying to just say "It worked fine for me." I will give the fix you mentioned a try soon :) But for now, I do believe we have to start fixing the bugs you mentioned from here. Wrapping each task in a mutex shouldn't be done, for the sake of performance.
Apologies, that perhaps came across differently than intended. What I was trying to say is that multi-threaded code is of course notoriously tricky to test exhaustively, so trying to establish correctness by running a single test case, however "complex", is not advisable.

It is a fact that this change completely breaks qasync on Windows (it introduces major correctness issues), as evidenced by e.g. ARTIQ. As mentioned, this is due to a race condition between the submission of the event (e.g.

To phrase it a different way: why else do you think the original authors of this went to the trouble of reaching into the

Whether your particular application appears to work or not is immaterial here, as it might simply happen to enforce the ordering the locks are required for in another fashion. For instance, if all the I/O operations in your particular program take some time to complete (e.g. involve network or disk requests), that delay might already be long enough to mask any potential issues.

As I mentioned above, barring a complete rewrite to remove the background polling thread, the fix is precisely to put back the locks that were removed in this PR. By the way, those locks don't actually "wrap each task in a mutex" in the sense that the lock would be held during the actual I/O until the operation is completed. Rather, they are only held during the submission of the event to IOCP (and then, on the other side, while the result is fetched and posted to user code as a Qt signal). I can well imagine, though, that even this led to bad latency characteristics in your program, as without something like dnadlinger@d59585a, the lock would mostly be held by the poller thread waiting up to 10 ms for a timeout in

Again, this change should quickly be reverted, and if a release has already been pushed out, it should quickly be followed up by a patch release as well. Otherwise, this is going to cause chaos, as e.g. a good part of the projects I'm involved in will start randomly crashing on Windows, which will reflect negatively not just on
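A minimal sketch of the locking pattern being described, assuming a stand-in proactor object (hypothetical illustration code, not qasync's actual implementation; `register`, `wait`, and `_dispatch` are placeholder names):

```python
import threading

# Hypothetical sketch: the lock makes submission (plus its bookkeeping) atomic
# with respect to completion dispatch, but is NOT held across the blocking
# wait itself, so submitters are not stalled for the full poll timeout.
class PollerSketch:
    def __init__(self, proactor):
        self._proactor = proactor          # stands in for the IOCP proactor
        self._lock = threading.Lock()

    def submit(self, op):
        # A completion can never be dispatched before registration finishes,
        # because dispatch also needs the lock.
        with self._lock:
            self._proactor.register(op)

    def poll_loop(self):
        while True:
            # Block for completions WITHOUT holding the lock.
            events = self._proactor.wait(timeout=0.01)
            with self._lock:
                for event in events:
                    self._dispatch(event)  # e.g. emit a Qt signal to user code

    def _dispatch(self, event):
        ...
```

Holding the lock only around submission and dispatch preserves the ordering guarantee while keeping the poller's blocking wait off the submitters' critical path.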
I understand your point. Would you mind adding a pytest case for those correctness issues? If not, could you point out the code in ARTIQ that might hit them, or provide a minimal reproducible example, if possible?
There's a serious performance loss on Windows. A lot of frames are being dropped, and many tasks are running seconds late. In contrast, performance on macOS is fine.
I kept wondering about this slowness on Windows until I dove into the source code of `qasync`. Then I found out that only `qasync` on Windows had those mutexes being acquired on every task switch.

I tested this PR code with my project (which involves many concurrent tasks) and checked that the performance drastically improves. With this PR, everything feels super smooth on Windows, just like on macOS.
I did see PR #87, but wrapping all the tasks (or any kind of resource) in a mutex is generally not a good idea in terms of performance. This holds in other languages as well.
Source code (`async-first` branch):

Before this change (Windows):

After this change (Windows):
Take a look at how the numbers changed:

The numbers tell us that the event loop is much faster in real-life situations without mutexes. The mutex and semaphore turned out to be very expensive when switching tasks.
The macOS version of `qasync` doesn't use a mutex, and neither does the original `asyncio`. So why does `qasync` on Windows?
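To make the task-switch cost concrete, here is a hypothetical micro-benchmark (not part of the PR; the numbers above came from my own project). It times a burst of no-op task switches under whatever event loop is installed, so it can be run once against the released qasync on Windows and once against this PR to compare:

```python
import asyncio
import time

# Hypothetical micro-benchmark: each `await asyncio.sleep(0)` yields control
# to the event loop, so the total time reflects the per-switch overhead of
# whichever loop implementation is installed.
async def time_switches(n: int = 10_000) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Runs under plain asyncio as a baseline; to measure qasync, install its
    # QEventLoop as the running loop and await time_switches() there instead.
    print(f"{asyncio.run(time_switches()):.3f} s for 10,000 switches")
```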