
Fix the slow performance on Windows #101

Merged: 12 commits merged on Nov 16, 2023
Conversation

@temeddix (Contributor) commented Nov 14, 2023

There's a serious performance loss on Windows: many frames are being dropped, and many tasks are delayed by several seconds. In contrast, performance on macOS is fine.

I kept wondering about this slowness on Windows until I dove into the qasync source code and found that only the Windows code path acquires those mutexes on every task switch.

I tested this PR's code with my project (which involves many concurrent tasks) and confirmed that performance improves drastically. With this PR, everything feels super smooth on Windows, just like on macOS.

I did see PR #87, but wrapping all the tasks (or any kind of resource) in a mutex is generally not a good idea in terms of performance. This tendency applies to other languages as well.

Source code (async-first branch):

Before this change (Windows):

[screenshot: timing numbers before the change]

After this change (Windows):

[screenshot: timing numbers after the change]

Take a look at how the numbers changed:

  • Minimum: dropped to 1/30 of the previous value.
  • Median: dropped to 1/6 of the previous value.
  • Maximum: please ignore this one; other heavy tasks were blocking at startup, which caused a single slow outlier.

The numbers show that the event loop is much faster in real-life situations without the mutexes; the mutex and semaphore turn out to be very expensive when switching tasks.

The macOS version of qasync doesn't use a mutex, and neither does the original asyncio, so why should qasync on Windows?
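
As a rough illustration (not part of the PR), the uncontended acquire/release cost of a threading.Lock can be measured like this; note that this only captures the cheap uncontended path, and in practice the larger cost is contention, i.e. waiting while another thread holds the lock during a blocking poll:

import threading
import timeit

lock = threading.Lock()

def with_lock():
    # Acquire and release an uncontended lock once per call.
    with lock:
        pass

def without_lock():
    pass

N = 1_000_000
print("with lock:   ", timeit.timeit(with_lock, number=N))
print("without lock:", timeit.timeit(without_lock, number=N))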

temeddix changed the title from "Remove unneeded mutex and semaphore" to "Remove unnecessary mutex for performance" on Nov 14, 2023
@temeddix (Contributor, Author) commented Nov 15, 2023

I'm open to changes, so please tag me if you see this, @hosaka. I will respond as quickly as possible, since my Windows project needs this fix.

@hosaka (Collaborator) commented Nov 15, 2023

Hi @temeddix, thanks for the PR. I am OK with removing the unnecessary locking; this looks like something that was brought over from quamash at some point in the project. Will this not break multi-threaded Qt apps that use qasync (thread workers, etc.)?

At this point, isn't the _IocpProactor class redundant? The select() and _poll() overrides basically do the same thing as IocpProactor in asyncio/windows_events.py. Can we remove _IocpProactor entirely and change super().__init__(_IocpProactor()) to super().__init__(windows_events.IocpProactor())?
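
A minimal sketch of the proposed simplification, assuming the Windows loop ultimately derives from asyncio.ProactorEventLoop (the class name here is illustrative, and this only works on Windows, where ProactorEventLoop and windows_events exist):

import asyncio
from asyncio import windows_events  # Windows-only module

class SimplifiedProactorEventLoop(asyncio.ProactorEventLoop):
    """Illustrative: hand the stock IocpProactor straight to the base class."""

    def __init__(self):
        # Use the standard-library proactor directly instead of a custom
        # _IocpProactor subclass.
        super().__init__(windows_events.IocpProactor())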

@hosaka (Collaborator) commented Nov 15, 2023

It looks like this caused a number of Windows tests to get stuck.

@temeddix (Contributor, Author) commented Nov 15, 2023

Yeah, I'm aware of this problem. I'll try removing the _IocpProactor class...

@hosaka (Collaborator) commented Nov 15, 2023

Running pytest --full-trace locally, the last point that causes the test to hang is:

self = <QIOCPEventLoop running=False closed=False debug=False>

    def close(self):
        """
        Release all resources used by the event loop.

        The loop cannot be restarted after it has been closed.
        """
        if self.is_running():
            raise RuntimeError("Cannot close a running event loop")
        if self.is_closed():
            return

        self.__log_debug("Closing event loop...")
        if self.__default_executor is not None:
            self.__default_executor.shutdown()

>       super().close()
qasync\__init__.py:441:

self = <_IocpProactor overlapped#=1 result#=0>

    def close(self):
        self._logger.debug("Closing")
>       super(_IocpProactor, self).close()
qasync\_windows.py:75:

@temeddix (Contributor, Author) commented Nov 15, 2023

I completely removed the Windows-specific code, including the _IocpProactor class.

It looked to me like the Windows-specific code existed because of the win32api package dependency, but honestly it didn't seem necessary.

@temeddix (Contributor, Author) commented Nov 15, 2023

I'm not sure why the tests are failing; everything works properly on my PC. I will check the tests.

@hosaka (Collaborator) commented Nov 15, 2023

Is your application single-threaded? Try running pytest on a Windows machine with your changes (add --full-trace to see the call stack after pressing Ctrl-C on a stuck test case). It only passes occasionally, which indicates a deadlock somewhere; hence the presence of the semaphores and mutexes in the first place, I guess.

@temeddix (Contributor, Author) commented Nov 15, 2023

Yes, my app is indeed single-threaded, but isn't that how asyncio works? Please let me know if I'm missing something. I'm inspecting the test code right now.

P.S. I'm running pytest to check where the problem is. Thanks :)

@temeddix (Contributor, Author):

Now I see why win32api was being used: qasync tried to provide something more than asyncio, including real network I/O operations. This is where things start to get complicated...

@hosaka (Collaborator) commented Nov 15, 2023

For context, this is where the original issue was brought up in quamash: harvimt/quamash#55
The network connection handling is part of the asyncio implementation (https://docs.python.org/3/library/asyncio-stream.html), so I guess qasync wanted to remain compatible with it (see test_qeventloop.py: test_reader_writer_echo, test_add_reader_replace, and test_regression_bug13).

@hosaka (Collaborator) commented Nov 15, 2023

What the quamash issue describes also applies to qasync: "Event polling is implemented using a separate thread" (_EventPoller/_EventWorker in _windows.py). This is a way to synchronize adding/registering events with the IocpProactor. The race condition happens in between, and the line that hangs is status = _overlapped.GetQueuedCompletionStatus.
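
A minimal, self-contained sketch of that ordering hazard; the names here (cache, submit, poller) are illustrative stand-ins and do not map one-to-one onto qasync's internals:

import threading
import queue
import time

cache = {}                   # plays the role of IocpProactor._cache
completions = queue.Queue()  # stands in for the IOCP completion port

def submit(key, callback):
    completions.put(key)     # the kernel may post the completion immediately...
    time.sleep(0.01)         # ...before the metadata gets registered
    cache[key] = callback

def poller():
    key = completions.get()          # like GetQueuedCompletionStatus()
    callback = cache.pop(key, None)  # without synchronization this may be missing
    if callback is None:
        print("completion lost: the task would never finish")
    else:
        callback()

worker = threading.Thread(target=poller)
worker.start()
submit(1, lambda: print("completed"))
worker.join()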

Not a simple change after all :) I am totally open to changes that can improve this.

Edit: on Windows, Python uses the IocpProactor by default, as opposed to the selector-based event loop on Unix.

@temeddix (Contributor, Author) commented Nov 15, 2023

[screenshot]

I do want to point out that _add_writer and _remove_writer exist in Unix asyncio, but not in Windows asyncio. Frankly speaking, I'm not sure why these private functions (the ones starting with an underscore) should be used in the tests :|

I will try to use the public functions (without the leading underscore) in the tests instead.
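
For illustration, the public readiness API looks roughly like this on loops that support it (the Windows proactor loop raises NotImplementedError for add_reader/add_writer, which is exactly the limitation discussed here); the socket pair and callback are just an example:

import asyncio
import socket

async def echo_once():
    # Works only on event loops that implement add_reader (not ProactorEventLoop).
    loop = asyncio.get_running_loop()
    rsock, wsock = socket.socketpair()
    received = loop.create_future()

    # Public API instead of the underscore-prefixed loop._add_reader().
    loop.add_reader(rsock, lambda: received.set_result(rsock.recv(16)))
    wsock.send(b"ping")
    data = await received

    loop.remove_reader(rsock)
    rsock.close()
    wsock.close()
    return data

# e.g. print(asyncio.run(echo_once())) on a selector-based loop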

@hartytp (Contributor) commented Nov 15, 2023

For another example of what (I believe) is a thread-safety issue with qasync around Windows I/O, see m-labs/artiq#1216

@temeddix (Contributor, Author) commented Nov 15, 2023

After doing many tests and experiments, I had to disable test_regression_bug13 on Windows. This is because of a limitation in Windows itself:

[screenshot]

As you can see, Python itself doesn't support add_reader and add_writer with ProactorEventLoop on Windows. There is something called HPET, but its precision is not very high.

So I don't think we are really responsible for providing proper _add_reader and _remove_reader on Windows.

I suggest disabling the regression test on Windows only, if that's acceptable (for example, with a platform-conditional skip like the sketch below). Working around this problem with a per-task mutex costs too much performance.
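
For example, a Windows-only skip could look roughly like this (the skipif decorator is standard pytest; the reason string and the test's signature are illustrative):

import sys
import pytest

@pytest.mark.skipif(
    sys.platform == "win32",
    reason="ProactorEventLoop does not support add_reader/add_writer on Windows",
)
def test_regression_bug13(loop):
    ...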

P.S. I couldn't remove _windows._IocpProactor, because _windows._ProactorEventLoop seems to really need it.

@temeddix (Contributor, Author):

This is odd; this time pytest wasn't hanging on my Windows PC. I'll take a look.

@hosaka (Collaborator) commented Nov 15, 2023

I think it also affects test_reader_writer_echo and test_add_reader_replace tests. I will take a closer look tomorrow. Thank you for digging further into this!

@temeddix (Contributor, Author):

Thank you for your dedication :)

temeddix changed the title from "Remove unnecessary mutex for performance" to "Fix the slow performance on Windows" on Nov 15, 2023
@temeddix (Contributor, Author):

Hi @hosaka, could you approve the workflow run, or enable the automatic workflow trigger for me?

@hosaka (Collaborator) commented Nov 16, 2023

Approved. Unfortunately, I don't have enough privileges to modify the workflow settings for the repo.

@temeddix (Contributor, Author) commented Nov 16, 2023

That's weird; pytest is passing on my Windows machine with Python 3.9:

[screenshot: local pytest output]

Do you mind triggering the workflow once again for me? I don't want to keep asking, but it looks like that's how this repo is set up...

In my experience, GitHub workflows can sometimes fail for no good reason.

@temeddix (Contributor, Author):

@hosaka

@hosaka (Collaborator) commented Nov 16, 2023

Would you mind fixing that pytest.mark.raises warning?

def test_loop_callback_exceptions_bubble_up(loop):
    """Verify that test exceptions raised in event loop callbacks bubble up."""

    def raise_test_exception():
        with pytest.raises(ExceptionTester):
            raise ExceptionTester("Test Message")

    loop.call_soon(raise_test_exception)
    loop.run_until_complete(asyncio.sleep(0.1))

@temeddix (Contributor, Author):

No problem, I added the pytest marker to pyproject.toml. Now pytest runs without any warning.
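
For reference, registering a custom marker in pyproject.toml typically looks like the sketch below; the exact entry in the repo may differ:

[tool.pytest.ini_options]
markers = [
    "raises: mark a test that is expected to raise an exception",
]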

May I ask when a new version with these fixes might be released?

@hosaka (Collaborator) commented Nov 16, 2023

It's getting a bit late here. I'll merge this now and mark a new release tomorrow; it will be pushed to PyPI automatically.

Other flaky tests can be fixed at some point later.

hosaka merged commit c91a915 into CabbageDevelopment:master on Nov 16, 2023 (35 of 36 checks passed)
@dnadlinger (Contributor) commented Nov 17, 2023

This is an incorrect change. Pushing out a release would be a bad idea; it would be completely broken on Windows. I'm a bit surprised that this was merged without deliberating why those locks might be there in the first place, especially seeing as there was recently a bug triggered exactly by the absence of locks (#87).

What might already address the performance issues is to not hold the lock while in GetQueuedCompletionStatus, which indeed isn't a good idea (dnadlinger@d59585a).
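
A minimal sketch of that idea, holding the lock only around the cache bookkeeping rather than across the blocking wait; the names (_cache, _wait, poll_once) are illustrative and not qasync's exact internals:

import threading

class PollerSketch:
    """Illustrative only: the lock protects the cache, not the blocking wait."""

    def __init__(self, wait_for_completion):
        self._lock = threading.Lock()
        self._cache = {}                  # completion key -> (future, callback)
        self._wait = wait_for_completion  # e.g. wraps GetQueuedCompletionStatus()

    def register(self, key, future, callback):
        with self._lock:
            self._cache[key] = (future, callback)

    def poll_once(self, timeout_ms=10):
        # Block WITHOUT holding the lock, so submitters are never stalled
        # for up to `timeout_ms` on every poll.
        status_key = self._wait(timeout_ms)
        if status_key is None:
            return  # timed out, nothing completed
        with self._lock:
            # Only the short cache lookup happens under the lock.
            future, callback = self._cache.pop(status_key)
        callback(future, status_key)  # deliver the result outside the lock, too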

In general, an alternative would be to replace the background polling thread by a tighter integration into the Qt event loop. This is completely possible using IOCP, but would be considerably more complex, especially from Python. As long as polling is done on a background thread, though, the synchronisation between event submission and event completion will require careful attention.

@temeddix (Contributor, Author) commented Nov 17, 2023

Could you elaborate on what the bug was? Was it a race condition?

I'm asking because the new version was working fine in a quite complex Python project of mine, very stably:

Also, all the CI checks (except for the one disabled due to a platform limitation) were passing.

EDIT: PR>CI

@dnadlinger (Contributor):

"Was it a race condition?"

Yes; in particular, the poller thread getting a completion event and attempting to pop from self._cache before the metadata was registered there, leading to the asyncio task never completing.

"I'm asking this because the new version was working fine in a quite complex Python project of mine, very stably"

Without #87, https://github.com/m-labs/artiq (predictably) breaks in all sorts of wonderful ways, especially on Python 3.10+. I submit that "I tried, and it works for me" in general is not a good development methodology for multi-threaded code.

@dnadlinger (Contributor):

(You might want to try whether dnadlinger@d59585a is enough to fix your performance issues.)

@temeddix (Contributor, Author):

Thanks. I was actually talking about both my project and the CI workflows; I wasn't trying to say only "it worked fine for me".

I will give the fix you mentioned a try soon :) But for now, I believe we should start fixing the bugs you mentioned from here. Wrapping each task in a mutex shouldn't be done, for the sake of performance.

@dnadlinger (Contributor):

Apologies, that perhaps came across differently than intended. What I was trying to say is that multi-threaded code is of course notoriously tricky to test exhaustively, so trying to establish correctness by running a single test case, however "complex", is not advisable. It is a fact that this change completely breaks qasync on Windows (it introduces a major correctness issue), as evidenced by e.g. ARTIQ. As mentioned, this is due to a race condition between the submission of the event (e.g. WSARecv()) and the registration of the task details in IocpProactor._register() (writing to IocpProactor._cache) on the one hand, and the poller thread calling GetQueuedCompletionStatus() and reading the task details from IocpProactor._cache on the other. As multi-threading issues go, this should be a fairly cut-and-dried matter.

To phrase it a different way: Why else do you think the original authors of this went to the trouble of reaching into the _IocpProactor internals to add the extra locking? (NB: This was also pointed out above.) Of course, it's always possible that whoever originally wrote this was wrong (in general; here, they weren't), but before just altogether removing some locks, it might be a good idea to trace what the original author's understanding might have been.

Whether your particular application appears to work or not is immaterial here, as it might simply happen to enforce the ordering the locks are required for in another fashion. For instance, if all the I/O operations in your particular program take some time to complete (e.g. involve network or disk requests), that delay might already be long enough to mask any potential issues.

As I mentioned above, barring a complete rewrite to remove the background polling thread, the fix is precisely to put back the locks that were removed in this PR. By the way, those locks don't actually "wrap each task in a mutex" in the sense that the lock would be held during the actual I/O until the operation is completed. Rather, they are just held during the submission of the event to IOCP (and then on the other side while the result is fetched and posted to user code as a Qt signal). I can well imagine, though, that even this led to bad latency characteristics in your program, as without something like dnadlinger@d59585a, the lock would mostly be held by the poller thread waiting up to 10 ms for a timeout in GetQueuedCompletionStatus().

Again, this change should quickly be reverted, and if a release has already been pushed out, it should quickly be followed up by a patch release as well. Otherwise, this is going to cause chaos, as e.g. a good part of the projects I'm involved in will start randomly crashing on Windows, which will reflect negatively not just on qasync but also on its downstream projects. Just to avoid any misunderstandings, this is not a slight against you or your contributions; it simply doesn't make sense to ship a release with a huge regression that makes the library completely unusable for other users. Standard practice in most open-source projects is to "revert quickly" and sort things out while keeping the master branch usable for everyone.

@temeddix (Contributor, Author):

I understand your point.

Do you mind adding a pytest test case for those correctness issues? If not, could you point me to the code in ARTIQ that might exhibit them, or provide a minimal reproducible example, if possible?
