Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recv_into fix for esp32spi socket implementation #168

Closed

Conversation

aaftonbladet
Copy link

Resolves #148 and #165

The esp32spi socket implementations of recv and recv_into do not raise an OSError if the read times out, deviating from the behavior in CPython. In the past we accounted for this by checking whether anything was read when calling recv, if not, raise an OSError ourselves. However, in c334c81 we started to utilize recv_into rather than recv, if it was available. The new version does not have the "raise OSError ourselves if nothing was read"-workaround, because it was assumed that esp32spi wouldn't have recv_into and the old version would be used.

Ideally, this would be fixed in esp32spi for both functions. However, I think they'd be hesitant to except such a radical change - it'd likely break projects relying on the old implementation. Hence I believe we should take it upon ourselves to fix it.

Maybe it's also appropriate to entirely drop support for socket implementations that lack recv_into. Our alternative implementation utilizing recv is labled "esp32 fix", but esp32spi sockets have had support for recv_into for more than a year. I did not include this change in my PR, but I think it should be considered.

@aaftonbladet aaftonbladet changed the title recv_into fix for esp32spi socket implemenetation recv_into fix for esp32spi socket implementation May 23, 2023
@tekktrik
Copy link
Member

After rerunning the CI, it looks like pre-commit wants to reformat the code. You can fix this issue by running pre-commit locally and pushing the changes. You can find information on installing and using pre-commit here: https://learn.adafruit.com/creating-and-sharing-a-circuitpython-library/check-your-code

@tekktrik tekktrik requested a review from brentru May 29, 2023 14:51
@brentru
Copy link
Member

brentru commented May 30, 2023

@vladak As the author of c334c81, are you able to take a look?

# If no bytes are waiting, raise an OSError for good measure.
# Some implementations of recv_into do this on their own (like CPython),
# but we should be prepared if one doesn't (looking at you esp23spi...)
raise OSError(errno.ETIMEDOUT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaftonbladet Could you clarify why this is being raised here instead of within the ESP32SPI implementation on L1141? We already raise this exception on L1147.

Also - if you're intending to replace recv with recv_into, why is this segment of code modified instead of L1143?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "esp32spi implementation" is just a fallback if the socket object doesn't have recv_into, as old versions of esp32spi didn't have recv_into. Check is on L344. Since recv_into now exists in esp32spi, the regular implementation is used instead of the backwards compatible one. The regular one doesn't check that data was actually read - it relies on the callee raising an OSError. Compare that to the backwards compatible one which was made with esp32 in mind, it raises an OSError, because esp32spi doesn't.

My patch makes the behavior consistent in both implementations, we check ourselves that data was read, if not, throw an OSError as the callee should've.

if you're intending to replace recv with recv_into

recv is left as a fallback if the socket object doesn't have recv_into. I suggested dropping the recv/"esp impl" entirely as esp now has recv_into, but I didn't want to pull the trigger on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I understand. Thank you for clarifying.

recv is left as a fallback if the socket object doesn't have recv_into. I suggested dropping the recv/"esp impl" entirely as esp now has recv_into, but I didn't want to pull the trigger on that.

Would you be interested in adding this to the PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking of this a bit more, I am not convinced this is the way to go. Firstly, this adds yet another exception to the surface which the library consumers will have to deal with, at least in this code path. If I am not mistaken, the library does not catch OSError in the callers of _sock_exact_recv(). Secondly, this seems to trump the library receive timeout. If the socket times out, the library should probably wait up to recv_timeout; this ordering (recv_timeout > socket_timeout) is actually enforced in __init()__.

Lastly, I don't immediately see how exactly this fixes the the stuck loop() in #148.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank's @vladak for the input. Will wait for @aaftonbladet 's response before moving forward.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, this adds yet another exception to the surface which the library consumers will have to deal with, at least in this code path. If I am not mistaken, the library does not catch OSError in the callers of _socket_exact_recv().

All callers of _socket_exact_recv() should expect it to throw. As for whether it's an OSError, from what I can see, we filter exceptions only in one place, and that's around L857 in _wait_for_msg(). Depending on whether there's a custom exception class for timeouts in the socket module, we catch that one as opposed to the generic OSError.

With that, a bug could arise if the socket module has its own timeout class. We'd try to catch that one instead of the generic OSError. In comes my patch, where it's possible that an OSError is still thrown, which then wouldn't be caught. Maybe we can check that there's no custom timeout class before throwing the error? If there's a custom timeout class, it's likely the underlying socket will handle timeouts itself rendering my patch redundant anyhow.

And yes, it's possible library consumers would be exposed to a new type of exception. But who's this an issue for? If you call _socket_exact_recv right now and you're not catching exceptions, you'll crash. If you catch only OSError, my patch is using the same exception. If you catch the socket module's timeout exception, you're on a platform where we, after implementing the check I proposed in the previous paragraph, aren't introducing any new behavior.

Secondly, this seems to trump the library receive timeout

The socket implementation is told to use the given receive timeout; search for settimeout in the code. If there's an issue with the receive timeout not being properly set, it already exists and my code won't affect them. Though I must say it is strange. My esp32spi socket, despite the MQTT receive timeout being zero, still times out after one second. Maybe there's bug at play.

Lastly, I don't immediately see how exactly this fixes the the stuck loop() in #148.

Right, sorry for not clarifying. The issue lies in us going past L966 because no exception is thrown, and then eventually into the loop at L971. Then we're stuck in that loop for 60 seconds, which is consistent with user reports. To compensate for recv_into() not throwing on esp32spi, we throw the exception ourselves which keeps us from getting stuck in the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When traversing the code submitted to various issues, there is often the following pattern:

    try:
        io.loop()
    except (MMQTTException) as e:
        print("MQTTException: \n", e)
        time.sleep(300)
        continue

so the OSError will cause trouble for these. Personally, I don't see a reason for the OSError exception to shoot through as it is implementation detail of the MQTT library.

Looking at the loop() implementation, it will cause a problem there too. The while cycle seems to be designed around the receive timeout, yet it is strange how the function deals with timeouts in general. It overrides the socket timeout, which I don't think it should do (it also avoids the socket timeout < receive timeout check). It cycles up to receive timeout and it seems to me that the intention is that the timeout argument is meant as the overall wait time, no matter what socket timeout is set to.

If socket timeout in CPython raises OSError, this will propagate via _wait_for_msg() to loop(), breaking this behavior.

I think that the timeout argument for loop() should override the receive timeout and the function should avoid mingling with the socket timeout.

Someone noticed that too: #68 (comment)

If we go with this change, I think the _wait_for_msg() needs to catch the OSError, to make both if branches uniform (return None on socket timeout):

# CPython socket module contains a timeout attribute
if hasattr(self._socket_pool, "timeout"):
try:
res = self._sock_exact_recv(1)
except self._socket_pool.timeout:
return None
else: # socketpool, esp32spi
try:
res = self._sock_exact_recv(1)
except OSError as error:
if error.errno in (errno.ETIMEDOUT, errno.EAGAIN):
# raised by a socket timeout if 0 bytes were present
return None
raise MMQTTException from error

@vladak
Copy link
Contributor

vladak commented May 31, 2023

However, in c334c81 we started to utilize recv_into rather than recv, if it was available.

Before that change, recv_into() was already used (in that if branch); my change merely made sure that the return value of recv_into() was used in order to read the full contents (as specified by the bufsize argument). Also, the if condition itself was not changed, merely the contained block of code.

@@ -1119,6 +1119,12 @@ def _sock_exact_recv(self, bufsize: int) -> bytearray:
rc = bytearray(bufsize)
mv = memoryview(rc)
recv_len = self._sock.recv_into(rc, bufsize)
if recv_len == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some other case than timeout that could lead to the 0 being returned from recv_into(), regardless of the underlying implementation ? (ESP32, CPython, ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if we are not sure about that, it might be prudent to augment the recv_len == 0 check to verify this is actually esp32spi.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the underlying implementation, so I wouldn't know. We could add check that receive_timeout seconds have passed before raising the exception, but that would require us to correctly define the receive timeout. As I mentioned earlier, even though the timeout passed to socket.settimeout is zero, my esp32spi socket waits a second before timing out.

@@ -1119,6 +1119,12 @@ def _sock_exact_recv(self, bufsize: int) -> bytearray:
rc = bytearray(bufsize)
mv = memoryview(rc)
recv_len = self._sock.recv_into(rc, bufsize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another call to recv_into() in the while cycle below (L1134). Does it need same treatment ? (if we can agree on the treatment being correct)

Copy link
Author

@aaftonbladet aaftonbladet Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say no, it doesn't need it. The first check is just to see if there's any data pending in the socket. When we reach the while loop, we already know we have things to read.

@aaftonbladet
Copy link
Author

Before that change, recv_into() was already used [...]

I might've gotten the exact commit mixed up, it could've been d60ab23. I tested my way back and forth until I found which version it broke in.

@vladak
Copy link
Contributor

vladak commented Jun 14, 2023

I might've gotten the exact commit mixed up, it could've been d60ab23. I tested my way back and forth until I found which version it broke in.

I think you'd need to dig even further to history. If I remember correctly, the duplicate definition removal did not have any effect, it merely removed dead code.

@alison-gravley
Copy link

Is this fixed yet? I am trying to use a Matrix Portal M4 and I can't really update the display if the loop() is blocking all the time, especially if I am not getting messages very often. I messed around with asyncio a little, but it looks like it hangs the whole thing either way.

@FoamyGuy
Copy link
Contributor

@alison-gravley I believe if you are using the absolutely latest version of this library, and the absolute latest version of the ESP32SPI library in conjunction with each other the loop() function is no longer blocking indefinitely under those new versions.

My understanding is that the changes from this PR won't necessarily be needed, because we instead made changes inside of ESP32SPI to bring it's behavior more inline with CPython sockets. So this could pototentially be closed at this point unless I've misunderstood some part of it.

I cannot speak to asyncio support, I did a fair amount of testing when preparing to merge / release the latest changes across many different devices and use-cases, but asyncio is something I didn't touch at all during testing, I'm not sure how it interacts with loop() or any other functionality.

@alison-gravley
Copy link

@alison-gravley I believe if you are using the absolutely latest version of this library, and the absolute latest version of the ESP32SPI library in conjunction with each other the loop() function is no longer blocking indefinitely under those new versions.

My understanding is that the changes from this PR won't necessarily be needed, because we instead made changes inside of ESP32SPI to bring it's behavior more inline with CPython sockets. So this could pototentially be closed at this point unless I've misunderstood some part of it.

I cannot speak to asyncio support, I did a fair amount of testing when preparing to merge / release the latest changes across many different devices and use-cases, but asyncio is something I didn't touch at all during testing, I'm not sure how it interacts with loop() or any other functionality.

I should be on latest release of CircuitPython, 8.2.0. Do I need to change to a bleeding edge? I am not sure where I find the version of esp32spi.

From lib folder:
adafruit_minmqtt v7.4.0

boot_out.txt says:
Adafruit CircuitPython 8.2.0 on 2023-07-05; Adafruit Matrix Portal M4 with samd51j19
Board ID:matrixportal_m4

info_uf2.txt:
UF2 Bootloader v3.15.0 SFHWRO
Model: Matrix Portal M4
Board-ID: SAMD51J19A-MatrixPortal-v0

@FoamyGuy
Copy link
Contributor

@alison-gravley Ahh, I forgot the detail that esp32spi is frozen into the build. I think there has not been an update in the core to bump to the newest version so the frozen in one is still on the prior release. Which does make it a little bit trickier, but still do-able.

For now the easiest way to get it working is get the latest release of esp32spi from the assets downloads on this release page: https://github.com/adafruit/Adafruit_CircuitPython_ESP32SPI/releases/tag/6.0.0 get the 8.x mpy zip, download and extract it. Inside find the lib folder, and inside of that should be the adafruit_esp32spi/ directory. Copy that directory to the root of your CIRCUITPY drive (unlike normally when you'd put it inside of lib this needs to be specifically in the root directory to "override" the frozen version from the core build).

So your drive should look something like this:

CIRCUITPY/
    adafruit_esp32spi/
    lib/
        adafruit_minimqtt/
    boot_out.txt
    code.py

Putting the latest esp32spi in the root of CIRCUITPY will make it so that this version gets used instead of the one frozen in. At some point the core one will get updated to the latest and then this process will no longer be necessary, but I think for now it is.

@alison-gravley
Copy link

@FoamyGuy, thank you so much! I finally have scrolling text and can move on to fun animations.

I actually tried all of this in Arduino first, but ran into a ton of issues between adafruit libraries and changes with esp32 breaking things. I had to do some hacky things to even get the project to compile and run, but I couldn't quite get it to subscribe to the broker without getting an error.

Off-topic here, but I keep hitting a lot of bad luck on when I dig these esp32-s2/s3 boards out. Every time I think it is fun to play with such and such library I run in to a problem and find a recent GitHub post explaining the issue. With both Circuit Python and Arduino. I am just thankful all of the developers have been great and I appreciate all of you for helping your end users who have a mish-mash of capabilities and experience levels.

@toddkrein
Copy link

toddkrein commented Aug 11, 2023

Well, three days down the rabbit hole after being bitten by this on my PyPortal. At least it was educational. Pulling the latest esp32spi and putting it at the top level of CircuitPython has fixed the problem. What's the next step to encourage pulling the updated esp32spi into the CircuitPython release itself?

update: 8.2.3 still doesn't have the updated esp32spi library

@dwadler
Copy link

dwadler commented Oct 24, 2023

This was painful.
After moving my esp32-s2 MQTT app to the PYPortal, I spent hours pulling my hair trying to figure out why loop() was hanging, trying the published AdaFruit examples which hung similarly.
I finally found this issue and thanks to the suggested by "Foamguy" above, got it to work. (I was about to start poking around in the mini_mqtt source).
Isn't this going to be broken for anyone trying to run the samples on PYPortal with the current code levels?

@toddkrein
Copy link

This was painful. After moving my esp32-s2 MQTT app to the PYPortal, I spent hours pulling my hair trying to figure out why loop() was hanging, trying the published AdaFruit examples which hung similarly. I finally found this issue and thanks to the suggested by "Foamguy" above, got it to work. (I was about to start poking around in the mini_mqtt source). Isn't this going to be broken for anyone trying to run the samples on PYPortal with the current code levels?

Wow, you solved in hours... Better than my several days.
Yes, it will be an issues for everyone using this level.

@justmobilize
Copy link
Collaborator

@aaftonbladet and @dhalbert - I might recommend closing this, and if the esp32spi should raise on a condition like this, a PR should be added there so we don't have special case code in this library

@FoamyGuy
Copy link
Contributor

Closing this for now. I believe it should not be an issue any more. I think the frozen in versions were updated a while ago to support the new way of working.

If anyone else runs into this on newer versions we can re-open and dig further again.

@FoamyGuy FoamyGuy closed this Aug 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

minimqtt blocking on .loop() method
9 participants