-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MQTT disconnects every minute, re-publishes all data on connect #2840
Comments
The connection timeout reported is probably the timeout of the broker, not the ESP side. |
As far as I understand mosquitto & MQTT, the client decides the keepalive timeout. The mosquitto broker appears able to put a limit on it, in MQTT v5+ only, but not to fix it. From what I can find, its default timeout seems to be 60 seconds; that should result in clients being cut off if they've not sent a ping for 90 seconds (spec calls for a +50% tolerance). The client could request any timeout up to 18 hours in its initial connection request. From the MQTT spec:
Here's a snippet from my mosquitto log file; interesting to see what the broker is seeing:
The socket error is interesting - suggests that the ESP has either dropped the connection or had some other problem with it. It seems to happen with this ESP, 42 seconds after the connection is first made. I'm not sure what's going on here. If the broker timeout is 60s, the client end is 10s and it's the client's responsibility to initiate the ping, then we should never get anywhere close to the 90s broker limit. Maybe the broker is trying to send the ping response message when it hits that socket error. If you can think of how to get any more information about what's going on at the ESP end to cause that socket error, I'm happy to try some more debugging... |
Can you keep sending a ping from any host on the network to this ESP? |
I set a 2-second repeating ICMP ping running from the server to that ESP and the MQTT problems continued; the ICMP ping was fine, no packets dropped at all. Also, I have the web UI of the ESP open while I'm doing this, watching its log, which is working fine. It looks like it's not a generic network issue, rather something MQTT-specific. Sending a MQTT PINGREQ from anything else won't be possible, as MQTT is client-to-server. I will try subscribing the ESP to some random topic and using mosquitto_pub to spam that topic every 5 seconds, see if that keeps it alive. How do (can?) one convert the ESP log timestamps into real times? If I could line up the broker and ESP logs, we might be able to see which disconnects first (I suspect the ESP as it says 'timeout', but I think we need to confirm that). |
Thought I'd try it from the ESP side first...
|
Re-flashed the ESP with the 20191208 test binary to get the MQTTImport plug-in. With MQTTimport configured, but nothing publishing its topic, we get disconnections roughly every 30 seconds! If I spam the MQTTimport topic from another machine, the ESP is still disconnecting every 30s, even with spamming every 5s. Analysis:
|
[Sorry if you get repeated notifications of edits... trying not to make the thread any longer than necessary!] Test broker worked fine, no disconnections and the PINGREQ/PINGRESP pairs are there. In the end, I had to run mosquitto in the foreground on the live system with debugging on; not pleasant, but I think it may have exposed what's causing the problems. On my live server, I saw the socket errors; they were coming after a barrage of domoticz/out messages being sent to the ESP - sometimes as many as 20 or 30 messages in a couple of seconds. I think that what's happening, is that the ESP is simply being swamped and either the MQTT client crashes out or simply stops responding, causing the server to see the ESP as unavailable. I've found a workaround, I think: if the ESP doesn't need to subscribe, then set its subscription topic to 'notinterested'. This needs a restart of mosquitto, as the ESP doesn't seem to unsubscribe and disconnect cleanly when its config is changed, so otherwise the broker remembers its old subscription. Fortunately, this applies to most of my ESPs :-) Does the 'swamped' hypothesis sound correct, TD-er? If so, is there anything we can do with the ESPEasy controller parameters (timeout, queue, etc) to lessen its impact for ESPs that do need to subscribe to domoticz's spammy output? I've also suggested to the domoticz developers, that its protocol needs to be changed to include the 'idx' in the MQTT topic - e.g. 'domoticz/out/999' - so that devices have the option to subscribe more precisely and reduce the network traffic. |
Latest versions of ESPEasy are very stable in my opinion. |
You don't appear to be using domoticz. Like I said, the issue appears (subject to TD-er, who knows more than either of us, confirming my hypothesis) to stem from the ESP's inability to cope with the volume of MQTT traffic created by domoticz's broadcast approach. |
Interesting; potentially a number of things combining to overload the ESP then. My 'spammer' script caused the ESP to drop when it was sending around 25~30 messages in a second. They weren't huge messages, about 50 bytes. It didn't have to be sustained; one second like that was enough to break things. MQTT keepalive timeout, FYI, is set by the client; mosquitto has no control over it. I do have a lot of sensors - screenshot below is just the temperature page; utility page is even longer - so high message volume is quite likely if several reporting intervals coincide. Some of my ESPs are publishing 4 or 5 sensors each time; I guess I could adjust the reporting intervals of each sensor by a few seconds +/-, might help there. I changed all my ESPs to subscribe to a dummy topic where possible and they're happy little things now; with less in the log, I then spotted that devices running Tasmota were suffering in the same way... so the problem is something that ESP and Tasmota have in common (which I understand is quite a lot) - buffer space definitely plausible, either at the MQTT or the TCP level. I'll pester the domoticz dev's a bit about their spammy messaging approach; I think that's inefficient on large systems and unwise for small-CPU IoT devices, so I'd like them to think about some small improvements to it. Do you want to close this issue, or keep it open for investigation of what's breaking at the ESP end and whether the performance could degrade somehow more gracefully? For what it's worth, I think that a feature request to make the keepalive timeout configurable would be useful; and also the issue of not unsubscribing when the ESP MQTT config is changed needs a fix (though low-priority). I have a couple of other ideas, which I'll raise if they've not already been raised... and send a few coffees over via Patreon! |
Well the subscribing itself should not really be an issue as they are not really causing a lot of memory allocations. Tasmota is using our version of PubSubClient, so there should not be a lot of difference in there. |
I had the same issue. I changed the mqtt settings in Domoticz to: Hierarchical - publish outgoing messagen on topic domoticz/out/${floorplan name}/${plan name} |
Hmm did not know about this feature of Domoticz. What do you use as floor plan name and plan name? |
I don't think it matters what names you use. Just create a room/plan (I used "Tuin" (Dutch for garden)) and then select the devices you want in that room/plan, then create a floor plan (I used "Lunar") and put that room in your floor plan. That's all there is to it. Interesting is that when you choose Combined in Domoticz, the Disconnections appear again. I had not expected this. |
Hmm that's strange indeed. I have not seen disconnects, but it may have to do with the ESP subscribing to the same topic as all nodes do which may cause a lot of traffic. |
Perhaps it is also worth adding that I use 38 shelly devices that all communicate via mqtt. Each device has more than one IDX. A total of 185 IDXs are shouting on the mqtt channel. I can imagine a very compact ESP28 device having trouble evaluating all those messages. |
Do you also have logging enabled to either a log server or serial (or both) ? |
I'm sorry. I don't have any logging enabled. Only log I use is the standard log in de Web-UI. This ESP is my first and only one and I only got it up and running for three days now. I'm happy to provide some logfiles but I have no clue how to set it up (log server). |
Nope, that's fine. |
OK, seems that we should update the Domoticz controller(s) to support the new hierarchical structures. However I do not understand why you can now use "Index" since ESPEasy will not publish messages like this. |
That setting is for publishing Domoticz's outgoing messages only. So that's why published messages from ESP to Domoticz will keep working. The solution only worked for 1 ESP. My 'ESP_Easy_WATERMETER_1' has stable connection, my 'ESP_Rfid_reader_2' is still reconnecting every approx. 20 sec. Both ESP's have (almost) same config. mosquitto.log
On the manpage I read 'Only the first controller can be a MQTT capable controller.' What is meant by 'controller'? The configured controller in the ESP? Or the node-number of the ESP? (Since my 'ESP_Easy_WATERMETER_1' has node 1 and is working...) |
The controller in ESPEasy. |
It seems the ESP looses its connection every 20 seconds. |
'Almost same config', but different builds...
|
What is the Client Timeout in your controller settings on ESPEasy? Does this behavior change when you start sending pings from whatever machine in your network to the ESPEasy node that reconnects? Can you enable the TimingStats on the Tools-Advanced page of that reconnecting node and let it run for a few minutes and then visit the TimingStats page (button on Tools page)? |
What interests me most is why only ESPEASY and not the other devices (like tasmota)
Log from 09:00 - xx (publish timer at 30 sec)
|
Maybe Tasmota is using a shorter keep-alive setting? |
Mmm 15?
v2.7 ESPEasy |
It might be interesting to follow this PubSubClient issue: knolleary/pubsubclient#1045 🤔 |
Maybe related to some QoS flags? Maybe Domoticz recently changed its QoS flags when sending messages? Maybe there is more memory reserved on the 'working' docker image for delivering messages to subscribers? |
The working docker has no active domoticz attached to it :) But question why would my espeasy (jablotron) be interested in domoticz/out ?? It only publishes to different topics and domoticz/in |
The Domoticz MQTT controller subscribes to it so it can receive messages sent by Domoticz. However it was a rather poor scaling design decision of Domoticz as you all publish to the same topic and subscribe to the same topic. |
Or you can check the checkbox "Unique Client ID on Reconnect:" or if your setup doesn't allow for unique client IDs, you could try "Clean Session:" |
The theory still stands. No Timeouts after modification. (Mod is to re-add domoticz mqtt and the non existent topic) So it could be the issue that:
Log after modification! No Timeouts
|
And what if you check the checkbox to start with a clean session? |
But why should you unsubscribe right after a new login? |
I think it might have something to do with the fact that the client can be offline for a while (battery operated e.g.) and then when it wakes up it receives all the messages it 'missed' during its sleep. But hey I'm not an MQTT expert. |
OK, that last one may actually be a bug, probably in the message queue. How do you switch between MQTT controllers? |
I think technically it is possible to have both enabled -> Red warning, then disable the other one -> no error/warning shown. But it may perhaps not clear the MQTT broker connection if you do this. |
Ok I have the mosquitto log running (tail -f): I Disable the Domoticz MQTT and press Submit.
I move over to openHab MQTT and enable it. And Press Submit
We notice a workable situation:
Just a test with unchecked 'Unique Client ID on Reconnect:'
Works just as well. Reading through this: |
Domoticz MQTT this way: Maybe how Tasmota Solved it. arendst/Tasmota#20299 (comment) In versions before v12.0.1 this lead to a lot of processing before finally determine that the message was not for this specific device. After v11.1.0 an additional detection was introduced to allow early bail-out of irrelevant Domoticz calls. This detection only allows for a topic as defined in DOMOTICZ_OUT_TOPIC which is default domoticz/out._ But I assume that only works if domoticz/out is not flat but domoticz/out/123 |
Would it be a similar case with this one: #4996 (comment) |
It might be. |
The problem isn't about the amount of subscribed topics as that's sent only once and dealt with by the broker. Like I said, the design choice made by Domoticz back in the day to have a single topic for all messages sent to Domoticz and one for messages sent by Domoticz isn't really a scalable choice. So if you only receive data meant for the single node receiving it, then I don't see how it will become an issue. However if you receive > 10 messages a second then it will become an issue for WiFi connected devices as you typically will only receive data once every 102.4 msec. One trick to keep the WiFi radio of an ESP device more active is to continuously send a ping to it from some other host. (not a |
I think I found a workaround that will not break current integrations for current Domoticz users that have domoticz/out everywhere in more powerful hardware integrations. The answer I think is in here: https://www.domoticz.com/wiki/MQTT#Publish_Topic "If Domoticz needs to send to more topics add a new MQTT Client Gateway for each topic" And under the new (additional) MQTT Client Gateway configuration: Publish Topic:
Under Topic In Prefix:
Now you should be able to configure ESPEasy to only subscribe to /domoticz/out/{idx} or /domoticz/out/{name} without breaking existing integrations and losing MQTT connection. Homeassistant already publishes in the following way, so for espeasy that would be no issues.
|
I think the best way would be for ESPEasy to support those newly added structures. |
I created two additional MQTT brokers in domoticz 1 based on Publish Topic: NAME the other on Publish Topic: IDX and monitored the messages using MQTT Explorer. domoticz/out/Verlichting Keuken (A)
domoticz/out/1973
|
Can you test the latest testbuild to see if it is now fixed? (or at least working better/more stable) |
Checklist
I have...
ESP_Easy_mega-20181001_test_ESP8266_4096_VCC.bin
)Steps already tried...
.bin
files are included in the ZIP)Summarize of the problem/feature request
MQTT disconnects every minute, then immediately reconnects and subscribes to the topics (in this case, domoticz).
ESP publishes all sensor data after MQTT connects, causing the sensor to report every minute - excessive.
Need to make the MQTT timeout user-configurable, I think, or at least make it much longer. Why so short? MQTT can stay connected for hours with no problem, if it sends heartbeats occasionally. It should only disconnect if told to, or if it loses the connection.
System configuration
Wifi AP 3m from the ESP, good signal. Zero wifi reconnects shown on status page.
MQTT broker is mosquitto, server (cabled) on same subnet as ESP.
ESP Easy version:
Replicated also on build 20191208; exactly the same behaviour. Replicated on three different ESP modules.
Rules
Rules on one ESP are minimal - just a timer to reboot if wifi is disconnected for 5 mins. Rules on the other include 7-segment display of sensor values.
Log data
The text was updated successfully, but these errors were encountered: