
MQTT disconnects every minute, re-publishes all data on connect #2840

Open
jefft4 opened this issue Jan 2, 2020 · 117 comments
Labels
Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired

Comments

@jefft4

jefft4 commented Jan 2, 2020

Checklist

  • [x] The title of this issue is "to the point" and descriptive.
  • [x] This issue describes what is happening.
  • [x] This issue describes what components are affected (e.g. name of plugin/controller)
  • [x] This issue describes how to reproduce it.
  • [x] This issue describes when it was introduced (when known) and what version is now showing the problem.

I have...

  • [x] searched the issue tracker or the forum for a similar issue. (include links when applicable)
  • [x] entered a system description using "Copy info to clipboard" on the sysinfo page. (when possible)
  • [x] entered the full filename of the used version (e.g. ESP_Easy_mega-20181001_test_ESP8266_4096_VCC.bin )
  • [n/a] given a list of active plugins or controllers when applicable.
  • [x] filled out all applicable fields below.

Steps already tried...

  • [x] Tried a clean install (empty .bin files are included in the ZIP)
  • [x] Tested previous/other build (mention which one already tested)
  • [x] Tested on other node to make sure hardware isn't defective.
  • [x] Verified if the problem is limited to a single plugin/controller

Summary of the problem/feature request

MQTT disconnects every minute, then immediately reconnects and subscribes to the topics (in this case, domoticz).
ESP publishes all sensor data after MQTT connects, causing the sensor to report every minute - excessive.

Need to make the MQTT timeout user-configurable, I think, or at least make it much longer. Why so short? MQTT can stay connected for hours with no problem, if it sends heartbeats occasionally. It should only disconnect if told to, or if it loses the connection.

System configuration

Wifi AP 3m from the ESP, good signal. Zero wifi reconnects shown on status page.
MQTT broker is mosquitto, server (cabled) on same subnet as ESP.

ESP Easy version:

Build: 20104 - Mega
System Libraries: ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support
Git Build: mega-20191103
Plugins: 46 [Normal]
Build Time: Nov 3 2019 03:18:10
Binary Filename: ESP_Easy_mega-20191103_normal_ESP8266_4M1M.bin

Replicated also on build 20191208; exactly the same behaviour. Replicated on three different ESP modules.

Rules

Rules on one ESP are minimal - just a timer to reboot if wifi is disconnected for 5 mins. Rules on the other include 7-segment display of sensor values.

Log data

15371459: EVENT: Rules#Timer=1
15371671: ACT  : 7dt,                
15371677: 7DGT : Show Temperature=0.00
15371691: ACT  : TaskValueSet,12,1,4+1             
15371703: Command: taskvalueset
15371716: ACT  : TaskValueSet,12,1,1
15371728: Command: taskvalueset
15371736: ACT  : timerSet,1,5
15371748: Command: timerset
15377463: EVENT: Rules#Timer=1
15377554: ACT  : 7dt,23.3             
15377560: 7DGT : Show Temperature=23.30
15377696: ACT  : TaskValueSet,12,1,1+1             
15377709: Command: taskvalueset
15377727: ACT  : timerSet,1,5
15377738: Command: timerset
15377778: MQTT : Connection lost, state: Connection timeout
15377779: EVENT: MQTT#Disconnected
15377896: MQTT : Connected to broker with client ID: GymClimate_12
15377897: Subscribed to: domoticz/out
15377900: EVENT: MQTT#Connected
15378123: DHT  : Temperature: 25.10
15378123: DHT  : Humidity: 42.20
15378127: EVENT: DHT#Temperature=23.34
15378205: EVENT: DHT#Humidity=42.20
15378306:  Domoticz: Sensortype: 2 idx: 15 values: 23.3;42.2;1
15379749: DS   : Temperature: 32.44 (28-ff-c4-43-91-16-4-b9)
15379752: EVENT: DA#Temperature=32.44
15379853:  Domoticz: Sensortype: 1 idx: 13 values: 32.4
15379885: DS   : Temperature: 36.63 (28-ff-83-a6-c4-17-4-43)
15379888: EVENT: DB#Temperature=36.63
15379988:  Domoticz: Sensortype: 1 idx: 14 values: 36.6
15383459: EVENT: Rules#Timer=1
15383612: ACT  : 7dn,42.2          
15383617: 7DGT : Show Number=42
15383754: ACT  : TaskValueSet,12,1,2+1             
15383769: Command: taskvalueset
15383845: ACT  : timerSet,1,5
15383857: Command: timerset
15388459: EVENT: Clock#Time=Thu,11:58
15389459: EVENT: Rules#Timer=1
15389663: ACT  : 7dt,32.4            
15389668: 7DGT : Show Temperature=32.40
15389751: ACT  : TaskValueSet,12,1,3+1             
15389764: Command: taskvalueset
15389809: ACT  : timerSet,1,5
15389823: Command: timerset
15392730: WD   : Uptime 257 ConnectFailures 0 FreeMem 18104 WiFiStatus 3
15395466: EVENT: Rules#Timer=1
15395739: ACT  : 7dt,36.6            
15395744: 7DGT : Show Temperature=36.60
15395787: ACT  : TaskValueSet,12,1,4+1             
15395800: Command: taskvalueset
15395839: ACT  : TaskValueSet,12,1,1
15395852: Command: taskvalueset
15395860: ACT  : timerSet,1,5
15395872: Command: timerset
15401459: EVENT: Rules#Timer=1
15401559: ACT  : 7dt,23.3             
15401565: 7DGT : Show Temperature=23.30
15401747: ACT  : TaskValueSet,12,1,1+1             
15401760: Command: taskvalueset
15401805: ACT  : timerSet,1,5
15401818: Command: timerset
15407459: EVENT: Rules#Timer=1
15407610: ACT  : 7dn,42.2          
15407616: 7DGT : Show Number=42
15407749: ACT  : TaskValueSet,12,1,2+1             
15407762: Command: taskvalueset
15407807: ACT  : timerSet,1,5
15407819: Command: timerset
15413463: EVENT: Rules#Timer=1
15413664: ACT  : 7dt,32.4            
15413670: 7DGT : Show Temperature=32.40
15413759: ACT  : TaskValueSet,12,1,3+1             
15413772: Command: taskvalueset
15413819: ACT  : timerSet,1,5
15413831: Command: timerset
15419459: EVENT: Rules#Timer=1
15419705: ACT  : 7dt,36.6            
15419710: 7DGT : Show Temperature=36.60
15419752: ACT  : TaskValueSet,12,1,4+1             
15419765: Command: taskvalueset
15419806: ACT  : TaskValueSet,12,1,1
15419818: Command: taskvalueset
15419827: ACT  : timerSet,1,5
15419840: Command: timerset
15421163: Dummy: value 1: 1.00
15421163: Dummy: value 2: 0.00
15421163: Dummy: value 3: 0.00
15421163: Dummy: value 4: 0.00
15421165: EVENT: MyVars#cycle=1.00
15421243: EVENT: MyVars#=0.00
15421321: EVENT: MyVars#=0.00
15421399: EVENT: MyVars#=0.00
15422729: WD   : Uptime 257 ConnectFailures 0 FreeMem 19064 WiFiStatus 3
15425459: EVENT: Rules#Timer=1
15425552: ACT  : 7dt,23.3             
15425558: 7DGT : Show Temperature=23.30
15425727: ACT  : TaskValueSet,12,1,1+1             
15425739: Command: taskvalueset
15425757: ACT  : timerSet,1,5
15425768: Command: timerset
15431459: EVENT: Rules#Timer=1
15431589: ACT  : 7dn,42.2          
15431594: 7DGT : Show Number=42
15431693: ACT  : TaskValueSet,12,1,2+1             
15431706: Command: taskvalueset
15431725: ACT  : timerSet,1,5
15431737: Command: timerset
15437459: EVENT: Rules#Timer=1
15437630: ACT  : 7dt,32.4            
15437635: 7DGT : Show Temperature=32.40
15437720: ACT  : TaskValueSet,12,1,3+1             
15437734: Command: taskvalueset
15437752: ACT  : timerSet,1,5
15437764: Command: timerset
15440916: MQTT : Connection lost, state: Connection timeout
15440917: EVENT: MQTT#Disconnected
15441032: MQTT : Connected to broker with client ID: GymClimate_12
15441034: Subscribed to: domoticz/out
15441036: EVENT: MQTT#Connected
15441260: DHT  : Temperature: 25.10
15441260: DHT  : Humidity: 42.20
15441264: EVENT: DHT#Temperature=23.34
15441342: EVENT: DHT#Humidity=42.20
15441442:  Domoticz: Sensortype: 2 idx: 15 values: 23.3;42.2;1
@TD-er
Member

TD-er commented Jan 2, 2020

The connection timeout reported is probably the timeout of the broker, not the ESP side.
In the logs you showed, the time between reconnects is about 63 seconds.
So maybe the keep-alive is not sent often enough, or the broker has a timeout which is too short?
What keep-alive (or whatever it is called) time do you use in Mosquitto?
I think we use 10 or 15 seconds as the timeout in PubSubClient.

@jefft4
Author

jefft4 commented Jan 2, 2020

As far as I understand mosquitto & MQTT, the client decides the keepalive timeout. The mosquitto broker appears able to put a limit on it, in MQTT v5+ only, but not to fix it. From what I can find, its default timeout seems to be 60 seconds; that should result in clients being cut off if they've not sent a ping for 90 seconds (spec calls for a +50% tolerance). The client could request any timeout up to 18 hours in its initial connection request.

From the MQTT spec:

When the client establishes a connection to the broker, the client communicates a time interval in seconds to the broker. This interval defines the maximum length of time that the broker and client may not communicate with each other.
...
The broker must disconnect a client that does not send a message or a PINGREQ packet within one and a half times the keep-alive interval. Likewise, the client is expected to close the connection if it does not receive a response from the broker in a reasonable amount of time.
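To make the arithmetic above concrete (a throwaway sketch, not ESPEasy code), the 1.5x rule can be written out; with a 60-second keepalive the broker cuts off a silent client after 90 seconds, while the ~10-second client keepalive mentioned earlier leaves only a 15-second grace period:

```python
def broker_cutoff_seconds(client_keepalive_s: float) -> float:
    """MQTT spec: the broker must disconnect a client that sends no
    message or PINGREQ within 1.5x the keep-alive it requested at CONNECT."""
    return client_keepalive_s * 1.5

# 60 s keep-alive -> the 90 s broker limit discussed above.
print(broker_cutoff_seconds(60))  # 90.0
# A 10 s client keep-alive leaves only a 15 s grace period.
print(broker_cutoff_seconds(10))  # 15.0
```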

Here's a snippet from my mosquitto log file; interesting to see what the broker is seeing:

1577946354: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946354: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946416: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946416: New connection from 192.168.8.174 on port 1883.
1577946416: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946416: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946458: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946458: New connection from 192.168.8.174 on port 1883.
1577946458: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946458: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946479: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
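Those epoch timestamps can be diffed mechanically to see how long each connection survives; a throwaway sketch with the values hard-coded from the snippet above:

```python
# "New client connected" and "Socket error" epoch timestamps
# copied from the mosquitto log snippet above.
connected = [1577946354, 1577946416, 1577946458]
socket_error = [1577946416, 1577946458, 1577946479]

# Seconds each connection survived before the broker saw the socket error.
lifetimes = [err - conn for conn, err in zip(connected, socket_error)]
print(lifetimes)  # [62, 42, 21]
```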

The socket error is interesting - suggests that the ESP has either dropped the connection or had some other problem with it. It seems to happen with this ESP, 42 seconds after the connection is first made.

I'm not sure what's going on here. If the broker timeout is 60s, the client end is 10s and it's the client's responsibility to initiate the ping, then we should never get anywhere close to the 90s broker limit. Maybe the broker is trying to send the ping response message when it hits that socket error.

If you can think of how to get any more information about what's going on at the ESP end to cause that socket error, I'm happy to try some more debugging...

@TD-er
Member

TD-er commented Jan 2, 2020

Can you keep sending a ping from any host on the network to this ESP?
It looks like the ESP is not receiving a reply from the broker or not replying to some request from the broker and thus is considered to be offline.

@jefft4
Author

jefft4 commented Jan 3, 2020

I set a 2-second repeating ICMP ping running from the server to that ESP and the MQTT problems continued; the ICMP ping was fine, no packets dropped at all. Also, I have the web UI of the ESP open while I'm doing this, watching its log, which is working fine. It looks like it's not a generic network issue, rather something MQTT-specific.

Sending an MQTT PINGREQ from anything else won't be possible, as MQTT is client-to-server. I will try subscribing the ESP to some random topic and using mosquitto_pub to spam that topic every 5 seconds, to see if that keeps it alive.

How does (or can?) one convert the ESP log timestamps into real times? If I could line up the broker and ESP logs, we might be able to see which end disconnects first (I suspect the ESP, as it says 'timeout', but I think we need to confirm that).
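On the timestamp question: the ESP log prefixes look like milliseconds since boot, so a single event visible in both logs (say, a disconnect) is enough to anchor them; everything else is just an offset. A hedged sketch, where the anchor pair below is purely hypothetical:

```python
from datetime import datetime, timezone

def esp_millis_to_epoch(esp_ms: int, anchor_esp_ms: int, anchor_epoch_s: float) -> float:
    """Convert an ESP 'millis since boot' timestamp to Unix time, using one
    event observed in both the ESP log and the broker log as the anchor."""
    return anchor_epoch_s + (esp_ms - anchor_esp_ms) / 1000.0

# Hypothetical anchor: pretend the ESP's 'Connection lost' at 15377778 ms
# lines up with a broker socket error at epoch 1577946416.
t = esp_millis_to_epoch(15440916, anchor_esp_ms=15377778, anchor_epoch_s=1577946416)
print(datetime.fromtimestamp(t, tz=timezone.utc).isoformat())
```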

@jefft4
Author

jefft4 commented Jan 3, 2020

Thought I'd try it from the ESP side first...

  • I set up a fresh ESP. Same disconnect behaviour right away when I set up the controller, even with no devices defined.
  • I added an ESP rule to publish on a nonsense MQTT topic every 5 seconds. It's been running for several minutes now, with no disconnects.
  • With the publish timer at 10 seconds, the disconnection problem reappears; 9 seconds is ok.
    This is starting to feel like it's connected with the ESP's client timeout value of 10 seconds!
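For reference, the keep-alive rule described above can be sketched roughly like this in ESPEasy rules (the topic, timer number, and 9-second interval here are illustrative placeholders, not copied from the actual config):

```
On System#Boot Do
  TimerSet,2,9
Endon

On Rules#Timer=2 Do
  Publish,%sysname%/keepalive,1
  TimerSet,2,9
Endon
```

Any interval below the 10-second client keepalive should do; the point is just that the ESP itself publishes something often enough.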

@jefft4
Author

jefft4 commented Jan 3, 2020

Re-flashed the ESP with the 20191208 test binary to get the MQTTImport plug-in.
Interestingly, I see this in mosquitto's log with the test version - looks like k10 might indicate the client asked for a keepalive of 10s:
New client connected from 192.168.8.66 as TESTESP-Import (c0, k10)

With MQTTimport configured, but nothing publishing its topic, we get disconnections roughly every 30 seconds! If I spam the MQTTimport topic from another machine, the ESP is still disconnecting every 30s, even with spamming every 5s.

Analysis:

  • If the ESP isn't sending a real message every 10 seconds or less, we get disconnections.
  • Sending messages to the ESP doesn't help; the ESP has to be the one sending.
    Thought:
  • Is the ESP not in fact sending PINGREQ properly? That would create this situation, as it would mean no traffic from the ESP for >10s.
    Next step:
  • I'll put mosquitto into debug mode and look at the full traffic including PINGREQ. I'll have to set up mosquitto on a test system to do that, so will report back later...

@jefft4
Author

jefft4 commented Jan 3, 2020

[Sorry if you get repeated notifications of edits... trying not to make the thread any longer than necessary!]

Test broker worked fine, no disconnections and the PINGREQ/PINGRESP pairs are there. In the end, I had to run mosquitto in the foreground on the live system with debugging on; not pleasant, but I think it may have exposed what's causing the problems.

On my live server, I saw the socket errors; they were coming after a barrage of domoticz/out messages being sent to the ESP - sometimes as many as 20 or 30 messages in a couple of seconds. I think what's happening is that the ESP is simply being swamped, and either the MQTT client crashes out or stops responding, causing the server to see the ESP as unavailable.

I've found a workaround, I think: if the ESP doesn't need to subscribe, set its subscription topic to 'notinterested'. This needs a restart of mosquitto, as the ESP doesn't seem to unsubscribe and disconnect cleanly when its config is changed; otherwise the broker remembers its old subscription. Fortunately, this applies to most of my ESPs :-)

Does the 'swamped' hypothesis sound correct, TD-er? If so, is there anything we can do with the ESPEasy controller parameters (timeout, queue, etc) to lessen its impact for ESPs that do need to subscribe to domoticz's spammy output?

I've also suggested to the domoticz developers that the protocol needs to be changed to include the 'idx' in the MQTT topic - e.g. 'domoticz/out/999' - so that devices have the option to subscribe more precisely and reduce network traffic.

@giig1967g
Contributor

Latest versions of ESPEasy are very stable in my opinion.
I have been using version 20191104 in my production ESP units (10 units so far) for 37 days with MQTT and Mosquitto.
In one unit I use 3 instances of MQTT Import without any problem.
In the last 37 days I have never had a disconnection issue, not a single one.
(I have a rule that creates an alert if an ESP unit reconnects.)

See screenshots:
[three screenshots omitted]

@jefft4
Author

jefft4 commented Jan 3, 2020

You don't appear to be using domoticz. Like I said, the issue appears (subject to confirmation from TD-er, who knows more than either of us) to stem from the ESP's inability to cope with the volume of MQTT traffic created by domoticz's broadcast approach.

@TD-er
Member

TD-er commented Jan 3, 2020

Well I am using Domoticz with MQTT on my nodes.
[screenshot omitted]
Obviously this one is not running the latest builds, but it is running a self-made build I made on Sept 24.

So you could try a build from around that time, just to make sure I didn't break anything recently.

But I do think your Mosquitto may be using a shorter timeout.
It is also possible the node gets swamped by traffic, but this one was also running with about 15 nodes all subscribing to the same MQTT server.
I've also run a few tests every now and then to see if the nodes can handle very short "time between messages" and other stuff to flood the network.
Having the log level set to debug or "debug more" may also put a strain on the node, as that results in quite a few verbose log lines.
The serial log and webserver log both use their own log buffer, which may put quite some strain on resources.
Also, I have the feeling that frequent reloads, which the log viewer does when there are lots of logs, sometimes trigger a crash or at least keep some resources allocated.

@jefft4
Author

jefft4 commented Jan 3, 2020

Interesting; potentially a number of things combining to overload the ESP, then. My 'spammer' script caused the ESP to drop when it was sending around 25-30 messages in a second. They weren't huge messages, about 50 bytes each. It didn't have to be sustained; one second like that was enough to break things. The MQTT keepalive timeout, FYI, is set by the client; mosquitto has no control over it.

I do have a lot of sensors - the screenshot below is just the temperature page; the utility page is even longer - so high message volume is quite likely if several reporting intervals coincide. Some of my ESPs are publishing 4 or 5 sensors each time; I guess I could offset the reporting intervals of each sensor by a few seconds, which might help there.

I changed all my ESPs to subscribe to a dummy topic where possible and they're happy little things now; with less in the log, I then spotted that devices running Tasmota were suffering in the same way... so the problem is something that ESPEasy and Tasmota have in common (which I understand is quite a lot) - buffer space is definitely plausible, either at the MQTT or the TCP level.

[screenshot omitted]

I'll pester the domoticz devs a bit about their spammy messaging approach; I think it's inefficient on large systems and unwise for small-CPU IoT devices, so I'd like them to think about some small improvements.

Do you want to close this issue, or keep it open for investigation of what's breaking at the ESP end and whether the performance could degrade somehow more gracefully?

For what it's worth, I think a feature request to make the keepalive timeout configurable would be useful; the issue of the ESP not unsubscribing when its MQTT config is changed also needs a fix (though low-priority). I have a couple of other ideas, which I'll raise if they've not already been raised... and I'll send a few coffees over via Patreon!

@TD-er
Member

TD-er commented Jan 3, 2020

Well, the subscribing itself should not really be an issue, as the subscriptions are not causing a lot of memory allocations.
It keeps the node busy, I guess, but that should not cause these issues.
Like I said, it may cause excessive memory usage when you have the log levels set high.

Tasmota is using our version of PubSubClient, so there should not be a lot of difference there.

@Delfuego-NL

I had the same issue. I changed the MQTT settings in Domoticz to: Hierarchical - publish outgoing messages on topic domoticz/out/${floorplan name}/${plan name}.
After that I put the ESP in a separate room in Domoticz and subscribed the controller on the ESP to that specific topic. After that, the Lost connection events did not happen anymore. Maybe this helps?!

@TD-er
Member

TD-er commented Jan 20, 2021

Hmm, I did not know about this feature of Domoticz.
Makes sense to have it.

What do you use as floor plan name and plan name?
Do you use variables in the topic of ESPEasy, or is it hard-coded per unit?

@Delfuego-NL

I don't think it matters what names you use. Just create a room/plan (I used "Tuin", Dutch for garden) and select the devices you want in that room/plan, then create a floor plan (I used "Lunar") and put that room in your floor plan. That's all there is to it.
In ESPEasy, under Controllers, select Domoticz MQTT. At Controller Subscribe, fill in: domoticz/out/$Lunar/$Tuin
Of course, in Domoticz you must choose Hierarchical, as described in my previous post.

Interestingly, when you choose Combined in Domoticz, the disconnections appear again. I had not expected this.

@TD-er
Member

TD-er commented Jan 20, 2021

Hmm, that's strange indeed.
I will have a look at it, as I'm also curious about this (for me) new feature of Domoticz.

I have not seen disconnects, but it may have to do with the ESP subscribing to the same topic as all other nodes do, which may cause a lot of traffic.
When this traffic reaches some kind of threshold, it may cause the ESP to run out of resources or time to process the WiFi traffic, causing the disconnect.

@TD-er TD-er added Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired labels Jan 20, 2021
@Delfuego-NL

Perhaps it is also worth adding that I use 38 Shelly devices that all communicate via MQTT. Each device has more than one IDX. A total of 185 IDXs are shouting on the MQTT channel. I can imagine a very compact ESP8266 device having trouble evaluating all those messages.

@TD-er
Member

TD-er commented Jan 20, 2021

Do you also have logging enabled to either a log server or serial (or both) ?
And if so, what log levels?

@Delfuego-NL

I'm sorry, I don't have any logging enabled. The only log I use is the standard log in the Web-UI. This ESP is my first and only one, and I've only had it up and running for three days now.

I'm happy to provide some log files, but I have no clue how to set one up (a log server).

@TD-er
Member

TD-er commented Jan 20, 2021

Nope, that's fine.
It is just that generating logs may also use resources and thus push the ESP into instability a bit sooner.

@H4nsie

H4nsie commented Jul 11, 2023

Had the same connect/reconnects of MQTT and Domoticz. I changed the setting of the MQTT gateway in Domoticz from topic 'Flat' to topic 'Index (with retain)' and there are no more reconnects now.

[screenshot omitted]

Thanks!

@TD-er
Member

TD-er commented Jul 11, 2023

OK, it seems we should update the Domoticz controller(s) to support the new hierarchical structures.

However, I do not understand why you can now use "Index", since ESPEasy will not publish messages like this.
The only reason I can think of why it may experience disconnects when using the "Flat" structure is that it may cause a lot of incoming messages to each node, as there is a lot of traffic to the same topics.
If the node is already on a poor WiFi connection, this may cause all kinds of issues.

@H4nsie

H4nsie commented Jul 11, 2023

That setting is for publishing Domoticz's outgoing messages only, so published messages from the ESP to Domoticz will keep working.

The solution only worked for one ESP. My 'ESP_Easy_WATERMETER_1' has a stable connection; my 'ESP_Rfid_reader_2' is still reconnecting approx. every 20 seconds. Both ESPs have (almost) the same config.

mosquitto.log

2023-07-11T11:41:53: New client connected from 192.168.10.223:56787 as ESP_Easy_WATERMETER_1 (p2, c0, k10).
2023-07-11T11:42:01: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:01: New connection from 192.168.10.220:55688 on port 1883.
2023-07-11T11:42:01: New client connected from 192.168.10.220:55688 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:39: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:39: New connection from 192.168.10.220:54374 on port 1883.
2023-07-11T11:42:39: New client connected from 192.168.10.220:54374 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:59: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:59: New connection from 192.168.10.220:55002 on port 1883.
2023-07-11T11:42:59: New client connected from 192.168.10.220:55002 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:20: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:43:20: New connection from 192.168.10.220:55444 on port 1883.
2023-07-11T11:43:20: New client connected from 192.168.10.220:55444 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:43: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:10: New connection from 192.168.10.220:52634 on port 1883.
2023-07-11T11:44:10: New client connected from 192.168.10.220:52634 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:44:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:48: New connection from 192.168.10.220:65239 on port 1883.
2023-07-11T11:44:48: New client connected from 192.168.10.220:65239 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:08: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:08: New connection from 192.168.10.220:59602 on port 1883.
2023-07-11T11:45:08: New client connected from 192.168.10.220:59602 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:28: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:28: New connection from 192.168.10.220:57266 on port 1883.
2023-07-11T11:45:28: New client connected from 192.168.10.220:57266 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:48: New connection from 192.168.10.220:64739 on port 1883.
2023-07-11T11:45:48: New client connected from 192.168.10.220:64739 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:46:08: Client ESP_Rfid_reader_2 closed its connection.

On the manpage I read 'Only the first controller can be a MQTT capable controller.' What is meant by 'controller'? The configured controller in the ESP? Or the node number of the ESP? (Since my 'ESP_Easy_WATERMETER_1' is node 1 and is working...)

@TD-er
Member

TD-er commented Jul 11, 2023

The controller in ESPEasy.
You can have several controllers active, like P2P, some HTTP controller, etc.
We have several MQTT controllers (Domoticz MQTT, OpenHAB MQTT, and some more).
Of those MQTT controllers, you can only have one enabled, since running multiple instances of PubSubClient (the library we use to connect to an MQTT broker) causes issues.

@TD-er
Member

TD-er commented Jul 11, 2023

It seems the ESP loses its connection every 20 seconds.
Is it possible you have some keep-alive setting in your MQTT broker set to 20 seconds or less?

@H4nsie

H4nsie commented Jul 11, 2023

'Almost same config', but different builds...

ESP_Easy_WATERMETER (working correct) : Build: [ESP_Easy_mega_20230409_normal_ESP8266_4M1M Apr 9 2023](https://github.com/letscontrolit/ESPEasy/releases)
ESP_Rfid_reader (reconnects) : Build: [ESP_Easy_mega_20230623_normal_ESP8266_4M1M Jun 23 2023](https://github.com/letscontrolit/ESPEasy/releases)

@TD-er
Member

TD-er commented Jul 11, 2023

What is the Client Timeout in your controller settings on ESPEasy?

Does this behavior change when you start sending pings from any machine on your network to the ESPEasy node that reconnects?

Can you enable TimingStats on the Tools->Advanced page of that reconnecting node, let it run for a few minutes, and then visit the TimingStats page (button on the Tools page)?
N.B. on every load of that TimingStats page, the stats are cleared.

@sincze

sincze commented Mar 13, 2024

What interests me most is why only ESPEasy is affected, and not the other devices (like Tasmota).
Log from 00:00 - 09:00 (publish timer at 9 sec):

1710316040: New client connected from 192.168.*.*:53556 as SHELLY_P1_B1D3FA (p2, c1, k30, u'DVES_USER').
1710316194: New connection from 192.168.*.*:33481 on port 1883.
1710316194: New client connected from 192.168.*.*:33481 as auto-7CB139D0-AAC7-A83C-1A7B-37302542CC96 (p2, c1, k60).
1710316194: Client auto-7CB139D0-AAC7-A83C-1A7B-37302542CC96 closed its connection.
1710316282: Client Watermeter closed its connection.
1710316282: New connection from 192.168.2.25:52151 on port 1883.
1710316282: New client connected from 192.168.2.25:52151 as Watermeter (p2, c0, k10).
1710316673: Client SHELLY_P1_B1D3FA closed its connection.
1710316674: New connection from 192.168.*.*:50465 on port 1883.
1710316674: New client connected from 192.168.*.*:50465 as SHELLY_P1_B1D3FA (p2, c1, k30, u'DVES_USER').

Log from 09:00 - xx (publish timer at 30 sec)

1710316040: New client connected from 192.168.*.*:53556 as SHELLY_P1_B1D3FA (p2, c1, k30, u'DVES_USER').
1710316194: New connection from 192.168.*.*:33481 on port 1883.
1710316194: New client connected from 192.168.*.*:33481 as auto-7CB139D0-AAC7-A83C-1A7B-37302542CC96 (p2, c1, k60).
1710316194: Client auto-7CB139D0-AAC7-A83C-1A7B-37302542CC96 closed its connection.
1710316282: Client Watermeter closed its connection.
1710316282: New connection from 192.168.2.25:52151 on port 1883.
1710316282: New client connected from 192.168.2.25:52151 as Watermeter (p2, c0, k10).
1710316673: Client SHELLY_P1_B1D3FA closed its connection.
1710316674: New connection from 192.168.*.*:50465 on port 1883.
1710316674: New client connected from 192.168.*.*:50465 as SHELLY_P1_B1D3FA (p2, c1, k30, u'DVES_USER').
1710316791: Client Jablotron closed its connection.
1710316791: New connection from 192.168.*.*:55779 on port 1883.
1710316791: New client connected from 192.168.*.*:55779 as Jablotron (p2, c0, k10).
1710316942: Client Jablotron closed its connection.
1710316942: New connection from 192.168.*.*:53805 on port 1883.
1710316942: New client connected from 192.168.*.*:53805 as Jablotron (p2, c0, k10).
1710316992: Saving in-memory database to /mosquitto/data//mosquitto.db.
1710317002: Client Jablotron closed its connection.
1710317002: New connection from 192.168.*.*:61543 on port 1883.
1710317002: New client connected from 192.168.*.*:61543 as Jablotron (p2, c0, k10).
1710317087: Client Watermeter closed its connection.
1710317087: New connection from 192.168.*.*:58307 on port 1883.
1710317087: New client connected from 192.168.*.*:58307 as Watermeter (p2, c0, k10).
1710317141: Client Jablotron closed its connection.
1710317142: New connection from 192.168.*.*:61544 on port 1883.
1710317142: New client connected from 192.168.*.*:61544 as Jablotron (p2, c0, k10).

@TD-er
Member

TD-er commented Mar 13, 2024

Maybe Tasmota is using a shorter keep-alive setting?

@sincze

sincze commented Mar 13, 2024

Maybe Tasmota is using a shorter keep-alive setting?

Mmm 15?

https://github.com/arendst/Tasmota/blob/26a3eacbd63c04d05dbabd83a2189b2da39b5f11/lib/default/pubsubclient-2.8.13/src/PubSubClient.h



// MQTT_KEEPALIVE : keepAlive interval in Seconds. Override with setKeepAlive()
#ifndef MQTT_KEEPALIVE
#define MQTT_KEEPALIVE 15
#endif

// MQTT_SOCKET_TIMEOUT: socket timeout interval in Seconds. Override with setSocketTimeout()
#ifndef MQTT_SOCKET_TIMEOUT
#define MQTT_SOCKET_TIMEOUT 15
#endif

v2.8 Tasmota
https://github.com/arendst/Tasmota/blob/26a3eacbd63c04d05dbabd83a2189b2da39b5f11/lib/default/pubsubclient-2.8.13/CHANGES.txt

  • Add setBufferSize() to override MQTT_MAX_PACKET_SIZE
  • Add setKeepAlive() to override MQTT_KEEPALIVE
  • Add setSocketTimeout() to override MQTT_SOCKET_TIMEOUT
  • Added check to prevent subscribe/unsubscribe to empty topics
  • Declare wifi mode prior to connect in ESP example
  • Use strnlen to avoid overruns
  • Support pre-connected Client objects

v2.7 ESPEasy
https://github.com/letscontrolit/ESPEasy/blob/8cf90505608dd0e79299f55457e6252dec2998fe/lib/pubsubclient/CHANGES.txt

@tonhuisman
Contributor

It might be interesting to follow this PubSubClient issue: knolleary/pubsubclient#1045 🤔

@sincze

sincze commented Mar 13, 2024

I noticed a lot of traffic before I see the reconnect (domoticz/out).

A "TCP window full" condition could mean that the receiving end of the TCP connection (ESPEasy) has filled its buffer and cannot accept any more data at the moment. This typically occurs when the receiver is unable to process data quickly enough, causing its buffer to fill up.

ESPEasy is responding with a "TCP ZeroWindow", which means that its buffer for incoming data is completely full and it cannot accept any more data at the moment.

No clue what the link would be with domoticz/out.

2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (392 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:36: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (450 bytes))
2024-03-13T19:14:37: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (330 bytes))
2024-03-13T19:14:37: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (334 bytes))
2024-03-13T19:14:38: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (279 bytes))
2024-03-13T19:14:39: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (319 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (392 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (286 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (280 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (281 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (450 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (364 bytes))
2024-03-13T19:14:41: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (334 bytes))
2024-03-13T19:14:42: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (330 bytes))
2024-03-13T19:14:42: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (293 bytes))

@TD-er
Member

TD-er commented Mar 13, 2024

Maybe related to some QoS flags?
MQTT QoS flags define guarantees about subscribed clients receiving published messages.
I could have a look to see if something has recently changed in the controller queue handling, where it might try to push out messages rather than processing received messages.
Do you also subscribe to topics to which your devices also publish?
Or only to topic(s) where Domoticz publishes to?

Maybe Domoticz recently changed its QoS flags when sending messages?

Maybe there is more memory reserved on the 'working' docker image for delivering messages to subscribers?
Maybe saving the state every 30 seconds is a bit too much? (in Mosquitto)

@sincze

sincze commented Mar 13, 2024

Maybe related to some QoS flags?
MQTT QoS flags define guarantees about subscribed clients receiving published messages.
I could have a look to see if something has recently changed in the controller queue handling, where it might try to push out messages rather than processing received messages.
Do you also subscribe to topics to which your devices also publish?
Or only to topic(s) where Domoticz publishes to?

Maybe Domoticz recently changed its QoS flags when sending messages?

Maybe there is more memory reserved on the 'working' docker image for delivering messages to subscribers?
Maybe saving the state every 30 seconds is a bit too much? (in Mosquitto)

The working docker has no active domoticz attached to it :)

But the question is: why would my ESPEasy (Jablotron) be interested in domoticz/out?

It only publishes to different topics and domoticz/in

@TD-er
Member

TD-er commented Mar 13, 2024

The Domoticz MQTT controller subscribes to it so it can receive messages sent by Domoticz.

However, it was a rather poor scaling design decision of Domoticz, as all nodes publish to the same topic and subscribe to the same topic.

@sincze

sincze commented Mar 13, 2024

The Domoticz MQTT controller subscribes to it so it can receive messages sent by Domoticz.

However, it was a rather poor scaling design decision of Domoticz, as all nodes publish to the same topic and subscribe to the same topic.

You mean the MQTT Controller in ESPEasy?
I am using the OpenHAB one.

I used to use the Domoticz one in ESPEasy, but with my rules I stopped using it.

@sincze

sincze commented Mar 13, 2024


If my assumption is right... there is a dormant subscription to domoticz/out...

I again added a Domoticz MQTT Controller and added domoticz2/out to subscribe to.

Disabled the OpenHab MQTT.
Rebooted the ESP.

Now it should continue to work....

@TD-er
Member

TD-er commented Mar 13, 2024

Or you can check the checkbox "Unique Client ID on Reconnect:" or if your setup doesn't allow for unique client IDs, you could try "Clean Session:"

@sincze

sincze commented Mar 13, 2024

The theory still stands. No timeouts after modification. (The mod is to re-add the Domoticz MQTT controller with the non-existent topic.)

So the issue could be that:

  • ESPEasy with the FLAT domoticz/out topic gets too many messages (stated elsewhere on the forum already; the idea was to change the setting from FLAT to INDEX, however that will not work for most of us)

  • If the Domoticz MQTT controller is REMOVED from the ESPEasy configuration (as I did when I moved to the OpenHab MQTT), it still continues to receive messages it was once subscribed to (it did not unsubscribe), so it is still causing the issues

Log after modification! No Timeouts

942720: EVENT: Rules#Timer=1,1
942749: ACT : TimerSet,1,30
942755: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
961665: EVENT: Jablotron_RSSI#rssi=-59.00
963969: WD : Uptime 16 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
972750: EVENT: Rules#Timer=1,1
972779: ACT : TimerSet,1,30
972786: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-59.00'}'
982414: EVENT: Clock#Time=Wed,23:08
993969: WD : Uptime 17 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1002781: EVENT: Rules#Timer=1,1
1002813: ACT : TimerSet,1,30
1002819: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-59.00'}'
1021668: EVENT: Jablotron_RSSI#rssi=-61.00
1023969: WD : Uptime 17 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1032814: EVENT: Rules#Timer=1,1
1032843: ACT : TimerSet,1,30
1032850: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1042414: EVENT: Clock#Time=Wed,23:09
1053969: WD : Uptime 18 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1062845: EVENT: Rules#Timer=1,1
1062877: ACT : TimerSet,1,30
1062883: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1081666: EVENT: Jablotron_RSSI#rssi=-60.00
1083969: WD : Uptime 18 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1093008: EVENT: Rules#Timer=1,1
1093037: ACT : TimerSet,1,30
1093045: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1102414: EVENT: Clock#Time=Wed,23:10
1113969: WD : Uptime 19 ConnectFailures 0 FreeMem 20448 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1123038: EVENT: Rules#Timer=1,1
1123067: ACT : TimerSet,1,30
1123074: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1126199: ESPEasy console using ESPEasySerial
1141667: EVENT: Jablotron_RSSI#rssi=-60.00
1143969: WD : Uptime 19 ConnectFailures 0 FreeMem 21024 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1153069: EVENT: Rules#Timer=1,1
1153099: ACT : TimerSet,1,30
1153106: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1162414: EVENT: Clock#Time=Wed,23:11
1173969: WD : Uptime 20 ConnectFailures 0 FreeMem 20920 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1183100: EVENT: Rules#Timer=1,1
1183130: ACT : TimerSet,1,30
1183136: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1201664: EVENT: Jablotron_RSSI#rssi=-60.00
1203969: WD : Uptime 20 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1213131: EVENT: Rules#Timer=1,1
1213160: ACT : TimerSet,1,30
1213167: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1222414: EVENT: Clock#Time=Wed,23:12
1233969: WD : Uptime 21 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1243162: EVENT: Rules#Timer=1,1
1243194: ACT : TimerSet,1,30
1243200: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1261664: EVENT: Jablotron_RSSI#rssi=-60.00
1263996: WD : Uptime 21 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1273195: EVENT: Rules#Timer=1,1
1273224: ACT : TimerSet,1,30
1273231: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1282414: EVENT: Clock#Time=Wed,23:13
1293969: WD : Uptime 22 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1303226: EVENT: Rules#Timer=1,1
1303257: ACT : TimerSet,1,30
1303264: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-60.00'}'
1321662: EVENT: Jablotron_RSSI#rssi=-61.00
1323969: WD : Uptime 22 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1333259: EVENT: Rules#Timer=1,1
1333288: ACT : TimerSet,1,30
1333295: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1342414: EVENT: Clock#Time=Wed,23:14
1353969: WD : Uptime 23 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1363290: EVENT: Rules#Timer=1,1
1363322: ACT : TimerSet,1,30
1363329: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1381667: EVENT: Jablotron_RSSI#rssi=-61.00
1383969: WD : Uptime 23 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1393323: EVENT: Rules#Timer=1,1
1393352: ACT : TimerSet,1,30
1393359: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1402521: EVENT: Clock#Time=Wed,23:15
1413969: WD : Uptime 24 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1423354: EVENT: Rules#Timer=1,1
1423385: ACT : TimerSet,1,30
1423392: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-61.00'}'
1441665: EVENT: Jablotron_RSSI#rssi=-59.00
1443970: WD : Uptime 24 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1453387: EVENT: Rules#Timer=1,1
1453417: ACT : TimerSet,1,30
1453423: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-59.00'}'
1462414: EVENT: Clock#Time=Wed,23:16
1473969: WD : Uptime 25 ConnectFailures 0 FreeMem 20632 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
1483418: EVENT: Rules#Timer=1,1
1483449: ACT : TimerSet,1,30
1483456: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-59.00'}'

@TD-er
Member

TD-er commented Mar 13, 2024

And what if you check the checkbox to start with a clean session?
Not sure how you can unsubscribe from a topic you're not subscribed to?
Maybe for each incoming message check whether you are subscribed to it and, if not, actively unsubscribe?
However, I think it should be the broker's responsibility to get rid of subscriptions when a client is disconnected or gets reconnected.

@sincze

sincze commented Mar 13, 2024

At the moment there is no issue anymore.

You want me to:

  • Re-subscribe to domoticz/out through the Domoticz MQTT controller
  • Re-cause the issue
  • Remove the Domoticz MQTT controller (the issue should still be present)
  • Enable the OpenHab MQTT controller + the checkbox to start with a clean session

Don't know how Tasmota deals with it; it has a direct connection with Domoticz as well, but maybe based on IDX.

Why a client should unsubscribe might be because:
MQTT clients need to unsubscribe from a topic when they no longer want to receive messages published to that topic. This allows the MQTT broker to stop sending messages for that topic to the client, reducing network traffic and conserving resources on both the client and broker sides. Additionally, unsubscribing ensures that the client does not receive unwanted messages, improving the efficiency and performance of the MQTT communication.

@TD-er
Member

TD-er commented Mar 13, 2024

But why should you unsubscribe right after a new login?

@sincze

sincze commented Mar 14, 2024

But why should you unsubscribe right after a new login?

I think it might have something to do with the fact that the client can be offline for a while (battery operated, e.g.) and then, when it wakes up, it receives all the messages it 'missed' during its sleep. But hey, I'm not an MQTT expert.
I also noticed that switching MQTT from Domoticz to OpenHab, for example, REQUIRES a REBOOT of ESPEasy to get it working correctly. It claimed it sent all kinds of messages, but those messages were never received by Domoticz. After the reboot the messages came back in.

@TD-er
Member

TD-er commented Mar 14, 2024

OK, that last one may actually be a bug, probably in the message queue.
Do you happen to see any subscribe calls from ESPEasy in your Wireshark captures?
I just want to make sure we're drawing the right conclusions whether the broker keeps the subscribed topics list or ESPEasy may make the subscribe calls.

How do you switch between MQTT controllers?
As all MQTT related function calls essentially take settings from the first enabled MQTT controller, it is very important to first disable one and then enable the other.

@sincze

sincze commented Mar 14, 2024


Ah so:

  • on row 1 I have the MQTT Domoticz
  • on row 2 I have OpenHab.

If you enable both Domoticz and OpenHab, ESPEasy will explain that it can only use one MQTT controller (a red bar; you can't miss that).

So I switch between the two by opening the one I want to disable, going to the bottom, and switching off the Enabled checkbox.

Press Submit. Now the red bar is gone.
To switch to the other one: disable it, press Submit, then open the other one, enable it, and press Submit.

@TD-er
Member

TD-er commented Mar 14, 2024

I think technically it is possible to have both enabled -> Red warning, then disable the other one -> no error/warning shown.

But it may perhaps not clear the MQTT broker connection if you do this.
So that's why I needed to make sure you first disable one and then enable the other.

@sincze

sincze commented Mar 14, 2024

Ok I have the mosquitto log running (tail -f):
The Mosquitto log is flooded with messages like this, besides the regular reconnects:
2024-03-14T16:34:24: Sending PUBLISH to Jablotron (d0, q0, r0, m0, 'domoticz/out', ... (433 bytes))

I Disable the Domoticz MQTT and press Submit.

2024-03-14T16:36:45: Received DISCONNECT from Jablotron
2024-03-14T16:36:45: Client Jablotron disconnected.


I move over to openHab MQTT and enable it.

And Press Submit

2024-03-14T16:38:11: New client connected from 192.168.*.*:65170 as Jablotron (p2, c0, k10).
2024-03-14T16:38:11:    ESP_Easy/Jablotron
2024-03-14T16:38:11: Sending CONNACK to Jablotron (0, 0)
2024-03-14T16:38:12: Received SUBSCRIBE from Jablotron
2024-03-14T16:38:12:    ESP_Easy/Jablotron/in/# (QoS 0)
2024-03-14T16:38:12: Jablotron 0 ESP_Easy/Jablotron/in/#

We notice a workable situation:

2024-03-14T16:38:14: Received PUBLISH from Jablotron (d0, q0, r0, m0, 'domoticz/in', ... (52 bytes))
2024-03-14T16:38:22: Received PINGREQ from Jablotron
2024-03-14T16:38:22: Sending PINGRESP to Jablotron
2024-03-14T16:38:32: Received PINGREQ from Jablotron
2024-03-14T16:38:32: Sending PINGRESP to Jablotron
2024-03-14T16:38:42: Received PINGREQ from Jablotron
2024-03-14T16:38:42: Sending PINGRESP to Jablotron
2024-03-14T16:38:44: Received PUBLISH from Jablotron (d0, q0, r0, m0, 'domoticz/in', ... (52 bytes))
2024-03-14T16:38:52: Received PINGREQ from Jablotron
2024-03-14T16:38:52: Sending PINGRESP to Jablotron
2024-03-14T16:39:02: Received PINGREQ from Jablotron
2024-03-14T16:39:02: Sending PINGRESP to Jablotron
2024-03-14T16:39:13: Received PINGREQ from Jablotron
2024-03-14T16:39:13: Sending PINGRESP to Jablotron
2024-03-14T16:39:15: Received PUBLISH from Jablotron (d0, q0, r0, m0, 'domoticz/in', ... (52 bytes))
2024-03-14T16:39:23: Received PINGREQ from Jablotron
2024-03-14T16:39:23: Sending PINGRESP to Jablotron
2024-03-14T16:39:33: Received PINGREQ from Jablotron
2024-03-14T16:39:33: Sending PINGRESP to Jablotron
2024-03-14T16:39:43: Received PINGREQ from Jablotron
2024-03-14T16:39:43: Sending PINGRESP to Jablotron
2024-03-14T16:39:45: Received PUBLISH from Jablotron (d0, q0, r0, m0, 'domoticz/in', ... (52 bytes))
2024-03-14T16:39:54: Received PINGREQ from Jablotron
2024-03-14T16:39:54: Sending PINGRESP to Jablotron
2024-03-14T16:40:04: Received PINGREQ from Jablotron
2024-03-14T16:40:04: Sending PINGRESP to Jablotron
2024-03-14T16:40:14: Received PINGREQ from Jablotron
2024-03-14T16:40:14: Sending PINGRESP to Jablotron
2024-03-14T16:40:15: Received PUBLISH from Jablotron (d0, q0, r0, m0, 'domoticz/in', ... (52 bytes))

Just a test with unchecked 'Unique Client ID on Reconnect:'

2024-03-14T16:42:08: Client Jablotron disconnected.
2024-03-14T16:42:13: New client connected from 192.168.*.*:65252 as Jablotron (p2, c0, k10).
2024-03-14T16:42:13:    ESP_Easy/Jablotron
2024-03-14T16:42:13: Sending CONNACK to Jablotron (1, 0)
2024-03-14T16:42:13: Received SUBSCRIBE from Jablotron
2024-03-14T16:42:13:    ESP_Easy/Jablotron/in/# (QoS 0)
2024-03-14T16:42:13: Jablotron 0 ESP_Easy/Jablotron/in/#

Works just as well.

Reading through this:
https://www.domoticz.com/forum/viewtopic.php?t=40642

@sincze

sincze commented Mar 14, 2024

With Domoticz MQTT set up this way: no problems.

Maybe this is how Tasmota solved it: arendst/Tasmota#20299 (comment)

Domoticz tends to flood MQTT with all kinds of messages. Most of them are irrelevant for a specific device.

In versions before v12.0.1 this led to a lot of processing before finally determining that the message was not for this specific device.

After v11.1.0 an additional detection was introduced to allow early bail-out of irrelevant Domoticz calls. This detection only allows for a topic as defined in DOMOTICZ_OUT_TOPIC which is default domoticz/out.

But I assume that only works if domoticz/out is not flat but domoticz/out/123

@sincze

sincze commented Mar 15, 2024

Would it be a similar case with this one: #4996 (comment)

@TD-er
Member

TD-er commented Mar 15, 2024

It might be.
But to be honest, I have no idea what specific change may cause the MQTT incoming messages to be handled more slowly.
As you have demonstrated, your issue is about ESPEasy not responding in due time (or the broker not receiving ESPEasy's response or keep-alive messages) and right now I really don't know where this may have changed.
Problem is that since October 2023 I also merged the code to support the latest ESP-IDF5.1 (to support all new ESP32-variants) which is code I have been working on for over 6 months at that time.
So this may have been introduced in the past 10 months, across several hundreds of commits.

@sincze

sincze commented Mar 15, 2024

Please let me know what I can do. I hope to show I've been testing actively to at least contribute something.

I'm thinking it might become a bigger issue again if everything moves to autodiscovery as in homeassistant/#!

For Domoticz autodiscovery can be done on multiple different topics to differentiate between specific hardware (milight / espeasy / tasmota e.g.).

Homeassistant only supports one topic for that, where it also processes device info.
My dirty trick to have Domoticz & Homeassistant with autodiscovery is like this.

@TD-er
Member

TD-er commented Mar 15, 2024

The problem isn't about the amount of subscribed topics as that's sent only once and dealt with by the broker.
The problem seems to be related to the amount of received messages.

Like I said, the design choice made by Domoticz back in the day to have a single topic for all messages sent to Domoticz and one for messages sent by Domoticz isn't really a scalable choice.

So if you only receive data meant for the single node receiving it, then I don't see how it will become an issue.

However if you receive > 10 messages a second then it will become an issue for WiFi connected devices as you typically will only receive data once every 102.4 msec.
This can be more frequent, but only if it was already clear that there was more data for you when processing a WiFi beacon.

One trick to keep the WiFi radio of an ESP device more active is to continuously send a ping to it from some other host. (not a ping -f, just 1 ping a second)
If this also does make things more stable, then this all might be related to the response-speed to process WiFi data.
So this could be caused by some paradoxical situation where the code to handle messages has been optimized such that the ESP enters some light sleep mode more frequently, thus turning the WiFi radio off more frequently. Typically the ESP will then only listen to every 3rd WiFi beacon (~300 msec), and if a message or ack then gets lost, you may see more disconnects (e.g. an ARP message is missed, leading to switches no longer knowing where a packet needs to be sent to).

@sincze

sincze commented Mar 16, 2024

I think I found a workaround that will not break existing integrations for current Domoticz users who have domoticz/out everywhere in their more powerful hardware integrations.

The answer I think is in here: https://www.domoticz.com/wiki/MQTT#Publish_Topic

"If Domoticz needs to send to more topics add a new MQTT Client Gateway for each topic"

And under the new (additional) MQTT Client Gateway configuration:

Publish Topic:

  • Select the most appropriate for your integration publish on (idx) or (name)

Under Topic In Prefix:

  • To prevent Domoticz from listening to topic domoticz/in (or custom prefix) leave the field empty.

Now you should be able to configure ESPEasy to only subscribe to /domoticz/out/{idx} or /domoticz/out/{name}, without breaking existing integrations or losing the MQTT connection.

Homeassistant already publishes in the following way, so for ESPEasy that would be no issue.

  • homeassistant/out/binary_sensor/motion_sensor_158d00016be4ab/state

@TD-er
Member

TD-er commented Mar 18, 2024

I think the best way would be for ESPEasy to support those newly added structures.
Though I am not sure if this will be an all-or-nothing switch.
Like if you switch to "Index", does the IDX still get included in the message?
If so, then that would by far be the easiest switch a user can make to keep using older installed ESPEasy nodes.

@sincze

sincze commented Mar 18, 2024

I created two additional MQTT Client Gateways in Domoticz, one based on Publish Topic: NAME, the other on Publish Topic: IDX, and monitored the messages using MQTT Explorer.

domoticz/out/Verlichting Keuken (A)

{
	"Battery" : 255,
	"LastUpdate" : "2024-03-18 11:34:56",
	"RSSI" : 5,
	"description" : "Zigbee2MQTT",
	"dtype" : "General",
	"hwid" : "47",
	"id" : "0xa4c138d536d0b4c7_current_zigbee2mqtt",
	"idx" : 1973,
	"name" : "Verlichting Keuken (A)",
	"nvalue" : 0,
	"org_hwid" : "47",
	"stype" : "Current",
	"svalue1" : "0.000",
	"unit" : 1
}

domoticz/out/1973

{
	"Battery" : 255,
	"LastUpdate" : "2024-03-18 11:34:56",
	"RSSI" : 5,
	"description" : "Zigbee2MQTT",
	"dtype" : "General",
	"hwid" : "47",
	"id" : "0xa4c138d536d0b4c7_current_zigbee2mqtt",
	"idx" : 1973,
	"name" : "Verlichting Keuken (A)",
	"nvalue" : 0,
	"org_hwid" : "47",
	"stype" : "Current",
	"svalue1" : "0.000",
	"unit" : 1
}

@sincze

sincze commented Mar 18, 2024

To make things a little more interesting, I started parsing the Mosquitto log into InfluxDB to monitor for reconnects.

@TD-er
Member

TD-er commented Mar 30, 2024

Can you test the latest testbuild to see if it is now fixed? (or at least working better/more stable)
See the webflasher


8 participants