
Possible memory leak #233

Closed
tsightler opened this issue Jul 27, 2019 · 13 comments

@tsightler
Contributor

tsightler commented Jul 27, 2019

Possibly similar to issue 220. I currently maintain the tuya-mqtt code and have recently modified it to work with tuyapi 5.1.1 and to support device discovery and protocol 3.3. The modifications are minimal and the code seems to work great; however, early test users are reporting a significant increase in memory consumption, with some reporting 150MB of consumed RAM in just a few hours.

I've spent some time trying to reproduce and, while I haven't yet been able to get to 100+MB, it's pretty easy for me to see 70-80MB of usage in just 8 hours or so, while the very same script rarely exceeded 40-50MB with 4.0.4 even after running for weeks.

One other thing is very difficult for me to explain: the memory leaks have all been reported by people running the script on RPi systems, and I've only been able to reproduce it there. I actually develop in a small CentOS 7 based x86 container, and there the script's memory sits around 40-45MB and, once it gets to that point, starts to increase and decrease like you might expect from standard garbage collection. I cannot for the life of me figure out why architecture would make a difference.

I'm admittedly early in testing, so I could still find that I somehow broke something with my minor code updates (you can see my changes here), but I saw a similar report with the ioBroker code, and users there are still reporting significant memory consumption. Any ideas greatly appreciated.

@tsightler
Contributor Author

Possibly useful: the memory usage seems to increase even when the tuya-mqtt portion of the script is doing basically nothing, just waiting on state changes via tuyapi and MQTT events. When running in debug you see nothing but the normal ping-pong to devices every 10 seconds, with no activity on the devices themselves, yet the memory consumption just continues to increase in small amounts, typically 4-8 bytes at a time.

I will continue testing. I've never had to troubleshoot a memory leak in Node.js, so I'm reading some tutorials, but suggestions are welcome.
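For anyone trying to reproduce this, a minimal sketch of the kind of periodic heap logging that makes the trend visible (plain process.memoryUsage(); not code from tuya-mqtt):

```js
// Log Node's memory counters once a minute so growth over hours is obvious.
const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1) + 'MB';

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    `rss=${toMB(rss)} heapTotal=${toMB(heapTotal)} ` +
    `heapUsed=${toMB(heapUsed)} external=${toMB(external)}`
  );
}, 60 * 1000);
```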

@kueblc
Collaborator

kueblc commented Jul 27, 2019

I'm wondering if it might be related to resolvers management (d269006)

@tsightler
Contributor Author

I'm not sure. I've been doing a lot of testing with this, using the same script and set of devices against various tuyapi versions. It feels like the problem shows up between 4.0.4 and 5.0.1 (I haven't been able to test 5.0.0 due to a bug that seems to make it unusable).

That being said, I've also been investigating why I see such different behavior on my CentOS development machine. Other than the architecture (x86 vs ARM) and Linux distro (CentOS 7 vs Raspbian Buster), the other major difference was that I was using the Nodesource 10.16.0 distribution of Node.js on CentOS, while the RPi was running the stock version (10.15.2) from the Raspbian base repo.

I uninstalled that version and installed the Nodesource Debian distribution of 10.16.0 on the RPi, and now the memory behavior is quite similar between the two systems; at least so far, I can't really see any significant memory growth (it's only been running a couple of hours, though).

I have no idea how memory usage can be so different between two different NodeJS distros, but perhaps Nodesource does some things to optimize garbage collection? I'll continue testing over the next few days and report what I find.

@Apollon77
Collaborator

Thank you very much for all the time invested in research.

Ooh, the Node.js version is an interesting fact. I will ask my affected users about it. Maybe Node.js itself has a leak somehow?

@tsightler
Contributor Author

I'm definitely not sure the problem is related; it's just that the behavior between my dev system and the RPi system is far more similar now, and the total memory use after 2 hours is much lower, but there's still a slight trend upward.

Later this evening I'm going to build a simple test script that does nothing but make persistent connections to a list of devices and sit there listening for events, and see if it reproduces. Then I can easily test each tuyapi version.

I'm currently suspicious of the setInterval used for the ping/pong, which was changed from the setTimeout used in 3.2.3, because I think I see some memory growth in 4.0.4 while 3.2.3 stabilizes almost immediately. But it's all just conjecture at this point, pending test results, which take time.

I've looked through all the changes and that's the only thing that jumps out. I've seen cases where setInterval can cause this kind of issue since it never ends, and the memory seems to increase even just from ping-pong.
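For illustration, the two scheduling styles being compared look roughly like this (a generic sketch, not the actual tuyapi implementation; sendPing is a stand-in):

```js
// Stand-in for however the library actually writes a ping frame.
const sendPing = () => { /* write ping frame to the socket */ };

// setInterval style: one recurring timer per connection; it has to be
// cleared explicitly on disconnect or it keeps firing (and keeps its
// closure alive) indefinitely.
const pingTimer = setInterval(sendPing, 10 * 1000);
// later, on disconnect:
// clearInterval(pingTimer);

// Chained setTimeout style: each ping schedules the next one, so the
// cycle simply stops once nothing reschedules it.
function schedulePing() {
  setTimeout(() => {
    sendPing();
    schedulePing();
  }, 10 * 1000);
}
```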

@Apollon77
Collaborator

Then try changing it back locally and see what it does?!

@tsightler
Contributor Author

tsightler commented Jul 28, 2019

That's exactly my plan. I did prove that the different Node.js versions don't really make that much difference. Yes, they start off with lower memory overall and, for some reason, take longer to begin showing the pattern, but after about 30 minutes they begin growing with the same pattern. Sure, 4 hours in the Nodesource builds are using about 20MB less (~50MB vs ~70MB), but that's only because they started at a lower base memory and, for some reason, took longer to start growing after startup.

So now I'm testing a modified 5.1.1 with the ping-pong code reverted to the old-style method, and running a separate test with 4.0.4 (which still includes the setInterval implementation). Admittedly, it's a complete stab in the dark; I don't see anything wrong in the existing implementation that would cause an issue, but it was easy to test so I figured I might as well give it a shot.

It takes about 60-90 minutes before it's obvious whether the memory is growing or not. When there is no leak, there's a peak memory size right after initialization, and then the size of the process actually drops over the next 60 minutes. When there is a "leak" (I'm still not sure it's really a leak per se, but something is using the memory), there's an initial allocation, then the process shrinks some, but not as much, and then it starts growing again, eventually surpassing the original size and continuing.

I should have results of the existing testing in another few hours.

@tsightler
Contributor Author

Ok, my modified 5.1.1 didn't change anything; after 12 hours the process was still close to 70MB. The same code with 4.0.4 and 5.0.0 was only at 40-45MB. I guess all that proves is that there's definitely something in 5.1.1 that causes the additional memory growth even during ping-pong, with otherwise identical code. I'll have to dig a little deeper into the changes between 5.0.0 and 5.1.1. I'm also rerunning the test with 5.0.3, just to make 100% sure the problem snuck in between 5.0.x and 5.1.1.

I don't really fully grasp the resolver changes, but I'll spend a little time looking at it, maybe my brain will finally get it.

@tsightler
Contributor Author

Have some news. I did a hackish thing and modified tuyapi 5.1.1 to make _sendPing() call client.write(buffer) directly instead of calling _send(), which, based on my meager understanding of the code, means that attempts to send the ping packet are not retried but, more importantly, do not register a resolver. This appears to have completely eliminated the leak.

I definitely don't understand the resolver code; it seems like an attempt to deal with queuing multiple outstanding requests. But I do wonder if that code is actually important for a simple ping. I'm considering modifying _send() to just skip adding the resolver if it's a ping command since, as far as I can tell, we just throw the response away in the packetHandler code anyway.
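Roughly, the hack looks like this (an approximation based on my reading of the 5.1.1 internals, not the exact upstream source; _packPingFrame is a hypothetical stand-in for however the ping buffer is really built):

```js
// Approximation of the change described above, not the real tuyapi code.
_sendPing() {
  // Hypothetical helper standing in for however 5.1.1 encodes the ping frame.
  const buffer = this._packPingFrame();

  // Write straight to the socket, bypassing _send() and therefore the
  // retry logic and resolver bookkeeping that _send() performs.
  this.client.write(buffer);
}
```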

@kueblc
Collaborator

kueblc commented Jul 30, 2019

This is what I suspected. I think your fix should be "good enough" but in the long run I'd like to refactor the resolvers code.

The original intent was to match device requests and responses using sequence numbers so that Promise style callbacks worked naturally, but I don't think that panned out as expected. The device doesn't include the sequence number when sending a STATUS response to a set() command so that use case is moot.
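In generic terms, the pattern looks roughly like this (illustrative only, not our actual code; encodeFrame and the frame shape are made up for the example). When a request type never gets a response matched back by sequence number, such as ping, its entry just sits in the map:

```js
// Generic sequence-number/resolver matching (illustrative only).
const resolvers = new Map();
let seq = 0;

// Made-up stand-in for the real wire format.
const encodeFrame = (sequenceN, payload) =>
  Buffer.from(JSON.stringify({ sequenceN, payload }));

function send(socket, payload) {
  const sequenceN = ++seq;
  return new Promise((resolve, reject) => {
    resolvers.set(sequenceN, { resolve, reject });
    socket.write(encodeFrame(sequenceN, payload));
  });
}

function onResponse(frame) {
  const entry = resolvers.get(frame.sequenceN);
  if (!entry) return;
  entry.resolve(frame.payload);
  // Without this delete, or when the device never echoes the sequence
  // number back (pongs, STATUS after set()), entries accumulate forever.
  resolvers.delete(frame.sequenceN);
}
```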

For now we may just as well include the quick fix, and we'll probably work the larger refactor into v6.0.0 (#169) where we are already planning breaking changes.

@tsightler
Contributor Author

tsightler commented Jul 30, 2019

Ignore this original post, that patch was dumb. I need to quit coding when I'm this tired!

I'll let you guys decide the best way to fix it. I've tried the simple modification I mentioned above (making _sendPing() use client.write() directly) and I've also tried modifying _send() to take a flag (isPing), set by _sendPing(), which causes it to simply return resolve(true) instead of setting a resolver. Both methods seem to work OK.
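A rough sketch of that second approach (again an approximation, not the actual diff; _addResolver stands in for whatever 5.1.1 really does when it registers a resolver):

```js
// Approximation of the isPing change, not the real tuyapi code.
// _sendPing() would call this as: this._send(buffer, { isPing: true })
_send(buffer, { isPing = false } = {}) {
  return new Promise((resolve, reject) => {
    this.client.write(buffer);
    if (isPing) {
      // Pong replies are discarded in the packet handler anyway, so resolve
      // immediately instead of registering a resolver that never gets cleaned up.
      resolve(true);
      return;
    }
    this._addResolver(resolve, reject); // stand-in for the real resolver registration
  });
}
```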

I did see one other behavior that I couldn't explain with the client.write() modification. When I start the tuya-mqtt script I run a simple init that connects to 4 different devices, one every two seconds. Normally this means I see a ping for each device every 10 seconds, so there's a 2 second interval between the pings to each device, then a 4 second pause, then the cycle restarts. For some reason, with this patch, they eventually all become synchronized so that all pings and responses happen at the same time. I don't understand that, and it could have been some fluke, but it was obvious.

So far, I haven't seen that behavior with the second approach.

@codetheweb
Owner

Thanks for spending so much time trying to track down this bug @tsightler. I agree with @kueblc, sounds like the quick fix will work for now and we'll monitor memory usage during the v6 refactor to make sure more leaks don't appear.

If you guys get a PR opened before this weekend I can take a look, otherwise I'll probably be able to make the necessary changes myself over the weekend.

@codetheweb
Owner

Resolved with #236.
