Describe the bug
Tx spam using large transactions (~100kb) results in excessive node memory usage.
This could happen because of:
Unbounded/non-pooled Go subroutines eventually spinning out of control and crashing the nodes (see the sketch after this list), and/or
Excessive in-memory allocation of certain arrays/slices/queues (the pendingTransactions queue is of particular interest).
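For illustration, here is a minimal sketch of what a bounded worker pattern could look like in Go; the channel-as-semaphore approach and the limit of 128 are assumptions made for the example, not the node's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// handleTx stands in for whatever per-transaction work currently gets its own
// goroutine; it is a placeholder, not the node's actual handler.
func handleTx(tx []byte) {
	_ = tx
}

func main() {
	const maxWorkers = 128 // illustrative cap, not a value taken from the codebase

	sem := make(chan struct{}, maxWorkers) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup

	incoming := make([][]byte, 1000) // stand-in for a stream of ~100kb transactions
	for _, tx := range incoming {
		sem <- struct{}{} // blocks once maxWorkers handlers are in flight
		wg.Add(1)
		go func(tx []byte) {
			defer func() { <-sem; wg.Done() }()
			handleTx(tx)
		}(tx)
	}
	wg.Wait()
	fmt.Println("all transactions handled")
}
```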
This issue has already been reported to @harmony-ek in the P-OPS Telegram channel, but after a discussion with @AndyBoWu today on Discord I was asked to open an issue for it here.
There's already a related open issue for the unbounded/non-pooled subroutines here: harmony-one/harmony#1645
It's also quite possible that the slices/arrays that keep track of pending transactions, cx receipts, etc. are part of the problem. Some of them (especially pendingTransactions) store all of the tx data in memory, and if people routinely spam ~100kb transactions the node process's in-memory consumption grows quite quickly.
Assuming a 20k tx pool limit for pending transactions (I've routinely seen Pangaea node operators with 15-17k pending transactions in their queues), some nodes might end up storing gigabytes of data in memory for that queue alone. 15k pending transactions (including all of the tx data/embedded base64 data) could theoretically amount to ~1.5 GB of pending transaction data in memory, given that all transactions are ~100kb.
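A quick back-of-the-envelope check of that figure, using only the numbers quoted above (nothing measured on a live node):

```go
package main

import "fmt"

func main() {
	const (
		pendingTxs  = 15_000  // pending transactions routinely seen on Pangaea nodes
		txSizeBytes = 100_000 // ~100kb per spam transaction
	)
	totalBytes := pendingTxs * txSizeBytes
	fmt.Printf("~%.1f GB held by the pending queue alone\n", float64(totalBytes)/1e9)
	// prints: ~1.5 GB held by the pending queue alone
}
```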
Add the other slices/arrays/queues to the mix, coupled with the unbounded Go subroutines, and it's probably no surprise that even the m5.large explorer instances (with 7.5 GB of available memory) experience OOM reaping.
After restarting the harmony process on an m5.large explorer node, the process typically only manages to stay alive for 15-20 minutes before getting OOM-reaped by the OS. The process usually caps out at 7.1-7.2 GB of memory before the OS reaps it and a tracelog is output, e.g.: https://gist.github.com/SebastianJ/ad569b1ce48742b2a06117d6c273fa3a
(The tracelog seems to indicate that unbounded subroutines are a major issue.)
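If it helps with triage, goroutine growth can be confirmed independently of the OOM tracelog by logging runtime.NumGoroutine() and exposing the standard net/http/pprof endpoints. A generic sketch (not wired into the harmony binary; the port and interval are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// Log the live goroutine count periodically; a number that climbs steadily
	// under tx spam would point at unbounded goroutine creation.
	go func() {
		for range time.Tick(30 * time.Second) {
			log.Printf("goroutines: %d", runtime.NumGoroutine())
		}
	}()

	// http://localhost:6060/debug/pprof/goroutine?debug=2 then shows where they block.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```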
To Reproduce
Steps to reproduce the behavior:
Spam the network with large (~100kb) transactions
Let the network start processing these transactions
Wait for nodes to start getting OOM-reaped (explorer nodes seem to get OOM-reaped much earlier than regular nodes with less memory - presumably because explorer nodes perform more processing- and memory-intensive tasks)
Expected behavior
The network should be able to cope with a massive volume of transactions, both in terms of the number of transactions and their size in bytes.
Environment (please complete the following information):
Explorer nodes:
OS: Linux 4.14.77-81.59.amzn2.x86_64 #1 SMP Mon Nov 12 21:32:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Additional info
Just as an experiment, I switched all explorer nodes to systemd units for starting the harmony binary so that they auto-restart after getting OOM-reaped.
So far that has only worked for the shard 1 explorer. None of the other explorer nodes manage to sync or display blocks properly on the explorer Web UI - they seem to get stuck in a perpetual state of trying to sync and then losing the sync status when they get OOM-reaped. Shard 1 somehow manages to get past this state.
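For reference, the auto-restart behavior comes from a unit roughly like the one below; the service description, user, paths, and the bare ExecStart without arguments are placeholders for my own setup, not an official unit file:

```ini
[Unit]
Description=Harmony explorer node (example)
After=network-online.target

[Service]
# User, WorkingDirectory and ExecStart are placeholders for my own layout;
# the real ExecStart would carry the node's actual arguments.
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/home/ec2-user/harmony
# Restart whenever the process exits abnormally, e.g. after an OOM kill.
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```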
There's also a related issue regarding large transactions and the Web UI here: harmony-one/harmony#1676