Solving the Cardano node huge memory usage - done

Thanks for the extraordinary research.

From what I see testing on my hardware, the nonmoving-gc does not seem to cause missed slot checks at all until memory usage gets high, whereas the copying gc often, but not always, results in 1-2 missed slots whenever it runs.

I can just let the block producer node run with the nonmoving-gc for a couple of days with no missed slot checks. However, the system starts running slower, presumably because the nonmoving-gc doesn’t manage the memory as well (memory use higher and maybe more fragmented?).

For example, I have been letting my Intel Xeon E-2276ML node with 2 vCPUs, 16GB RAM and 16GB swap run for 3 days, without restart, since before the epoch transition. It only got missed slots during the epoch transition and none since.

However, it now seems to be running slowly: I just re-ran a leaderlog for the current epoch and this caused 15 missed slot checks, whereas running a leaderlog does not normally result in any missed slot checks on my setup.

The node is running with “+RTS -C0 -N -I0 --nonmoving-gc -RTS”

Memory usage is high but stable:
Tue 21 Dec 2021 10:04:57 (0.5 days after start)

cardano-node +RTS -C0 -N -I0 --nonmoving-gc -RTS
               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992

Thu 23 Dec 2021 12:44:11 (2.5 days after start)

               total        used        free      shared  buff/cache   available
Mem:        16393504    13827724      173360          20     2392420     2276652
Swap:       17039352    13417816     3621536
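
To put numbers on "high but stable", here is a minimal Python sketch that parses the two `free` snapshots above and prints the used percentage for RAM and swap. The helper names (`parse_free`, `used_pct`) are my own for illustration, and it assumes the default `free` output, where values are in KiB.

```python
def parse_free(text):
    """Parse default `free` output (values in KiB) into {row: {column: int}}."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()  # total, used, free, shared, buff/cache, available
    rows = {}
    for ln in lines[1:]:
        name, *values = ln.split()
        # The Swap row only has total/used/free, so zip() stops after three columns.
        rows[name.rstrip(":")] = dict(zip(header, map(int, values)))
    return rows


def used_pct(row):
    """Used memory as a percentage of the total for one row."""
    return 100.0 * row["used"] / row["total"]


# Snapshots copied from the output above.
SNAP_0_5_DAYS = """
               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992
"""

SNAP_2_5_DAYS = """
               total        used        free      shared  buff/cache   available
Mem:        16393504    13827724      173360          20     2392420     2276652
Swap:       17039352    13417816     3621536
"""

for label, snap in [("0.5 days", SNAP_0_5_DAYS), ("2.5 days", SNAP_2_5_DAYS)]:
    rows = parse_free(snap)
    print(f"{label}: RAM used {used_pct(rows['Mem']):.1f}%, "
          f"swap used {used_pct(rows['Swap']):.1f}%")
```

By this calculation swap use stays just under 80% in both snapshots, which is what I mean by stable.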

In summary: I think that running with nonmoving-gc means that ledger snapshots and Haskell garbage collections don’t cause missed slots. However, the trade-off is that memory usage is higher (and possibly more fragmented?), and this can eventually result in the node running slower and missing slot checks later if it is put under additional load.

By the way, running with nonmoving-gc does not result in crashes on my servers.
