Solving the Cardano node huge memory usage - done

OzTrada · 23 December 2021 04:18

Thanks for the extraordinary research.

From my testing on my hardware, the nonmoving-gc does not seem to cause missed slot checks at all until memory usage gets high, whereas the copying gc often, but not always, causes 1-2 missed slots when it runs.

I can just let the block producer node run with the nonmoving-gc for a couple of days with no missed slot checks. However, the system starts running slower, presumably because the nonmoving-gc doesn't manage memory as well (higher memory use, and maybe more fragmentation?).

For example, I have been letting my Intel Xeon E-2276ML node with 2 vCPUs, 16GB RAM and 16GB swap run for 3 days without a restart, since before the epoch transition. It only missed slots during the epoch transition and has missed none since.

However, it now seems to be running slowly: I just re-ran a leaderlog for the current epoch and this caused 15 missed slot checks, whereas running a leaderlog does not normally result in any missed slot checks on my setup.

The node is running with "+RTS -C0 -N -I0 --nonmoving-gc -RTS"

Memory usage is high but stable:
Tue 21 Dec 2021 10:04:57 (0.5 days after start)

cardano-node +RTS -C0 -N -I0 --nonmoving-gc -RTS
               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992

Thu 23 Dec 2021 12:44:11 (2.5 days after start)

               total        used        free      shared  buff/cache   available
Mem:        16393504    13827724      173360          20     2392420     2276652
Swap:       17039352    13417816     3621536
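To make the "high but stable" comparison easier to read, here is a minimal Python sketch that parses the two snapshots above and works out the used percentages. It assumes the figures are in KiB (the default unit of `free`); the function and variable names are just illustrative and not part of any Cardano tooling.

```python
# Compare the two `free` snapshots from this post.
# Assumption: values are in KiB, which is the default unit printed by `free`.

SNAPSHOT_DAY_0_5 = """\
               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992
"""

SNAPSHOT_DAY_2_5 = """\
               total        used        free      shared  buff/cache   available
Mem:        16393504    13827724      173360          20     2392420     2276652
Swap:       17039352    13417816     3621536
"""


def parse_free(text):
    """Return {'Mem': {...}, 'Swap': {...}} mapping column names to integers."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    rows = {}
    for line in lines[1:]:
        label, *values = line.split()
        # zip() stops at the shorter sequence, so the Swap row
        # only receives the total/used/free columns.
        rows[label.rstrip(":")] = dict(zip(columns, map(int, values)))
    return rows


def used_percent(row):
    """Percentage of 'total' that is reported as 'used'."""
    return 100.0 * row["used"] / row["total"]


def summarize(text):
    """Headline numbers for one snapshot."""
    rows = parse_free(text)
    return {
        "mem_used_pct": used_percent(rows["Mem"]),
        "swap_used_pct": used_percent(rows["Swap"]),
        "mem_plus_swap_used_kib": rows["Mem"]["used"] + rows["Swap"]["used"],
    }


if __name__ == "__main__":
    for name, snap in [("0.5 days", SNAPSHOT_DAY_0_5),
                       ("2.5 days", SNAPSHOT_DAY_2_5)]:
        s = summarize(snap)
        print(f"{name}: RAM {s['mem_used_pct']:.1f}% used, "
              f"swap {s['swap_used_pct']:.1f}% used, "
              f"RAM+swap used {s['mem_plus_swap_used_kib'] / 1024**2:.1f} GiB")
```

By these numbers, RAM use goes from roughly 97% to roughly 84% of the total while swap use stays near 78-79%, so the combined footprint is not growing between the two samples.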

In summary: I think that running with nonmoving-gc means that ledger snapshots and Haskell garbage collections don't cause missed slots. However, the trade-off is that memory usage is higher (and possibly more fragmented?), which can eventually make the node run slower and miss slot checks later if it is put under additional load.

By the way, running with nonmoving-gc does not result in crashes on my servers.
