Solving the Cardano node's huge memory usage - done

Here are the results of my experiments:

From all my tests, the best settings for 8GB relays are (a full command-line example follows the explanations below):
-N2 -I0.1 -Iw3600 -A64M -AL128M -n4m -F1.1 -H3500M -O3500M -S

  • -I0.1 -Iw3600 turns on the idle GC so that it is triggered every hour, as soon as the process has been idle for 0.1s. This is useful because it prevents major GCs from being caused only by heap exhaustion, which is guaranteed to increase the heap size (total memory usage) every time it happens (by an amount depending on the -F setting being used).

  • -A64M -AL128M sets the allocation area to 128M in total (for N=2; if N=4 it would be -A32M -AL128M). You want this allocation area to be small so that most of the memory is available for generation 1. -AL is useful to optimize the use of the allocation area when big allocations are made; it should be set to -A times N (it defaults to the -A value, which is suboptimal for parallel processes).

  • -n4m divides the allocation area into smaller 4-megabyte chunks (this setting does not have much impact with N=2 and could probably be left out).

  • -F1.1 limits the growth of the heap when a GC is caused by heap exhaustion; this is key to limiting the increase of memory usage. This setting works because of the high -O: 10% of 3500M is already 350M. Using values lower than 1 with -F is not effective and can lead to “GC loops” where the node spends more time doing garbage collection than executing code.

  • -H3500M enables a dynamic allocation area until the amount of live data reaches 3500M. This helps limit the number of GCs (minor and major), especially at startup, because less data will have to be promoted from generation 0 to generation 1 (since generation 0 GCs will be rarer).

  • -O3500M delays a GC of generation 1 until it holds this amount of data. This setting, combined with -F1.1, limits the growth of the total heap size. Memory will keep increasing, but very slowly (you can have a relay run for more than 30 days without problems).
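For reference, here is how these options can be passed on the command line. The config, topology, paths, and port below are placeholders for your own setup; only the +RTS … -RTS section reflects my actual settings:

```
# Relay invocation sketch; adjust paths and port to your setup.
cardano-node run \
  --config config.json \
  --topology topology.json \
  --database-path db \
  --socket-path db/node.socket \
  --host-addr 0.0.0.0 \
  --port 3001 \
  +RTS -N2 -I0.1 -Iw3600 -A64M -AL128M -n4m -F1.1 -H3500M -O3500M -S -RTS
```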

With the above settings, the memory footprint stays very close to 8GB with a minimum amount of swapping. GC pauses are usually between 1s and 2s, which is acceptable for a relay.

Note that, currently, if you restart a relay within the reward computation period it will not behave well and will use a bigger memory footprint than a relay started before that period…

To further limit the memory usage you can increase the ledger snapshot interval, as @weebl2000 suggested (12 hours is OK). This is important for very cheap relays such as Raspberry Pis because they are really slow and may end up becoming unresponsive for tens of seconds while they swap to execute their GCs…
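For what it’s worth, a sketch of what this could look like in the node configuration JSON; I believe the field is called SnapshotInterval and is expressed in seconds, but treat the name as an assumption and check the configuration docs for your node version:

```
{
  "SnapshotInterval": 43200
}
```

(43200 seconds = 12 hours.)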

The important thing to monitor, with the -S setting, is the time spent doing GCs (the 5th column). Ideally these pauses should be less than a second (mandatory for a producer, or you will miss a leader check every time a pause is >1s).
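To make that concrete: -S prints one line per GC to stderr, and you can also send the statistics to a file by appending a file name (e.g. -Sgc-stats.log). The header looks like the comment below, and the 5th column is the elapsed GC time in seconds. A minimal sketch to flag pauses longer than one second, assuming the statistics were sent to gc-stats.log:

```
#    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
#    bytes     bytes     bytes   user   elap     user     elap
# Keep only the per-GC data lines (they start with a number), then
# print any GC whose elapsed time (column 5) exceeds 1 second.
grep -E '^ *[0-9]' gc-stats.log | awk '$5 > 1.0 { print "long GC pause: " $5 "s" }'
```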

Now for my producer I use the following settings:

-N4 -I0.1 -Iw39600 -A64M -AL256M -n16m -F0.3 -O12G -M23G -c99 -Mgrace=1G -C0 -T -S

Here I use a very big -O (and so I can use -F < 1) and, since the machine has 16GB of RAM, a bigger allocation area. I’ve also set a maximum heap (-M23G -c99 -Mgrace=1G), not so much because I fear the heap will reach this limit, but because one of the side effects of setting -M is that it triggers a major GC as soon as M/2 of live data is reached (11.5GB in this case), which gives a predictable major GC and limits the potential growth of the heap.
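As above, a sketch of the corresponding producer invocation (config, paths, port, and key/certificate names are placeholders); here I also append a file name to -S so the GC statistics don’t get mixed into the node’s stderr output:

```
# Producer invocation sketch; adjust paths, port, and key names.
cardano-node run \
  --config config.json \
  --topology topology.json \
  --database-path db \
  --socket-path db/node.socket \
  --port 3001 \
  --shelley-kes-key kes.skey \
  --shelley-vrf-key vrf.skey \
  --shelley-operational-certificate node.cert \
  +RTS -N4 -I0.1 -Iw39600 -A64M -AL256M -n16m -F0.3 -O12G -M23G -c99 -Mgrace=1G -C0 -T -Sgc-stats.log -RTS
```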

Anyway, those settings allow the node to run without doing a single major GC for approximately 11 hours, or until the first ledger snapshot is made; so for approximately 12 hours if the node is restarted after it has made at least one ledger snapshot (snapshots are the files located in the ledger sub-directory of the database directory).

I’ve made it so my management software monitors the snapshots and knows when a block is due, and it restarts the node accordingly to make sure that the blocks I lead fall within this 11-hour window of tranquility.
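My management software is custom, but the snapshot-monitoring part of the idea is simple. A minimal sketch, assuming a systemd unit called cardano-node and the database in db (both assumptions), and leaving the leader-schedule check as a placeholder:

```
#!/bin/sh
# Hypothetical sketch: restart only right after a fresh ledger snapshot,
# so the ~11-hour major-GC-free window covers the next scheduled block.
DB=db                                       # assumed database directory
NEWEST="$DB/ledger/$(ls -t "$DB/ledger" | head -n 1)"
AGE=$(( $(date +%s) - $(stat -c %Y "$NEWEST") ))
# Restart only if the newest snapshot is under 30 minutes old;
# no_block_due_soon stands in for your own leader-schedule check.
if [ "$AGE" -lt 1800 ] && no_block_due_soon; then
    systemctl restart cardano-node          # assumed service name
fi
```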

The GHC 9.2.1 release notes say that many improvements were made to the parallel GC, which means that this missed-slot problem may go away once the node can be built with GHC 9. IOG also says that they made big improvements to the ledger memory footprint in version 1.33 of cardano-node. So things are going to improve anyway.

I’ve run some other experiments with the non-moving GC (-xn) but, as I said before, this setting is not reliable and may crash the node or use an enormous amount of memory (especially when the GC cannot keep up with the amount of garbage being produced).

In my previous posts I mentioned other experiments with more than 2 generations, but these were not successful. The problem remains the same: the periodic ledger snapshots generate several gigabytes of garbage each time, which will have to be collected eventually; there is no way around that. We can either delay the issue or mitigate it as much as possible, but this is really an optimization issue for the cardano-node developers to work on (which they may have done in version 1.33, but since it’s not released I haven’t tested it yet).
