Solving the Cardano node huge memory usage - done

OK, new discovery: the ~130MB allocations I observed above are indeed related to the -A flag and are in fact explained by -AL ⟨size⟩. It corresponds to the large object allocation area (which is global and independent of the number of CPUs set with -N, contrary to -A). It is set to the -A value by default.
I could indeed confirm that, as @orpheus-ant explained, every 72 minutes the node will allocate about 2.5GB of memory, triggering many minor GCs and potentially a major GC if it manages to exhaust the older generation.
There is no RTS malfunction with the -O flag, just a coincidence (as explained above by @orpheus-ant).

So I am now trying with -AL set to N*A. In my case (since -N4) this will divide the number of minor GCs by 4, remove useless generation promotions during these big memory allocations, and should enhance the performance of the process during these snapshots!
I’ve just restarted my producer so I will report in 15 hours… (I’m using -N4 --disable-delayed-os-memory-return -A256M -AL1G -n16m -F1 -O18G -M20G -Mgrace=1G -C0 -c99 -T -S in this test. I could also confirm that -Mgrace allows the node to exit gracefully when the heap limit set with -M is exhausted.)
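The -AL = N*A sizing mentioned above works out as follows (a quick sanity check using the values from this run):

```shell
# -N4 with -A256M: the large object area should cover N nursery-sized areas
echo "$((4 * 256))M"   # 1024M, i.e. the -AL1G used here
```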


I’m not sure about the 12 hours, never heard that before. But the 72 mins is accurate. Apart from that a snapshot is taken at least every 50,000 blocks.

4320 seconds is actually the security parameter times two. (k=2160)
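That relationship is easy to check (mainnet k = 2160):

```shell
# the 72-minute snapshot interval equals 2 x k seconds
echo "$((2 * 2160)) s"          # 4320 s
echo "$((2 * 2160 / 60)) min"   # 72 min
```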

https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/src/Ouroboros.Consensus.Storage.LedgerDB.DiskPolicy.html#defaultDiskPolicy

https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-Config-SecurityParam.html#t:SecurityParam

You can change this by setting SnapshotInterval in node configuration.
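For example, a sketch of adding the option with jq (assumptions: the node configuration file is named config.json, and the value is in seconds; check the release notes for your node version before relying on this):

```shell
# Add or override SnapshotInterval (in seconds) in the node configuration
jq '. + {"SnapshotInterval": 43200}' config.json > config.tmp.json \
  && mv config.tmp.json config.json
```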


The 12h was just a reference to your post. Ledger snapshot calculations do not start immediately after node startup; there is a delay. I’m not sure exactly how this delay from startup is calculated, though.

Regarding missed slot leader checks, major GCs are a major contributor but not the only one. At least in my case, during a snapshot event, I have more missed slot leader checks than seconds spent doing GC. You can check whether that is true in your case too (it only started with epoch 304). If it is, you are suffering from the same issue even if you thought the GC was the only reason. If it is not, then your CPU may be fast enough to handle everything fine on a single core. Mine is not; I am running on embedded bare-metal energy-efficient systems (with fast SSDs, so I/O is not causing any delay).

Regardless, the node could use some improvement regarding multi-core efficiency, now that CPU utilization is non-trivial, hence my GitHub issue.

@_2072: sounds good, give us an update on how this goes. Not sure why you set -F1 though. Is this your way of saying “ignore -F and trigger major GCs only based on -O”?

@weebl2000: the 12h mark was a reference to @_2072’s post. I just know there is a delay before snapshot events start. If you have any info on how that is calculated precisely, please share.

Yes, normally -F is ignored until -O is reached; afterward, F * (size of live data at the last garbage collection) is used if it is bigger. Basically, with -F1 I’m telling the RTS that once the -O limit has been reached, there is no need to allocate more than we already have in order to delay the next GC…

Anyway, I need to make another try without the -M option, because it seems that something else is triggering a GC well below the -O limit (at exactly half of -M). The documentation says that things are done when the -M limit is “approaching”…

I could confirm that the snapshot does not trigger any missed slots as long as no major GC is performed on all the ‘garbage’ it creates.

The -AL1G flag does seem to improve things (fewer minor GCs, less promotion) but the main issue remains… (2.5GB of data generated by each snapshot will have to be collected eventually.)

With 2.5GB of garbage generated every 72 minutes, putting the garbage into swap instead of collecting it will not be a practical solution; we are just buying time.
The best option would be for IOG to implement a POSIX signal to trigger a GC (like they did for reloading the topology configuration) so that we can trigger a major GC manually when we know it’s safe. I’ve just made a feature request for this: [FR] - Implement a Posix signal to trigger a major Garbage Collection from the outside · Issue #3394 · input-output-hk/cardano-node · GitHub

I will focus on the more-than-2-generations approach to segment the garbage collection into several shorter operations…


You can get the specifics in the DiskPolicy that I linked in my previous post.

It boils down to:

  1. On first start, a snapshot is taken as soon as the node sees 2160 (k) “blocksSinceLast”, i.e. the number of immutable blocks that had to be applied to get to the chain tip.

  2. After that, once a snapshot has already been made since the last node startup, a new one is taken either when the time since the last snapshot has passed (default 72 min), OR when enough blocks have passed since the last snapshot (hard-coded 50,000) AND those 50,000 blocks have not passed within 6 minutes (i.e. initial sync is happening; don’t snapshot more often than once every 6 minutes).

So, hypothetically: if the ledger snapshot and the immutable DB are in perfect sync on startup, it will take k blocks before a snapshot happens, or around 12 hours with current settings.

After a first snapshot is made (a maximum of about 10 hrs in), it defaults to snapshotting every 72 minutes. If you have a decent CPU and enough memory to replay blocks, it might be worth trying to set SnapshotInterval to 43200, which would mean it always takes around 12 hrs between snapshots.
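The 12 hr figure follows from mainnet’s average block time of about 20 s:

```shell
# k blocks at ~20 s per block on mainnet
echo "$((2160 * 20)) s"          # 43200 s
echo "$((2160 * 20 / 3600)) h"   # 12 h
```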

Hopefully this clears things up.


There is another thing to try if someone else has time: add a high -H together with -O so that the allocation area can be very big and can even contain the full snapshot when it is created. This would avoid promoting temporary data from the snapshot only for it to be collected during the next major GC. Something like this:

-N4 -I0 -A256m -AL1G -n16M -F0.5 -H8000M -O8000M -T -S (for a server with 4 cores and 16GB of RAM). -F0.5 will prevent the heap from increasing too much.

I’ll try this myself in a few days but for now I’m testing with -A256M -AL1G -n16m -G3 -F1.5 -H4G -O4G -M14G -Mgrace=1G -C0 -c99


Yes, it does. Thanks!

Quick question: when using the -S option to display statistics, how can you actually see them? Where is the output directed?

To the standard error output, but this option also supports writing to a specific file if you prefer.
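For example (a sketch; the log path is a placeholder, and note there is no space between -S and the file name):

```shell
# Write RTS GC statistics to a file instead of stderr
cardano-node run ... +RTS -N4 -S/var/log/cardano-gc.log -RTS
```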

I am still confused about the -T parameter. Is it used by the service to do some optimization, or is it just for logging purposes?

Just logging.

I put an answer to this on Cardano StackExchange:

Here is the content of that answer re-posted in the hope that contributing will help the search for a fix to this problem:

The things I have found:

-I0

(Default garbage collector with idle garbage collection off):

With this set I don’t get missed slots at all for many hours; then eventually the garbage collector has to run and copy a lot of memory, and I get many missed slots all at once.

-I0.3 -Iw600 

(Default garbage collector with regular sweeps, at least 10 mins apart, triggered when the processor has been idle for more than 0.3 secs)

This results in few missed slots, which occur when the garbage collector runs. However, because it is done more regularly, the amount of memory copying is smaller and takes less time, so there are fewer missed slots each time. But for my setup there were still a couple of missed slots every 10 mins when the GC ran.

--nonmoving-gc

I still get occasional missed slots. Memory use seems to gradually increase over time, and the node eventually uses all available RAM (16GB) and starts using swap. After running 20 hrs, rts_gc_peak_megabytes_allocated is 24G, which seems excessive. The node appears to be running OK with almost no missed slots (4 in 20 hrs).

If you run a small pool with limited processor power/speed then one option is to do the following:

Work out when your allocated slots are by running cncli leaderlog ahead of time. Then plan to restart your block producer cardano-node service 20 minutes before you are due to produce your block. This way you can be sure your memory is fresh and garbage collection won’t be necessary for a while just at the time when you are due to produce your block.
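A sketch of that idea (hypothetical: the SLOT_TIME value would come from your cncli leaderlog output, and GNU date is assumed):

```shell
# Compute a restart time 20 minutes before the next scheduled leader slot
SLOT_TIME="2021-12-22 03:40"
RESTART_AT=$(date -d "$SLOT_TIME 20 minutes ago" '+%Y-%m-%d %H:%M')
echo "$RESTART_AT"   # 2021-12-22 03:20
# then schedule the restart, e.g.:
#   echo "systemctl restart cardano-node" | at -t "$(date -d "$SLOT_TIME 20 minutes ago" +%Y%m%d%H%M)"
```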

Overall, throwing faster hardware at the problem is an easy fix. A faster processor means memory copying is quicker and garbage collection takes less time, so missed slots are eliminated.

Update 2021-12-21:

After lots of testing, I have concluded that the only RTS option which stops my node from missing slot leader checks is --nonmoving-gc. However, the problem with this setting is that memory is never released back to the OS with GHC version 8.10.7. Unfortunately, we need to wait until cardano-node can be compiled with a later GHC version (9.2.x).

What I currently use is:

+RTS -N --nonmoving-gc -RTS
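For context, RTS options like these are passed on the node’s command line between the +RTS and -RTS markers; a sketch (the paths are placeholders for your own setup):

```shell
cardano-node run \
  --topology topology.json \
  --database-path db \
  --socket-path db/node.socket \
  --config config.json \
  +RTS -N --nonmoving-gc -RTS
```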

As I understand things, the other settings compiled in (set by IOG in the source code) are also enabled. These include “-I0 -A16m”. You can check what your binary has with:

cardano-node +RTS --info
The output for my version currently is:

("Flag -with-rtsopts", "-T -I0 -A16m -N2 --disable-delayed-os-memory-return")

Then I restart cardano-node every few days at a convenient time between block production.

Or check your memory usage periodically with free, and restart cardano-node when you get close to exhausting your RAM (or after that, but before too much swap is used).
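A minimal sketch of that check (Linux only; it reads MemAvailable from /proc/meminfo, and the systemd unit name cardano-node is an assumption):

```shell
# Print available memory; restart the node when it drops below ~1 GB
AVAIL_KB=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
echo "available: ${AVAIL_KB} kB"
# e.g. from a cron job:
# [ "$AVAIL_KB" -lt 1048576 ] && systemctl restart cardano-node
```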

If you just let cardano-node keep running with the nonmoving-gc, then eventually all the RAM and swap is used and the OS kills the process with its out-of-memory killer. I believe this is because GHC version 8.10.7 doesn’t return the collected memory to the OS.

Addition 2021-12-21:

OK, now I am not sure about the last bit. Memory is eventually returned to the OS when using nonmoving-gc. Here are some outputs of free while running continuously with:
+RTS -C0 -N -I0 --nonmoving-gc -RTS

Tue 21 Dec 2021 10:04:57

               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992

Tue 21 Dec 2021 22:02:44

               total        used        free      shared  buff/cache   available
Mem:        16393504    15325260      648972          20      419272      786832
Swap:       17039352    11186532     5852820

Note that 12 hrs have elapsed and RAM and swap usage has reduced a little.


Addition: I am running tests on two different machines:

First is:
Intel(R) Xeon(R) E-2276ML CPU @ 2.00GHz
With 6 cores and 2 threads per core = 12 CPUs
64GB total RAM

Second is:
ARM Cortex-A72 @ 2.00GHz
With 16 cores and 1 thread per core = 16 CPUs
32GB total RAM (will upgrade to 64GB soon)

In both instances, I am running cardano-node in KVM virtual machines with 16GB RAM and 16GB swap. The ARM VMs have 4 vCPUs whereas the Intel Xeon VMs have 2 vCPUs.

Here are the results of my experiments:

From all my tests, the best settings for 8GB relays are:
-N2 -I0.1 -Iw3600 -A64M -AL128M -n4m -F1.1 -H3500M -O3500M -S

  • -I0.1 -Iw3600 turns on the idle GC, triggered at most every hour as soon as the process has been idle for 0.1 s. This is useful because it prevents major GCs from being caused only by heap exhaustion, which is guaranteed to increase the heap size (total memory usage) every time it happens (depending on the -F setting being used)

  • -A64M -AL128M sets the total allocation area to 128M (for N=2; if N=4 it would be -A32M -AL128M). You want this allocation area to be small so that most of the memory is available for generation 1. -AL is useful to optimize the use of the allocation area when big allocations are made; it should be set to A times N (it defaults to A, which is suboptimal for parallel processes)

  • -n4m divides the allocation area into smaller 4-megabyte chunks (this setting does not have much impact with N=2 and could probably be left out)

  • -F1.1 limits the growth of the memory when a GC is caused by heap exhaustion; this is key to limiting the increase of memory usage. This setting works because of the high -O, where 10% of 3500M is already 350M. Using values lower than 1 with -F is not effective and can lead to “GC loops” where the node would spend more time doing garbage collection than executing code

  • -H3500M enables a dynamic allocation area until the amount of live data reaches 3500M. This helps limit the number of GCs (minor and major), especially at startup, because less data will have to be promoted from generation 0 to generation 1 (since generation 0 GCs will be rarer)

  • -O3500M will delay a GC of generation 1 until it reaches this amount of data. This setting, combined with -F1.1, limits the growth of the total heap size. Memory will keep increasing, but very slowly (you can have a relay run for more than 30 days without problems)
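The way -F1.1 and -O3500M interact can be sanity-checked: after a major GC triggered at the -O threshold, the heap is allowed to grow to roughly F times the live data, i.e.:

```shell
# ~10% headroom on top of the 3500M old-generation threshold
echo "$((3500 * 11 / 10))M"   # 3850M
```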

With the above settings, the memory footprint stays very close to 8GB with a minimum amount of swapping. GC pauses are usually between 1s and 2s which is acceptable for a relay.

Note that, currently, if you restart a relay within the reward computation period, it will not behave well and will use a bigger memory footprint compared to a relay started before that period…

To further limit memory usage, you can increase the ledger snapshot interval as @weebl2000 suggested (12 hours is OK). This is important for very cheap relays such as Raspberry Pis because they are really slow and may end up becoming unresponsive for tens of seconds while they swap to execute their GC…

The important thing to monitor, with the -S setting, is the time spent doing GCs (the 5th column); ideally these delays should be less than a second (mandatory for a producer, or you will miss a leader check every time one takes more than 1 s).
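A sketch of how that could be checked automatically (hypothetical: assumes the -S output was redirected to a file named gc.log and that, as described above, the 5th column is the GC pause time in seconds):

```shell
# Print any GC whose pause exceeded one second
awk '$5 + 0 > 1.0 { print "long GC pause:", $5, "s" }' gc.log
```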

Now for my producer I use the following settings:

-N4 -I0.1 -Iw39600 -A64M -AL256M -n16m -F0.3 -O12G -M23G -c99 -Mgrace=1G -C0 -T -S

Here I use a very big -O (and so I can use -F < 1), and since the machine has 16GB of RAM, I use a bigger allocation area. I’ve also set a maximum heap (-M23G -c99 -Mgrace=1G), not so much because I fear the heap will reach this limit, but because one of the side effects of setting -M is that it triggers a major GC as soon as M/2 of live data is reached (11.5GB in this case), which makes for a predictable major GC and limits the potential growth of the heap.

Anyway, those settings allow the node to run without doing a single major GC for approximately 11 hours, or until the first ledger snapshot is made; so for approximately 12 hours if the node is restarted after it has done at least one ledger snapshot (snapshots are the files located in the ledger subdirectory of the database directory).
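A quick way to see when the last snapshot was written (a sketch; db is a placeholder for your --database-path):

```shell
# Most recently written ledger snapshot file
ls -t db/ledger | head -n 1
```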

I’ve made it so my management software monitors the snapshots and knows when a block is due, and it will restart the node accordingly to make sure that leader slots fall within this 11-hour window of tranquility.

It is said in the GHC 9.2.1 release notes that many improvements were made to the parallel GC, which means this missed-slot problem may go away once the node can be built with GHC 9. IOG also says they made many improvements to the ledger memory footprint in version 1.33 of cardano-node. So things are going to improve anyway.

I’ve made some other experiments with the non-moving GC (-xn) but, as I said before, this setting is not reliable and may crash the node or use an enormous amount of memory (especially when the parallel GC cannot keep up with the amount of garbage being produced).

In my previous posts I mentioned other experiments with more than 2 generations, but these were not successful. The problem remains the same: the periodic ledger snapshots generate several gigabytes of garbage each time, which will have to be collected eventually; there is no way around that. We can either delay the issue or mitigate it as much as possible, but this is really an optimization issue for the cardano-node developers to work on (which they may have done in version 1.33, but since it’s not released I haven’t tested it yet).


Thank you all for sharing this info…

My missed slot leader checks have been very high. I made the basic “+RTS -N -RTS” change on my BP and will monitor during this epoch 310…
In the last 15 mins of epoch 307 my missed leader checks went from 164 to 467, and I could not figure out what was going on. My BP was just upgraded to an i9 with 16 cores and 64GB RAM…
Everything you have been posting should help my situation…

Great info

Wait for the next release (1.33.0) it will use less RAM/CPU

From Discord:
The keen-eyed among you may have spotted a new node version (1.33.0) which we tagged today. I’d like to thank everyone in the community involved in testing the release candidates on the public testnet.

Our SRE team is slowly rolling this out to our own mainnet nodes.

HOWEVER, please note this is NOT being recommended for general usage just yet, so we are not officially releasing this on GitHub for now. We are confident in the code, and benchmarking suggests significantly improved speed and performance (including a reduction in memory and much more consistent CPU usage during the reward calculation window).

However, issues can arise and since we won’t have the full DevOps /engineering teams working over the holiday period, we plan to make this an official release - and support it officially – from early January. Thanks for your understanding and patience on this! And have a very Merry Christmas and a Happy New Year!

For the startblockproducingnode.sh change? That was all that I adjusted today…

Or are you saying, just wait for the 1.33 and the GHC change… Both of those should fix many issues for everyone…

The results of these settings and the research behind it are both extraordinary, and my explicit thanks is long overdue for all the work you’ve done on this issue this year :heart_eyes::pray:


Thanks for the extraordinary research.

From what I see with testing on my hardware, the nonmoving-gc seems to not cause missed slot checks at all, until memory usage gets high. Whereas the copying gc results in 1-2 missed slots often, but not always, whenever it runs.

I can just let the block producer node run with the nonmoving-gc for a couple of days with no missed slot checks. However, the system starts running slower, presumably because the nonmoving-gc doesn’t manage the memory as well (memory use higher and maybe more fragmented?).

For example, I have been letting my Intel Xeon E-2276ML node with 2 vCPUs, 16GB RAM, and 16GB swap run for 3 days, without restart, since before the epoch transition. It only got missed slots during the epoch transition and none since.

However, it now seems to be running slowly because I just re-ran a leaderlog for current epoch and this caused 15 missed slot checks. Whereas running a leaderlog does not normally result in missed slot checks on my setup.

The node is running with “+RTS -C0 -N -I0 --nonmoving-gc -RTS”

Memory usage is high but stable:
Tue 21 Dec 2021 10:04:57 (0.5 days after start)

cardano-node +RTS -C0 -N -I0 --nonmoving-gc -RTS
               total        used        free      shared  buff/cache   available
Mem:        16393504    15851756      161420          20      380328      262232
Swap:       17039352    13271360     3767992

Thu 23 Dec 2021 12:44:11 (2.5 days after start)

               total        used        free      shared  buff/cache   available
Mem:        16393504    13827724      173360          20     2392420     2276652
Swap:       17039352    13417816     3621536

In summary: I think that running with the nonmoving-gc means that ledger snapshots and Haskell garbage collections don’t cause missed slots. However, the trade-off is that memory usage is higher (and possibly more fragmented?), and this can eventually result in the node running slower and missing slot checks later if put under additional load.

By the way, running with nonmoving-gc does not result in crashes on my servers.
