Solving the Cardano node's huge memory usage - done

@kirael I will be interested if you discover any improvements.

But if you are running on a hosted VPS it might be difficult to test reliably, because the service provider is likely to be over-allocating resources. For example, if you don’t have dedicated CPUs then you could have contention with another hosted user, so your results could depend on that user’s instantaneous load, which you can’t know.

2 Likes

Yes that’s true.

Well, for now I’m trying to figure out the best and simplest way to get better control over how the memory is used, with the -H and -O options. Without any options, as far as I understand it, live memory grows until a GC is done, and then the heap allocation grows based on the last live memory size. But maybe I’m wrong. Not that easy to understand :slight_smile:
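(For example, these are roughly the two shapes I’m comparing — the sizes are just placeholders while I experiment, not recommendations, and the “...” stands for the usual node options:)

cardano-node run ... +RTS -N -H8G -RTS    # -H: suggest an overall heap size up front
cardano-node run ... +RTS -N -O8G -RTS    # -O: minimum old-generation size; the heap can grow to about this before a full GC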

1 Like

Hello there o/

Finally, I’ve been running my BP without any missed slots (except during the epoch change, of course) with these settings:

+RTS -N6 --disable-delayed-os-memory-return --nonmoving-gc -A16M -AL96M -T -S -RTS

and this snapshot interval in the mainnet config:

"SnapshotInterval": 86400,
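(For context, here is roughly how these fit into the start command — the file names and paths below are placeholders for your own, and a BP would also pass its usual KES/VRF/opcert options:)

cardano-node run \
  --topology topology.json \
  --config mainnet-config.json \
  --database-path db \
  --socket-path node.socket \
  --port 3001 \
  +RTS -N6 --disable-delayed-os-memory-return --nonmoving-gc -A16M -AL96M -T -S -RTS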

It can run for days without any problems. I know this might not work for everyone, but I wanted to share it with you guys :slight_smile:

2 Likes

This is update 2 of 2, which I might as well share with the community, though many operators will already have observed this for themselves. TL;DR: yes, I believe it is no longer possible for either the BP or relay nodes to perform reliably on 8GB of RAM.

The main reason for this on the relay is the huge spike in memory a little before the epoch boundary, which continues for hours afterward. After the Vasil HFC itself, the 8GB relays needed about 2GB of swap even when rebooted about an hour after the boundary, so I quickly realised they were never going to fit into that footprint again & doubled the capacity of our node servers (RAM, vCPU & disk).

With less connectivity on the BP node, the 8GB usage wasn’t as affected by the epoch transition, but memory demands still increase as much as a day before & a day after that boundary. Even with GHC memory management containing the memory size within 8GB most of the time, the GC would hit page faults during that period, which caused slots to be missed.

Hence the only solution in both cases was to provide more physical memory. It seems ironic that during much of the epoch our new 16GB nodes appear to use less memory than the 8GB nodes did (until sudden jumps after about 12 hours of operation, not coinciding with our ledger dumps)… which I attribute to some kind of time-space optimisation: either there’s a minimum amount of memory (slightly greater than 8GB) needed to stay ahead of that optimisation, or simply doubling the number of vCPUs (which we got when doubling our memory) also made this & every other process more memory efficient.

In any case, here are the GHC RTS parameters which are getting our 16GB, 8-vCPU nodes comfortably through every epoch boundary without missed slots or any latency on the relay. I’ve reduced the heap size from @_2072’s original 16GB BP parameters, to stop the newer node from expanding beyond the 16GB boundary; these are conservative, so -O and -M might even be bumped up again, as we always have several hundred MB of RAM to spare:

  • relay: +RTS -N -I0.1 -Iw3600 -A32M -AL256M -n16m -F0.3 -O11G -M21G -c99 -Mgrace=1G -S./gc.out -RTS

  • BP: +RTS -N -I0.1 -Iw3600 -A32M -AL256M -n16m -F0.3 -O11.5G -M22G -c99 -Mgrace=1G -S./gc.out -RTS

7 Likes

Responding to a query from @Meritus by Twitter PM, so it can be held up for review & benefit others in the same situation if it’s considered valid:

+RTS -N -I0.1 -Iw39600 -A32M -AL256M -n16m -F0.3 -O12G -M23G -c99 -Mgrace=1G -C0 -RTS

Upgrading to 24GB RAM, what might you change about that to make more use of the effort?

The -O option linearly relates to the amount of real memory on the node… with a slope of 1, so every increase in memory should increase the -O option by the same amount. I reverse engineered from the values @_2072 was using above that -O should be set to the difference between real memory and the size of “program” data used by the Cardano Node (roughly the amount of memory you see it using when it first starts up). This difference is the size to which the “heap” of GHC data can grow before a garbage collection is forced.

The value of 12GB above was calculated by taking (roughly) the 4GB used by the Cardano node itself away from our 16GB of real memory. However, it feels more comfortable to reduce this threshold a bit to make sure there’s always at least ½GB of free memory, even across an epoch boundary when relays will use more memory… so this can be reduced by exactly the amount of memory you’d like to see kept free in the free command during peak periods.

So if I were to add 8GB of memory to a 16GB system, I would want to make 8GB more memory available to the “heap” and change the -O option to -O20G… or maybe 19.5G or lower, once I saw how close to the memory limit it was actually getting.

The -M option is an upper limit on heap size, beyond which the node tries to shut down. I assumed this option was added above in case the node runs away with heap memory allocation: to ensure it doesn’t crash leaving a corrupted data structure & a long restart time. Again reverse-engineering from the original author’s posts, I could see that every change in -O was accompanied by a change in -M twice as large… which also makes sense to me because it maintains a safety margin for heap growth roughly equal to the usual heap size itself.

Setting -M too low could cause the node to crash prematurely, while it should be set no higher than the amount of real memory on the system plus the amount of swap (because obviously the “heap” needs to fit within the sum of real + virtual memory). So if you agree with this rationale so far, adding 8GB to the system would add 16GB to the -M figure (or maybe less) for a new value of -M39G on a 24GB system (with at least 16GB swap).
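(As a worked sketch of that arithmetic, using the figures above:)

  -O : real RAM − node base footprint − free headroom
       ≈ 24GB − 4GB − 0.5GB ≈ 19.5GB          → hence -O20G, or a bit lower
  -M : previous -M + 2 × added RAM
       ≈ 23GB + 2 × 8GB = 39GB                → -M39G, which real RAM + swap must cover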


Those are the only two options I would change if the amount of memory on a system changed. Other SPOs trying these configurations for different memory sizes are welcome to post their experiences as a follow-up :nerd_face:

1 Like

Just a few clarifications to avoid speculation:
The -O option just tells the runtime system to let the heap grow up to this size before performing a full GC (this reduces the frequency of full GCs, which is when blocks are missed, depending on the speed of your CPU).
The -M option does 3 things:

  • it sets the maximum heap size
  • it will always trigger a full GC when M/2 is reached, so if you set -M to 22GB, a full GC will be triggered every time the heap reaches 11GB
  • when the percentage indicated by the -c## option is reached, the RTS switches to a compacting GC mode (much slower).

What I wanted from the -M option was the M/2 trigger behavior, which keeps the heap from growing; that’s why I use -c99, to effectively disable the switch to a compacting GC.

Typically -M should be set to 2× -O, the goal being to make full GCs as rare as possible while preventing the heap from growing needlessly. So -Iw3600 is a bad idea, as it will cause a full GC every hour… it should be set to 12 hours (a similar interval to the garbage-intensive periodic snapshot process… which you can set with the SnapshotInterval node config option).
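(Concretely, with a 12-hour window the relay line quoted earlier would become something like this — illustrative only, not a tested set — with the node config’s "SnapshotInterval" set to a similar value such as 43200 seconds:)

+RTS -N -I0.1 -Iw43200 -A32M -AL256M -n16m -F0.3 -O11G -M21G -c99 -Mgrace=1G -S./gc.out -RTS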

2 Likes

For consideration by the general audience: we kept the hourly garbage collection (-Iw3600) to allow our servers to run through the epoch without regular restarts. We do have fast CPUs that can do hourly garbage collections in less than a second of clock time, so the process doesn’t generally miss slots.

With an 11-hour interval, we would have to restart our nodes regularly to keep reducing the heap to a minimum again. This part of our option usage is not speculative & results from correlating any missed slots with the GC output and logs of RAM usage.

We did try it with the 11-hour setting (-Iw39600) and the number of missed slots when it finally did a major GC was unacceptable by comparison, and I saw no practical way of timing this so it wouldn’t coincide with block production. I have found, though, that the 12-hour interval for ledger dumps works well both for the relay & BP.

1 Like

Amazing effort and work!!!

Since your last post, have you come across anything new that has the potential to enable continued usage of 8GB nodes?

Yes… run your pool on one of the testnets instead of the mainnet :sweat_smile:

2 Likes

Lmao brilliant!!

So we have essentially shuffled the barriers to entry… all those who don’t have access to the financial system now need access to 24GB of RAM, either directly or indirectly.

1). As a pool operator, are you able to monitor memory capacity across your pool? Perhaps we could balance excess capacity. Or start an initiative that allows people to stake ADA in exchange for generating new memory capacity that could be pooled and made available to those who don’t have 24GB of RAM. The Foundation could temporarily pay a reward premium to those who stake to incentivize the program. Those who utilize this RAM could be required to pay back the resources over time via ADA to the foundation itself. This way we move burden and responsibility from the individual to the foundation while IOG and the community figure out a viable long term solution.

2). Do you know if anyone has explored snapshotting as a means of lightening node loads?

3). Do you know if IOG formally forecasted RAM requirements? I’ve read a fair amount of specs and papers but do not recall that one off the top of my head.

I will do some research on 2 and 3 while awaiting your response.

With reference to point 2): are you familiar with Mithril? It’s not 100% the same, but it may be a solution for some in one way or another.

To summarize what is causing the node’s high memory usage:

  1. The ledger state (not sure if this is the right term) is kept in memory and thus continuously increases with each new transaction (this is what makes the difference in memory usage between the mainnet and testnets).
  2. The periodic snapshots are generated in CBOR (a binary equivalent of JSON that still contains the keys of the data, which is very inefficient in terms of storage), and the method used to generate them creates several gigabytes of data in memory, which triggers the GC many times in a row.
  3. GHC’s garbage collector is archaic compared to other modern GCs (such as those available in the JVM, like the Shenandoah GC). The default one needs twice the real RAM requirement to do its job, while the alternative to the copying GC (the non-moving one) has severe fragmentation issues which cause huge memory leaks (several gigabytes after a few hours).
    The non-moving GC is also unable to pause the program when more garbage per second is created than it can handle in the same amount of time, so unless you have very fast CPUs or more GC threads than mutator threads, RAM usage may skyrocket at each snapshot generation with the non-moving GC.
  4. GHC 8.10.x never gives back memory to the OS

1 - Can be improved by IOG by using memory-mapped files (this would solve 80% of the issue).
2 - Can be improved by IOG by:

    1. Optimizing the code that generates the snapshots to use much less memory.
    2. Storing the snapshots in a binary format that does not include the keys, just the raw data (as in software written in C, where you can just dump the content of a struct pointer into a file and load it back straight into memory - not easy to do in higher-level languages such as Haskell).

3 - Can only be improved by the GHC dev team, but garbage collectors are among the most complex pieces of software on Earth, so it will take years.
4 - Improvement is coming with GHC 9.x, but with 9.2.8 I still get GHC RTS crashes after about 7 days of continuous run time (with snapshots every 1.5h), so we’re not there yet…
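(As an aside, you can see which GHC a node binary was built with from its version banner, e.g.:)

cardano-node --version
# prints the version, platform and GHC, e.g. "cardano-node 8.7.3 - linux-aarch64 - ghc-9.2" plus the git rev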

3 Likes

@_2072 Thank you for your very informative insights.

I have been using the nonmoving-gc with my low-power ARM machine, but I keep restarting my BP every day or so. Doing things this way ensures I get no missed slot leadership checks. With the standard “stop the world” GC, I get a few missed slot checks when a garbage collection event happens.
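(The daily restart itself is just a scheduled service restart — e.g. a root cron entry along these lines, assuming the node runs as a systemd service called cardano-node and that a brief sync gap at that time is acceptable:)

0 15 * * * systemctl restart cardano-node    # once a day, at a time well away from expected blocks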

I compiled cardano-node using ghc 9.2.6 and the nonmoving-gc does now release memory back to the OS. However, I haven’t tried compiling with ghc 9.2.8. You say that your ghc 9.2.8 compilation crashes after about a week of running. I haven’t seen such crashes with my 9.2.6-compiled version even if I leave my node running for two weeks. However, I haven’t changed the snapshot interval from the default. Do you think that compiling with ghc 9.2.8 is somehow less stable than ghc 9.2.6, or do you think I am just not seeing the crashes because my snapshot interval isn’t as frequent as yours?

I read somewhere that IOG was planning to implement a new memory model for some part of the software. Maybe that was the “binary format not including the keys” you speak of??? Do you have any knowledge about such plans?

You’re welcome!

I don’t know if 9.2.6 is more stable than 9.2.8; it’s a rare random crash that usually happens during snapshots. I’ll try to compile the node with 9.2.6 to see what happens.
Note that I use aggressive memory-release settings, but I’m also using the default snapshot interval.
(I use -I 0.001 -Iw 1800 -Fd 0.5, where the -I and -Iw settings make sure that a GC is performed every 30 minutes)
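(Spelled out on the command line, with the values attached to their flags, that would be something like:

+RTS -N -I0.001 -Iw1800 -Fd0.5 -RTS

where -Fd is the GHC 9.x factor controlling how eagerly unused heap memory is returned to the OS.)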

I heard them talk about it in an SPO Discord call a very long time ago, but the idea was indeed to keep the ‘schematic’ (I don’t remember the exact term they used) outside of the stored data, which is a new thing for higher-level languages but was the default for programs in C or C++ with direct access to RAM.

1 Like

I remember reading some of your excellent posts over a year ago, and I experimented with lots of different RTS settings when I was trying to minimise missed slot leader checks on my lower-powered ARM machine. With the nonmoving-gc on, none of the other RTS settings I tried seemed to add any benefit that I could see. But maybe I was just too gumby to notice?

On compiling cardano-node: I have been compiling against the Debian version of libsecp256k1-1, which is later than the git commit that IOG specifies. IOG’s specified secp256k1 commit is ac83be33, dated Sun Feb 27 18:28:12 2022. Tagged version 0.2.0, which is what Debian uses in the libsecp256k1-1 deb package, is dated Mon Dec 12 17:00:52 2022. There are a huge number of commits between the IOG-specified commit and this 0.2.0 tagged version.

I haven’t seen any problems with my compiled cardano-node. I have been running all my relays and my BP, on mainnet, with cardano-node compiled this way since cardano-node version 1.35.0. What do you think about this? I guess maybe there could be a bug in the IOG-specified version which is not in my secp256k1 library, which could cause my nodes to reject a transaction that other nodes allow?

Indeed, the non-moving GC does not benefit as much as the copying GC from these RTS parameters, because the non-moving GC is non-blocking, so it will not be able to free the memory fast enough to prevent the memory footprint from increasing anyway (when ledger snapshots are created).

I’m not sure about this; I’m using the version specified by IOG. I think that if it were not working you would have had issues and wouldn’t have been able to synchronize your nodes to the current epoch… so it must be OK.

1 Like

Those were exactly my thoughts. The IOG team probably know that none of the subsequent secp256k1 bug fixes / changes affect the code portions they use in cardano-node.

Has anyone been running these (or any other custom RTS config) on recent versions?

Keen to know if we can be more frugal with our resources.

T

I use the same RTS configuration as I last posted above (adjusted for the 24GB RAM footprint, which we still have for both BP & relays) and it still works for the latest release… though the BP needs to be rebooted every couple of weeks or so, as it gradually leaks memory & starts to use swap space.

@Terminada has long been testing the --nonmoving-gc option, which I am leery of because of bad performance on builds with older versions of GHC. That problem may have been fixed by now… but any performance improvement on our own system would be modest (we currently miss about 1 in every 4000 slot leader checks due to GC), so I’m waiting until node version 9 to test that again. :face_with_monocle:

1 Like

This is what a low-powered arm64 machine (Raspberry Pi 4 equivalent, but with 32GB RAM) looks like after running 237 hrs without a restart:

cardano-node 8.7.3 - linux-aarch64 - ghc-9.2                                    
git rev a4a8119b59b1fbb9a69c79e1e6900e91292161e7                                
Node Started: Fri 23 Feb 2024 23:35:43 AEST (Running: 237 hrs 25 mins)          
RTS settings: cardano-node +RTS -N --nonmoving-gc -RTS                                                           
Memory use:                                                                     
               total        used        free      shared  buff/cache   available
Mem:        32812692    23756096     2678484          72     6719108     9056596
Swap:       32260088     3230520    29029568

This is an identical machine running as a block producer 43 minutes after restarting cardano-node:

cardano-node 8.7.3 - linux-aarch64 - ghc-9.2
git rev a4a8119b59b1fbb9a69c79e1e6900e91292161e7
Node Started: Thu 07 Mar 2024 22:54:58 AEST (Running: 0 hrs 43 mins)
RTS settings: cardano-node +RTS -N --nonmoving-gc -RTS
Missed slots: null
Memory use:
               total        used        free      shared  buff/cache   available
Mem:        32812692    10353944    14877036          88     7924104    22458748
Swap:       32260088       33260    32226828

cardano-node was compiled with ghc 9.2.8.

I usually run my block producer on an arm64 machine and it gets zero missed leadership slot checks so long as I restart it every day or so. If I let the block producer run for several days non-stop then it gets a few missed leadership slot checks around the 36hr mark, and additional missed slot checks every day or so thereafter. I guess memory fragmentation builds up when using the nonmoving-gc Haskell garbage collector, which makes its garbage collection slower over time.

However, if I run this lower-powered arm64 machine using the default “stop the world” Haskell garbage collector, then I get even more missed leadership slot checks. No doubt there is less memory fragmentation with the “stop the world” garbage collector, but it takes my arm64 processor too long to copy all the data to the new memory area when the garbage collector decides to reallocate memory. Thus the missed leadership slot checks pile up when the garbage collector runs, and I can’t control when Haskell decides to run it.

The non-moving GC is more manageable for my needs since it doesn’t “stop the world” for as long, and I can control any slowness due to memory fragmentation by restarting cardano-node to ensure it has contiguous memory a few hours before block production.
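(For anyone wanting to time such restarts, the pool’s scheduled slots for the current epoch can be queried with cardano-cli — roughly like the sketch below, though the exact flags can differ between cardano-cli versions, so check its --help:)

cardano-cli query leadership-schedule \
  --mainnet \
  --genesis shelley-genesis.json \
  --stake-pool-id <your-pool-id> \
  --vrf-signing-key-file vrf.skey \
  --current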

1 Like