Solving the Cardano node’s huge memory usage - done

After months of research, worry and despair, I’ve finally solved the memory usage problem on my relays and producer: some of my relays would end up using all of their RAM, start swapping and suffer from IO delays until restarted…

I hesitated a long time before making this post because why in the hell would I share a competitive advantage like this one? All SPOs are competing against each other in a very harsh environment where the rich still have a very unfair advantage over hard-working SPOs, so I should have no incentive to share my findings. But I have already lost nearly all my delegators due to the low stake of my pool and the long streaks of epochs without blocks, so what do I have to lose?

Moreover, keeping this kind of solution to myself goes against my core values, which are based on sharing and truth. Someone said that being honest in a world of thieves is being stupid; but well, maybe, if enough people start to behave honestly, the thieves will be the ones who look stupid in the end…
The same person also said something else: “What I give, I give to myself; what I don’t give, I take from myself. Nothing for me that is not for the others”. (“Lo que doy, me lo doy. Lo que no doy, me lo quito. Nada para mí que no sea para los otros.”)

This last quote is about long-term rewards: the thing you give today will come back to you in an indirect way some day… The biggest problem of our modern societies is their focus on short-term rewards. For example, a few days ago, I discovered that my optometrist had not told me that I could now get glasses with a precision down to 0.01 dioptres (instead of the usual 0.25) by going to certain other places, because he’d lose me as a client… (And I did complain to him about the low precision of the correction.)

If I’m part of the Cardano community, it’s because I want this world to change for the better and grow out of this kind of selfish behavior that’s hurting everyone.

With the coming smart contract support, memory management will probably become even more important, so I decided it was time to give my findings to everyone.

So since I’m giving you this solution for free, if you find it useful, it would be very nice of you to delegate some of your ADA to my pool anfra.io (ANFRA).

This was a long but necessary introduction.

To solve this problem, I had to take the time to study and understand GHC’s memory management, and especially how the garbage collector actually works, so I could tune the RTS parameters accordingly. After reading this you’ll have a much better understanding of what is happening.

By default GHC’s runtime system uses two zones (generations) for its memory, a small hot zone and a big cold zone where long term data is moved from the hot zone after some time.

There are two kinds of garbage collections: the small (minor) GC and the major GC. The small GC only deals with the hot zone and happens as soon as the hot zone is “full”; things that are not changing are “promoted” to the next generation (what I’ve called the cold zone).
The hot zone is limited in size, so the small GCs have to happen very frequently. The second generation, on the other hand, can grow indefinitely and is garbage collected only when it’s full or when a major GC is forced (from inside the program or depending on certain RTS configuration parameters, as we will see later). The trick is that, to optimize things, every time a major GC occurs, the RTS will try to delay the next one by increasing the zone’s size and allocating more memory.
The fastest way to clean up the memory is to copy all the useful data somewhere else while leaving the garbage behind, so the RTS copies all the live data into a new contiguous location and “discards” the old one. To be able to do that, it always needs to keep allocated at least twice as much RAM as the program actually uses.

Stated simply, the problem is that a few versions of the node ago, IOG disabled the automatic major garbage collections that used to happen every time the node had been idle for 0.3s (the RTS default configuration). Since then, the amount of used RAM increases every time a major GC is forced by memory exhaustion. GHC’s memory management is very reluctant to ever release RAM (this is going to be improved in GHC 9.2) and very proactive at allocating it. In the end it can allocate more than 3 times the amount of RAM the program actually needs, and it can go way beyond that if there is a spike in memory demand for some reason, as it will hardly ever release the extra RAM it allocated.
This makes memory exhaustion almost inevitable after some time for very demanding and long running programs like our dear cardano-node.

You can get the default RTS parameters the node is compiled with using the following command:

cardano-node +RTS --info

And you will get something like this:

[("GHC RTS", "YES")
 ,("GHC version", "8.10.5")
 ,("RTS way", "rts_thr")
 ,("Build platform", "aarch64-unknown-linux")
 ,("Build architecture", "aarch64")
 ,("Build OS", "linux")
 ,("Build vendor", "unknown")
 ,("Host platform", "aarch64-unknown-linux")
 ,("Host architecture", "aarch64")
 ,("Host OS", "linux")
 ,("Host vendor", "unknown")
 ,("Target platform", "aarch64-unknown-linux")
 ,("Target architecture", "aarch64")
 ,("Target OS", "linux")
 ,("Target vendor", "unknown")
 ,("Word size", "64")
 ,("Compiler unregisterised", "NO")
 ,("Tables next to code", "YES")
 ,("Flag -with-rtsopts", "-T -I0 -A16m -N2 --disable-delayed-os-memory-return")
 ]

So the default parameters are:

-T -I0 -A16m -N2 --disable-delayed-os-memory-return

-I0 is what disables the automatic periodic major GCs, -N2 is the number of CPUs to use, -T makes RTS statistics available from inside the program, -A16m is the default allocation size for the hot zone’s new blocks, and --disable-delayed-os-memory-return is supposed to make the RTS release memory more visibly, but it has hardly any effect because the RTS almost never releases memory anyway (from what I’ve read, it will be on by default in the next version of GHC).
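To check which flags a given binary was built with without reading the whole list by eye, a small script can pull the -with-rtsopts entry out of the --info output. This is only a sketch: the function name `baked_in_rts_flags` is made up for the example, and it assumes the output keeps the ("key", "value") layout shown above.

```python
import re

def baked_in_rts_flags(info_output: str) -> list:
    """Return the flags from the 'Flag -with-rtsopts' entry, or [] if absent."""
    match = re.search(r'\("Flag -with-rtsopts",\s*"([^"]*)"\)', info_output)
    return match.group(1).split() if match else []

# Two lines copied from the sample output in this post.
sample = ''' ,("Tables next to code", "YES")
 ,("Flag -with-rtsopts", "-T -I0 -A16m -N2 --disable-delayed-os-memory-return")
 ]'''

flags = baked_in_rts_flags(sample)
print(flags)
print("idle GC disabled:", "-I0" in flags)
```

In practice the input would be the captured output of `cardano-node +RTS --info` rather than a pasted sample.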

It’s important to understand that garbage collections are very costly in terms of CPU and have a tendency to “stop the world”. This can be very bad when your program must react to real-time events: if it’s in the middle of a major GC, it will just be frozen and miss the mark. So tuning the GC is a balance between real-time performance and memory usage.

A few GHC versions ago, new parameters were introduced to deal with this balance. The most important one, and the one that probably does 80% of the job, is the -Iw parameter, which is to be paired with the -I parameter. While -I tells the RTS to perform a major GC every time the program has been idle for a given minimum amount of time, -Iw introduces a minimum delay between these major GCs so that your program will not perform a big GC every time it’s idle (very bad for services, and that’s probably why IOG disabled it).
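The way -I and -Iw combine can be written down as a tiny decision rule. The function below is a simplified model of the behaviour as described in this post, not the RTS’s actual timer code; the function and parameter names are invented for the illustration.

```python
def idle_gc_due(idle_for: float, since_last_gc: float,
                idle_delay: float, min_wait: float) -> bool:
    """Model of the idle GC rule: idle_delay is -I, min_wait is -Iw (seconds)."""
    if idle_delay <= 0:
        # -I0 disables the idle GC altogether.
        return False
    return idle_for >= idle_delay and since_last_gc >= min_wait

# With -I0.3 -Iw600: idle long enough, but the last one was only 2 minutes ago.
print(idle_gc_due(0.5, 120, idle_delay=0.3, min_wait=600))   # False
# Same idle time, but more than 10 minutes have passed.
print(idle_gc_due(0.5, 700, idle_delay=0.3, min_wait=600))   # True
# With -I0 nothing is ever triggered by idleness.
print(idle_gc_due(5.0, 9999, idle_delay=0, min_wait=0))      # False
```

Major GCs forced by heap exhaustion are outside this rule and can still happen at any time.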

Another very interesting parameter is -F, which tunes the amount of superfluous RAM the RTS allocates after a major GC occurs, in the hope of delaying the next one and avoiding costly small memory allocations. By default it’s set to 2, which means that if a major GC occurs after your program’s live data has reached 3GB, the RTS will allocate no less than 6GB of RAM in total plus the size of the hot zone, even if your program’s live data ends up being less than 2GB after the GC has cleaned all the garbage.
In GHC 9.2 there is going to be a new RTS option that will gradually decrease this factor so that RAM can be returned to the OS after a memory spike. For now, in GHC 8.10.x, it is fixed from startup.
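To make the effect of -F concrete, here is the arithmetic from the paragraph above as a few lines of Python. It is a rough model under stated assumptions, not a description of the RTS internals: the heap is taken to be the factor times the live data plus the hot zone, and the hot zone is assumed to be one -A sized area per CPU given to -N. The function name is invented.

```python
def heap_after_major_gc(live_gb: float, factor: float = 2.0,
                        nursery_mb: float = 16, capabilities: int = 2) -> float:
    """Rough heap size in GB: factor * live data + one nursery per capability."""
    hot_zone_gb = nursery_mb * capabilities / 1024
    return live_gb * factor + hot_zone_gb

# Compare the default -F2 with -F1.5 for 3 GB of live data, -A16m and -N2.
for factor in (2.0, 1.5):
    estimate = heap_after_major_gc(3.0, factor)
    print(f"-F{factor}: ~{estimate:.2f} GB for 3 GB of live data")
```

Under this model, lowering the factor from 2 to 1.5 saves about 1.5GB for every 3GB of live data, at the price of more frequent major GCs.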

The last useful parameter is -H, which tells the RTS how much memory your program is expected to use. It makes the -A parameter dynamic and lowers the rate of small GCs since the hot zone is exhausted less often. This parameter is very useful to decrease the node’s startup time as it removes hundreds or even thousands of garbage collection runs and small memory allocations.

Lastly, you can add the -S parameter to get some useful statistics on garbage collection.

So without further ado, here are the holy grail RTS parameters that will save you a large amount of money in your server’s RAM and cost me my useless competitive advantage as an SPO :stuck_out_tongue: :

-N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S

These parameters have been tested with GHC 8.10.4 and 8.10.5 on Debian based GNU/Linux systems.

To use these parameters, you need to launch the node like this:

`cardano-node +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS --your-usual-parameters`

For some reason the -S parameter must be set last or it will not work but, as said above, this one is optional.
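If you generate your start command from a script, the “-S goes last” rule is easy to enforce in code. The helper below is a hypothetical sketch (the name `node_command` is mine); it only builds the argument list and does not start anything, and `--your-usual-parameters` is the placeholder from this post, not a real flag.

```python
import shlex

def node_command(rts_opts: list, node_args: list, gc_stats: bool = False) -> list:
    """Wrap the RTS options in +RTS ... -RTS, keeping -S (if requested) last."""
    opts = [opt for opt in rts_opts if opt != "-S"]
    if gc_stats:
        opts.append("-S")
    return ["cardano-node", "+RTS", *opts, "-RTS", *node_args]

rts = ["-N2", "--disable-delayed-os-memory-return", "-I0.3", "-Iw600",
       "-A16m", "-F1.5", "-H2500M", "-T"]

command = node_command(rts, ["--your-usual-parameters"], gc_stats=True)
print(shlex.join(command))
```

The printed line is the same command as the one given above, so it can be pasted into a start script or a systemd unit once the placeholder is replaced with your own node arguments.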

So to sum up:

-N2 tells the RTS to dispatch the program’s threads on 2 CPUs (using -N without a number will make it use all the CPUs available)

--disable-delayed-os-memory-return tells the RTS to use a different and less optimized way to release memory so that GNU/Linux performance tools such as top and htop report the correct amount of resident memory being used (mostly useless as explained above)

-I0.3 tells the RTS to perform a major GC if the node has been idle for 0.3s

-Iw600 tells the RTS to perform the above major GC only if the last one happened more than 10 minutes ago.

-A16m is the size of the new blocks allocated in the hot zone; it should be adapted according to the number of CPUs you configured with the -N parameter

-F1.5 roughly tells the RTS to keep allocated 1.5 times more than the amount of live data it found when it started its last major GC.

-H2500M forces the RTS to allocate 2.5GB of RAM at startup and keep this minimum allocated. (the node’s live data is usually around 2GB, so this seems appropriate)

-T makes the RTS statistics available programmatically.

-S makes the node print various garbage collection statistics on the standard error output

So with these parameters, after 17 days of uptime here are the stats of one of my relays:

Transactions Processed: 475,701 - Uptime: 17d 02h 59' 58" - Mem: 5,886,963,712 - maxMem: 5,905,235,968 - Live: 1,740,568,168 - Heap: 5,845,811,200 - GCs: (3,834-41,791-36%)

3834 is the number of major GCs and 41791 the number of small GCs performed. 36% is the amount of CPU time that has been “wasted” by the GC instead of being used for useful computations (this is an important metric to watch for when playing with GC parameters).
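Two derived numbers are handy when comparing parameter sets: the ratio between the heap and the live data, and the average time between major GCs. The script below computes both from the status line above. It is a sketch that assumes this exact line format, which comes from my own monitoring output, so the regular expressions will need adapting to whatever your tooling prints.

```python
import re

line = ("Transactions Processed: 475,701 - Uptime: 17d 02h 59' 58\" - "
        "Mem: 5,886,963,712 - maxMem: 5,905,235,968 - Live: 1,740,568,168 - "
        "Heap: 5,845,811,200 - GCs: (3,834-41,791-36%)")

def number(label: str) -> int:
    """Read the comma-grouped integer that follows 'label: '."""
    return int(re.search(rf"{label}: ([\d,]+)", line).group(1).replace(",", ""))

# Convert the uptime field (days, hours, minutes, seconds) to seconds.
days, hours, minutes, seconds = map(
    int, re.search(r"Uptime: (\d+)d (\d+)h (\d+)' (\d+)\"", line).groups())
uptime_s = ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# The first number inside "GCs: (...)" is the count of major GCs.
major_gcs = int(re.search(r"GCs: \(([\d,]+)-", line).group(1).replace(",", ""))

print(f"heap / live data: {number('Heap') / number('Live'):.2f}")
print(f"average time between major GCs: {uptime_s / major_gcs:.0f} s")
```

For this relay that works out to a heap of roughly 3.4 times the live data and a major GC about every 6.4 minutes on average.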

Note that I also played with the new mark-and-sweep garbage collection mode (-xn parameter), which uses a different method than simply copying the whole memory, but it makes the node unstable and segmentation faults occur as of GHC 8.10.5.

I also played with the -c parameter, but it makes a producer node miss leader checks, so it should not be used.

I’m not a GHC expert, so some of my understanding may be wrong. Don’t hesitate to correct me if you know better; I’m always happy to learn new things.

There are probably even better combinations of these parameters. If you want to help, I recommend reading the following pages:

Coming in GHC 9.2 and a good primer on GHC’s current memory management:
Improvements to memory usage in GHC 9.2 - Well-Typed: The Haskell Consultants

The documentation of the current version of GHC, explaining each RTS parameter in detail:
5.7. Running a compiled program — Glasgow Haskell Compiler 9.0.1 User’s Guide

Another interesting read about the -xn parameter and current GC algorithm:
Low-latency garbage collector merged for GHC 8.10 - Well-Typed: The Haskell Consultants

Remember, my stake pool ticker is ANFRA, it would love some more ADA delegation if you can spare some :slight_smile:

Wow!!! This seems really useful and interesting. I will test it in the next few days.

Thanks a lot for your time and work with this.

Thanks for the info, but where do I need to make the changes? I have cncli cardano node.

Wow this is an amazing effort! Thanks for that.

Thank you for sharing. Have a look at replacing disk swap with zram. I have had cnode pid using 1gb of virtual compressed swap in ram without breaking a sweat.

https://lists.ubuntu.com/archives/lubuntu-users/2013-October/005831.html

I am not an SPO. I cannot say anything about the technical side, but I can see your attitude.
I am part of the Cardano community, and we need more people like you!
I will spread the word about your honorable action.

Thank you very much!

Brilliant post👏

I’m not going to pretend I fully understand your solution, but I do appreciate the effort you’ve gone to in order to reach a solution after many hours, and the time you took to write this lengthy post.

I particularly love your post for the sharing and benefit it brings to ALL of us in the community. :purple_heart:

There are so many good people within the Cardano community. The sentiments you’ve demonstrated in this post will, I’m sure, be well rewarded.

Best wishes for a successful future👏

Amazing! Will be testing in the raspberry pi community soon.

Thank you for the hard work and amazing attitude!

Mesmerizing :star_struck:

Good on ya! Nice work.

Really appreciate you sharing this after all the obvious time and effort you put into it. People like you make the world a better place - congrats on being you. :confetti_ball:

Really great work @_2072 !! Thanks for sharing.

Btw, I heard that we will shortly get new node software that will no longer hold the whole blockchain in memory but use the hard drive instead. This should decrease memory usage drastically and increase hard drive usage. But even then, this post is golden!!

Impressive! Great information and thank you for sharing.

Thanks for this post /u/_2072. Do you have any more details around what experiments you’ve run and settings etc… that you could share either here or ben.ohanlon@iohk.io?

Thank you everyone for your kind words :slightly_smiling_face:

I do not have many more details about the experiments I’ve made. I can tell you what I have tried that did not work, though:

  • Increasing the number of generations from 2 to 3 resulted in a bigger heap, so I quickly reverted to the default of 2.
  • -c enables a compacting algorithm for collecting the oldest generation. It resulted in a much lower memory footprint but also in much longer GC pauses, which made the nodes less responsive and, on a producer node, would make it randomly miss a few leader checks per hour. It might be usable on machines with a fast CPU, but these machines generally also have enough RAM…
  • -M limits the maximum heap size and raises an internal exception if the limit is reached so that the program can try to stop cleanly (you can set some leeway after the exception to allow the program to stop itself). I tried this, but this parameter also enables -c once a certain threshold is reached, causing the same performance issue.
  • -xn enables the new non-moving GC algorithm and gives better performance overall, but the problem is that it’s unstable (random segmentation faults).
  • I also tried various values for the -F parameter: the lower the value, the more often the GC is triggered by heap overflow and the more frequent memory allocations become, which makes everything very slow, so 1.5 seems a good compromise. Something to try, maybe, would be a lower value with a bigger -H, such as -F1 -H5G, which might allow for a lower heap size (>=5G), but this would need to be carefully tuned to the actual memory requirement of the node, which is difficult to determine…

Apart from this, another thing I noticed is that the memory footprint depends heavily on the garbage collections that happen during the node’s startup, whose timing is mostly random and seems to depend on IO efficiency. If one is lucky and the GCs happen at the right time (before the live data becomes big and right after a lot of garbage has been created), then the memory footprint will stay low.

I’m not sure if there is code in the node to call performMajorGC, but this could also be a solution because, in the end, my solution just consists in making sure a major GC is performed every 10 minutes on average (I’m using -Iw300 to -Iw600 depending on the relay, but values lower than 600 don’t seem to make any difference in heap size).

The best solution would be for the node to call performMajorGC regularly on its own when there is time, instead of waiting for the RTS to do it on heap overflow (this is what makes the heap grow). It could also be called after certain operations that are known to create a lot of garbage. When the node is acting as a producer, we could imagine that it checks a minute in advance whether it is slot leader and, when it knows it has enough time, triggers a major GC…

I think that GHC 9.2 with the new -Fd, -I0.3 (the default) and -Iw600 will solve this issue completely, as it will allow releasing memory to the OS, which never happens with GHC 8.10.x.

Awesome! Thanks for your altruistic share. As a new SPO I am really looking forward to giving this a try. BGOOD.

First of all: this is why it is so good and important to have technically skilled SPOs in the community who don’t hesitate to look out for new and improved things.

I did some testing with the mentioned RTS options (exactly as suggested) on one of my existing mainnet relays and also on the block producer.
Both servers run with 16 GB of RAM.

The resource monitoring was in place before, so the following graphs show the difference between default and suggested tweak mode. The red arrow in the graph marks the restart with new RTS parameters. The blue circles mark the last epoch switch.

As no full epoch has passed yet, I may update these graphs after some days in order to show long-term effects and differences, including both the 36h reward calculation and the epoch switch phase.

This graph shows Linux SNMP physical free bytes (the lower the value, the less free memory is available) on the relay node. It looks like memory consumption was reduced by roughly 1GB.

This is the same server and time range, but by looking at the nodeMem value reported by cardano-node EKG

This relay does restart every 24h to reload a fresh topology.

The BlockProducer had an uptime of around 3 weeks before I restarted it with the new parameters.

Same time range, but nodeMem values reported by the block producer’s cardano-node process.

Here the improvement is only around 300-400 MB after the epoch switch processing. It will take some more days to show how it goes in the long term.

Note: GHC memory management may look a bit different on nodes with less memory. I have lowered one relay to 8 GB and will monitor it for some days in default mode, then switch to the suggested RTS settings, continue monitoring and show the results here…

When does this new version come out? How do we get notifications? Sorry, newbie here.

Thanks!

What version? It’s just an optimization/customization of the current version.