After months of research, worry and despair, I’ve finally solved the memory usage problem that made some of my relays and my block producer use up all their RAM, start swapping and suffer from IO delays until restarted…
I hesitated a long time before making this post, because why on earth would I share a competitive advantage like this one? All SPOs are competing against each other in a very harsh environment where the rich still have a very unfair advantage over hard-working SPOs, so I should have no incentive to share my findings. But I have already lost nearly all my delegates due to the low stake of my pool and the long streaks of epochs without blocks, so what do I have to lose?
Moreover, keeping this kind of solution to myself goes against my core values of sharing and truth. Someone said that being honest in a world of thieves is being stupid; but well, maybe, if enough people start to behave honestly, the thieves will be the ones who look stupid in the end…
The same person also said something else: “What I give, I give to myself; what I don’t give, I take from myself. Nothing for me that is not for the others.” (“Lo que doy, me lo doy. Lo que no doy, me lo quito. Nada para mí que no sea para los otros.”)
This last quote is about long-term rewards: what you give today will come back to you in an indirect way some day… The biggest problem of our modern societies is their focus on short-term rewards. For example, a few days ago I discovered that my optometrist never told me that I could now get glasses with a precision down to 0.01 dioptres (instead of the usual 0.25) by going to certain other places, because he would lose me as a client (and I had complained to him about the low precision of my correction).
If I’m part of the Cardano community, it’s because I want this world to change for the better and grow out of this kind of selfish behavior that’s hurting everyone.
With the coming smart contract support, memory management will probably become even more important, so I decided it was time to share my findings with everyone.
So, since I’m giving you this solution for free: if you find it useful, it would be very nice of you to delegate some of your ADA to my pool anfra.io (ANFRA).
This was a long but necessary introduction.
To solve this problem I had to take the time to study and understand GHC’s memory management, and especially how its garbage collector actually works, so that I could tune the RTS parameters accordingly. After reading this, you’ll have a much better understanding of what is happening.
By default, GHC’s runtime system (RTS) splits its memory into two zones (generations): a small hot zone and a big cold zone to which long-lived data is moved from the hot zone after some time.
There are two kinds of garbage collections: the minor GC and the major GC. The minor GC only deals with the hot zone and happens very frequently, as soon as the hot zone is “full”; objects that survive are “promoted” to the next generation (what I’ve called the cold zone).
The hot zone is limited in size, so minor GCs have to happen very frequently. The second generation, on the other hand, can grow indefinitely and is garbage collected only when it’s full or when a major GC is forced (from inside the program, or by certain RTS configuration parameters, as we will see later). The trick is that, as an optimization, every time a major GC occurs the RTS will try to delay the next one by increasing the zone’s size and allocating more memory.
The fastest way to clean up memory is to copy all the useful data somewhere else while leaving the garbage behind, so the RTS copies all the live data into a contiguous new location and “discards” the old one. To be able to do that, it always needs to keep allocated at least twice as much RAM as is actually used by the program. (For example, with 2 GB of live data, the collector needs another 2 GB to copy into, so at least 4 GB must stay allocated.)
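If you want to see these two generations in action, any GHC-compiled program, the node included, can print a GC summary on exit via the standard `-s` RTS flag (lowercase, a one-shot summary, distinct from the `-S` flag we’ll meet later); a minimal sketch, with figures that will obviously vary:

```
# Run the node with the -s RTS flag, then stop it; a summary is printed on exit
# with the collection counts for Gen 0 (the hot zone) and Gen 1 (the cold zone):
cardano-node +RTS -s -RTS --your-usual-parameters
```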
Stated simply, the problem is that a few versions of the node ago, IOG disabled the automatic major garbage collections that used to happen every time the node was idle for 0.3 s (the RTS default configuration). Since then, the amount of allocated RAM increases every time a major GC is forced by memory exhaustion. GHC’s memory management makes it very reluctant to ever release RAM (this is going to be improved in GHC 9.2) and very proactive at allocating it. So in the end it can allocate more than 3 times the amount of RAM actually needed by the program, and can go way beyond that if there is a spike in memory use for some reason, as it will hardly ever release the extra RAM it allocated.
This makes memory exhaustion almost inevitable after some time for very demanding, long-running programs like our dear cardano-node.
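You can watch this slow growth for yourself; a simple sketch, assuming a GNU/Linux system with a single running cardano-node process:

```
# Print the node's resident memory (RSS, in kB) every 60 seconds:
watch -n 60 'ps -o rss= -p $(pidof cardano-node)'
```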
You can get the default RTS parameters the node is compiled with using the following command:
cardano-node +RTS --info
And you will get something like this:
[("GHC RTS", "YES")
,("GHC version", "8.10.5")
,("RTS way", "rts_thr")
,("Build platform", "aarch64-unknown-linux")
,("Build architecture", "aarch64")
,("Build OS", "linux")
,("Build vendor", "unknown")
,("Host platform", "aarch64-unknown-linux")
,("Host architecture", "aarch64")
,("Host OS", "linux")
,("Host vendor", "unknown")
,("Target platform", "aarch64-unknown-linux")
,("Target architecture", "aarch64")
,("Target OS", "linux")
,("Target vendor", "unknown")
,("Word size", "64")
,("Compiler unregisterised", "NO")
,("Tables next to code", "YES")
,("Flag -with-rtsopts", "-T -I0 -A16m -N2 --disable-delayed-os-memory-return")
]
So the default parameters are:
-T -I0 -A16m -N2 --disable-delayed-os-memory-return
The `-I0` is what disabled the automatic periodic major GCs, `-N2` is the number of CPUs to use, `-T` makes the RTS statistics available from inside the program, `-A16m` is the default allocation size for the hot zone’s new blocks, and `--disable-delayed-os-memory-return` is supposed to make the RTS release memory more visibly, but has hardly any effect because the RTS almost never releases memory anyway (from what I’ve read, it will be on by default in the next version of GHC).
It’s important to understand that garbage collections are very costly in CPU terms and have a tendency to “stop the world”, which can be very bad when your program must react to real-time events: if it’s in the middle of a major GC, it will just be frozen and miss the mark. So tuning the GC is a balance between real-time performance and memory usage.
A few GHC versions ago, new parameters were introduced to deal with this balance. The most important one, the one that probably does 80% of the job, is the `-Iw` parameter, to be paired with the `-I` parameter. While `-I` tells the RTS to perform a major GC every time the program has been idle for a given minimum amount of time, `-Iw` enforces a minimum delay between major GCs, so that your program will not perform a big GC every time it’s idle (very bad for services, and probably why IOG disabled it). For example, `-I0.3 -Iw600` means: collect after 0.3 s of idleness, but at most once every 600 s.
Another very interesting parameter is `-F`, which tunes the amount of superfluous RAM the RTS keeps allocated after a major GC occurs, in the hope of delaying the next one and avoiding costly small memory allocations. By default it’s set to 2, which means that if a major GC occurs after your program’s live data reached 3 GB, the RTS will keep no less than 6 GB of RAM allocated in total (plus the size of the hot zone), even if your program’s live data ends up being less than 2 GB after the GC cleaned up all the garbage.
In GHC 9.2 there is going to be a new RTS option that gradually decreases this factor, so that RAM can be returned to the OS after a memory spike. For now, in GHC 8.10.x, it is fixed at startup.
The last useful parameter is `-H`, which tells the RTS how much memory your program should be or will be using. It makes the `-A` parameter dynamic and lowers the rate of minor GCs, since the hot zone is exhausted less often. This parameter is very useful to decrease the node’s startup time, as it removes hundreds or even thousands of garbage collection runs and small memory allocations.
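If you want to check this effect on your own machine, a rough sketch (assuming you can afford to restart a relay twice; as noted earlier, `-s` prints its summary on exit):

```
# Baseline: note the Gen 0 collection count once the node has settled, then stop it:
cardano-node +RTS -s -RTS --your-usual-parameters
# With a pre-sized heap: the Gen 0 count during startup should be much lower:
cardano-node +RTS -H2500M -s -RTS --your-usual-parameters
```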
Lastly, you can add the `-S` parameter to get some useful statistics about garbage collection.
So, without further ado, here are the holy grail RTS parameters that will save you a large amount of money on your servers’ RAM, and cost me my useless competitive advantage as an SPO:
-N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S
These parameters have been tested with GHC 8.10.4 and 8.10.5 on Debian-based GNU/Linux systems.
To use these parameters, you need to launch the node like this:
`cardano-node +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS --your-usual-parameters`
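If you launch the node through scripts or a systemd unit and would rather not touch the command line, GHC can also read RTS options from the GHCRTS environment variable when the binary was linked with full `rtsopts` (which seems to be the case here, since arbitrary `+RTS` flags are accepted); treat this as an assumption to verify on your own build:

```
# Same flags via the environment (-S is omitted here since it must come last on the command line):
export GHCRTS='-N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T'
cardano-node --your-usual-parameters
```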
For some reason the `-S` parameter must be set last or it will not work, but as said above, this one is optional.
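If the raw statistics clutter your logs, `-S` also accepts an optional file name according to the GHC user’s guide (I haven’t measured the overhead of this on a busy relay, so test it first):

```
# Write the per-GC statistics to a file instead of stderr:
cardano-node +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -Sgc.log -RTS --your-usual-parameters
```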
So, to sum up:
- `-N2` tells the RTS to dispatch the program’s threads on 2 CPUs (using `-N` without a number will make it use all the CPUs available)
- `--disable-delayed-os-memory-return` tells the RTS to use a different and less optimized way to release memory, so that GNU/Linux performance tools such as top and htop report the correct amount of resident memory being used (mostly useless, as explained above)
- `-I0.3` tells the RTS to perform a major GC if the node has been idle for 0.3 s
- `-Iw600` tells the RTS to perform the above major GC only if the last one happened more than 10 minutes ago
- `-A16m` is the size of the new blocks allocated in the hot zone; it should be adapted according to the number of CPUs you configured with the `-N` parameter
- `-F1.5` roughly tells the RTS to keep allocated 1.5 times the amount of live data it found when it started its last major GC
- `-H2500M` forces the RTS to allocate 2.5 GB of RAM at startup and keep this minimum allocated (the node’s live data is usually around 2 GB, so this seems appropriate)
- `-T` makes the RTS statistics available programmatically (see the sketch after this list)
- `-S` makes the node print various garbage collection statistics on the standard error output
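About that `-T` flag: the node uses it to feed its monitoring endpoints, so you can read the live GC figures remotely. A sketch assuming the default node configuration, where the EKG server listens on port 12788 (adjust host and port to your setup):

```
# Ask the EKG endpoint for its JSON snapshot; the GC figures exposed thanks to -T
# live under the rts.gc section (exact metric names may vary between node versions):
curl -s -H 'Accept: application/json' http://127.0.0.1:12788 | jq .
```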
So, with these parameters, after 17 days of uptime, here are the stats of one of my relays:
Transactions Processed: 475,701 - Uptime: 17d 02h 59' 58" - Mem: 5,886,963,712 - maxMem: 5,905,235,968 - Live: 1,740,568,168 - Heap: 5,845,811,200 - GCs: (3,834-41,791-36%)
3834 is the number of major GCs and 41791 the number of minor GCs performed. 36% is the share of CPU time that has been “wasted” on GC instead of being used for useful computations (this is an important metric to watch when playing with GC parameters).
Note that I also played with the new mark-and-sweep garbage collection mode (the `-xn` parameter), which uses a different method than simply copying the whole memory, but as of GHC 8.10.5 it makes the node unstable and causes segmentation faults.
I also played with the `-c` parameter, but it makes a producer node miss leader checks, so it should not be used.
I’m not a GHC expert, so some of my understanding may be wrong; don’t hesitate to correct me if you know better, I’m always happy to learn new things.
There probably exist even better combinations of these parameters; if you want to help, I recommend reading the following pages:
Coming in GHC 9.2 and a good primer on GHC’s current memory management:
Improvements to memory usage in GHC 9.2 - Well-Typed: The Haskell Consultants
The documentation of the current version of GHC, explaining each RTS parameter in detail:
5.7. Running a compiled program — Glasgow Haskell Compiler 9.0.1 User’s Guide
Another interesting read about the `-xn` parameter and the current GC algorithm:
Low-latency garbage collector merged for GHC 8.10 - Well-Typed: The Haskell Consultants
Remember, my stake pool ticker is ANFRA; it would love some more ADA delegation if you can spare some.