Solving the Cardano node's huge memory usage - done

Thank you everyone for your kind words :slightly_smiling_face:

I do not have many more details about the experiments I've made, but I can tell you what I tried that did not work:

  • Increasing the number of generations from 2 to 3 resulted in a bigger heap, so I quickly reverted to the default of 2.
  • The -c parameter enables a compacting algorithm for collecting the oldest generation. It resulted in a much lower memory footprint but also in much longer GC pauses, which made the nodes less responsive and, on a producer node, made it randomly miss a few leader checks per hour. This might be usable on machines with a fast CPU, but those machines generally also have enough RAM…
  • The -M parameter limits the maximum heap size and raises an internal exception if the limit is reached, so that the program can try to stop cleanly (you can set a leeway after the exception to let the program stop itself). I tried this, but this parameter also enables -c once a certain threshold is reached, causing the same performance issue.
  • The -xn parameter enables the new non-moving GC algorithm and gives overall better performance, but the problem is that it's unstable (random segmentation faults).
  • I also tried various values for the -F parameter: the lower the value, the more often the GC is triggered by heap overflow and frequent memory allocations become necessary, which makes everything very slow, so 1.5 seems a good compromise. Something to try might be a lower value with a bigger -H, such as -F1 -H5G, which might allow for a lower heap size (>= 5G), but this would need to be carefully tuned to the actual memory requirements of the node, which are difficult to determine…
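
As an aside, a quick way to double-check which of these GC settings a running Haskell process actually picked up is the GHC.RTS.Flags module from base. This is only a small illustrative sketch (nothing cardano-node specific; the printed values are in the RTS's internal units):

  -- Dump the GC-related RTS flags the process was started with
  -- (generations for -G, oldGenFactor for -F, compact for -c,
  -- idle-GC settings for -I/-Iw, maximum heap for -M, ...).
  import GHC.RTS.Flags (getRTSFlags, gcFlags)

  main :: IO ()
  main = getRTSFlags >>= print . gcFlags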

Apart from this, another thing I noticed is that the memory footprint depends heavily on the garbage collections that happen during node startup, which is mostly random and seems to depend on I/O efficiency. If one is lucky and GCs happen at the right time (before the live data becomes big and right after a lot of garbage was created), then the memory footprint will stay low.

I'm not sure if there is code in the node to call performMajorGC, but this could also be a solution because, in the end, my solution just consists of making sure a major GC is performed on average every 10 minutes (I'm using -Iw300 to -Iw600 depending on the relay, but values lower than 600 don't seem to make any difference in heap size).

The best solution would be for the node to call performMajorGC regularly on its own when there is time, instead of waiting for the RTS to do it on heap overflow (this is what makes the heap grow). It could also be performed after certain operations that are known to create a lot of garbage. When the node is acting as a producer, we could imagine that it checks the slot-leader schedule a minute in advance and, when it knows it has enough time, triggers a major GC…
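
As a rough sketch of the periodic variant (purely illustrative, not actual cardano-node code): a background thread that calls System.Mem.performMajorGC on a timer is more or less what -Iw600 achieves today when the node is idle; a real producer would only do this when the next leadership check is known to be far enough away.

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Monad (forever)
  import System.Mem (performMajorGC)

  -- Force a major collection every `minutes` minutes.
  periodicMajorGC :: Int -> IO ()
  periodicMajorGC minutes = forever $ do
    threadDelay (minutes * 60 * 1000000)  -- threadDelay is in microseconds
    performMajorGC

  main :: IO ()
  main = do
    _ <- forkIO (periodicMajorGC 10)  -- roughly what -Iw600 gives on an idle node
    -- ... the real node work would run here instead of this sleep ...
    threadDelay maxBound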

I think that GHC 9.2 with the new -Fd, -I0.3 (the default) and -Iw600 will solve this issue completely, as it will allow releasing memory to the OS, which never happens with GHC 8.10.x.

9 Likes

Awesome! Thanks for your altruistic share. As a new SPO I am really looking forward to giving this a try. BGOOD.

First of all: this is why it is so good and important to have technically skilled SPO people in the community, who don't hesitate to look out for new and improved things.

I did some testing with the mentioned RTS options (exactly as suggested) on one of my existing mainnet relays and also on the block producer.
Both servers run with 16 GB of RAM.

The resource monitoring was in place before, so the following graphs show the difference between default and suggested tweak mode. The red arrow in the graph marks the restart with new RTS parameters. The blue circles mark the last epoch switch.

As no full epoch has passed, I may update these graphs after some days, in order to see long-term effects/differences including both the 36h reward calculation and the epoch switch phase.

This graph shows Linux SNMP physical free bytes (the lower the value, the less free memory is available) on the relay node. It looks like memory consumption was reduced by roughly 1 GB.

This is the same server and time range, but looking at the nodeMem value reported by the cardano-node EKG metrics.

This relay does restart every 24h to reload a fresh topology.

The BlockProducer had an uptime of around 3 weeks before I restarted it with the new parameters.

Same time range, but nodeMem values reported by the block producer's cardano-node process.

Here the difference/improvement is only around 300-400 MB after the epoch-switch processing. It will take some more days to show how it goes long term.

Note: GHC memory management may look a bit different on nodes with less memory. I have lowered one relay to 8 GB and will monitor it for some days in default mode, then switch to the suggested RTS settings, continue monitoring, and post the results here…

5 Likes

When does this new version come out? How do we get notifications? Sorry, newbie here.

Thanks!

1 Like

What version? It's just an optimization/customization of the current version.

Sorry, the quote above…

“I think that GHC 9.2 with the new -Fd , -I0.3 (the default) and -Iw600 will solve this issue completely as it will allow releasing memory to the OS which never happen with GHC 8.10.x.”

I was wondering when GHC 9.2 would be released and, more importantly, how would I get it? Would this be a git pull and rebuild?

thanks

It's difficult to trigger the problematic behavior on purpose. Some of my relays experienced this “heap leak” and started swapping after just 12 hours, some after 3 days, and some never (like your relay, which was up for 3 weeks with constant RAM usage). Now it's true that none of my relays have 16 GB of RAM, so GHC's RTS might indeed behave differently depending on the amount of free RAM you have, or depending on your GNU/Linux distro.
It may also depend on the kind of requests your node gets from the Internet or locally (ledger dumps, for example), etc. If you can already run your nodes for 3 weeks without increasing RAM usage, then you probably do not need this solution, at least for now. At first only one of my relays was experiencing this, then a few months later a second… At some point I even thought I was the victim of some kind of DoS attack.

You need to check GHC's website for that; then IOG will probably have to update their build to allow 9.2 to be used (there can be breaking changes between major versions of GHC). Once it's released, you just need to install it and build the node with it, following the usual instructions with the appropriate modification to change the compiler version.

1 Like

Interesting work.

We've always found memory management to be fine (except for one leak, which was quickly fixed about 5 node versions ago). Cardano-node will use whatever memory you give it over time (lazy collection), up to about 20 GB. For busy relays keeping track of 60+ inbound and outbound connections we like to allow 16 GB. For block producers 10-12 GB is fine.

You can get by on less (the minimum spec is 8GB) but these numbers give us a very stable setup.

Hi 2072,

Thanks for posting this. I’ve been very interested in heap size / garbage collection tuning in Java as from my experience it can indeed make a world of difference. I have no experience with this in Haskell though.

I was wondering, do you know what the application throughput is with the default GC parameters?

I understand Haskell will inherently produce more garbage since it is a true functional language. An application throughput of just 64% seems low when compared to the numbers I’m used to in Java. That’s why I am wondering. :slight_smile:

Thanks in advance!

I got the following after running a relay with default parameters for an hour.

 22380288   1225752   2285880  0.204  0.209   80.196 4146.292    0    0  (Gen:  1)
     3736                      0.000  0.000

  36,327,280,448 bytes allocated in the heap
   8,365,121,336 bytes copied during GC
   1,552,878,096 bytes maximum residency (20 sample(s))
      31,581,680 bytes maximum slop
            3099 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1197 colls,  1197 par   13.495s   6.840s     0.0057s    0.0697s
  Gen  1        20 colls,    19 par    7.179s   3.752s     0.1876s    1.6108s

  Parallel GC work balance: 33.57% (serial 0%, perfect 100%)

  TASKS: 17 (1 bound, 14 peak workers (16 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   59.519s  (4135.698s elapsed)
  GC      time   20.674s  ( 10.592s elapsed)
  EXIT    time    0.004s  (  0.009s elapsed)
  Total   time   80.199s  (4146.300s elapsed)

  Alloc rate    610,346,326 bytes per MUT second

  Productivity  74.2% of total user, 99.7% of total elapsed
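
For what it's worth, that 74.2% productivity figure is just the mutator's share of CPU time: 59.519 s of MUT out of 80.199 s total. The same numbers can also be read programmatically via GHC.Stats; a small sketch, assuming the process is started with +RTS -T so the statistics are collected:

  import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

  main :: IO ()
  main = do
    enabled <- getRTSStatsEnabled
    if not enabled
      then putStrLn "Run with +RTS -T -RTS to enable runtime statistics."
      else do
        s <- getRTSStats
        let mut   = fromIntegral (mutator_cpu_ns s) :: Double
            total = fromIntegral (cpu_ns s)
        -- e.g. 59.519 s of mutator time out of 80.199 s total ~ 74.2 %
        putStrLn ("Productivity: " ++ show (100 * mut / total) ++ " %")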

1 Like

What’s the reason that it’s configured to only use 2 cores? Will allocating more be a problem or is that just unnecessary?

This is exactly why Cardano delegators should support small decentralized pools. By supporting a wider range of pools, delegators promote more insights into the better running of the Cardano system. Just chasing perceived better returns, supporting dubious missions or following the latest influencer is not enough to ensure that Cardano reaches its full potential or even survives.

7 Likes

Thank you so much for this, this is amazing work!

This may be a dumb question, but with the -N parameter, do you have two CPUs in your one relay? If I had one CPU in the relay, would I set it to -N1? Or is this parameter for your node setup as a whole, e.g. relay, relay and BP? I am just quite surprised if you have two CPUs in the one relay, haha.

If you have a node that uses up to 20GB it would be really interesting to try those parameters and tell us the results.

No, I have not made that test, but from my observations this can vary a lot depending on the relay; I'm not sure why (it probably depends on the kind of work and on various hardware specifics).

For example, my most efficient relay is at 80% productivity while the median is 65%; the worst one is a Raspberry Pi at only 34% (probably because memory copies and moves are much slower on this kind of hardware, where GHC still uses LLVM to compile, as far as I know…).

The problem is that each major GC implies copying the whole live data set (around 3 GB) each time, even if there is nothing to collect, because with these settings a major GC is forced every 10 minutes if the node is not busy. So this productivity rating just shows that CPU is wasted doing GCs while the program is idle.

-N2 is the default setting the node is compiled with; maybe they don't know that -N without a number uses all the CPUs available, or maybe this is a new GHC feature, or they find -N2 to be enough… Only IOG can answer this question. I have several nodes running with -N4 without problems.

3 Likes

Like the others here I just wanted to thank you for sharing this information that you clearly put a lot of effort into uncovering.

It’s fascinating and benefits everyone to better understand the systems we’re using and how they can be optimized. Your attitude about it is also impressive, and I think says a lot about you and this community. I truly sympathize with the sentiment of your intro and I’ve already recommended your pool to a few people :slight_smile:

Well done and thank you again!

1 Like

Wouldn't it be ‘safer’ to just leave -I0 on for a block producer? Or do you know a reason that -I0 will give problems in combination with your other tweaks?

How should it adapt to the number of CPUs configured?

1 Like

Setting -I0 disables the periodic garbage collection and therefore nullifies the effect of the other parameters. With -I0, garbage collection happens only when it's too late to keep the memory usage low and will, as I explained, make the node use more memory each time.

I'm using these parameters on my producer node and I never miss blocks (except at the epoch transition, like everyone else).
Also, performing regular GCs ensures that they stay short and thus prevents longer pauses. If GCs are performed only when the heap is exhausted, then you risk missing blocks every time a major GC is triggered (depending on the speed of your CPUs and RAM).

From what I could understand, -A16m asks the RTS to allocate 16 MiB per core; thus, the more cores you use (with -N), the more memory it will use. In fact, there is another parameter, -n, to divide this amount into smaller chunks when running on several cores; it's supposed to prevent triggering GCs when certain threads are allocating faster than other cores… It's worth trying, but I haven't played with it for now.
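
As a rough illustration of that multiplication (the 16 MiB figure below is just a hypothetical value matching -A16m): the allocation areas alone take roughly numberOfCapabilities x allocationAreaSize, so -N4 -A16m already accounts for about 64 MiB before the heap proper.

  import Control.Concurrent (getNumCapabilities)

  -- Hypothetical per-capability allocation area, matching +RTS -A16m.
  allocationAreaMiB :: Int
  allocationAreaMiB = 16

  main :: IO ()
  main = do
    caps <- getNumCapabilities  -- whatever -N resolved to at startup
    putStrLn ("Approximate total nursery size: "
              ++ show (caps * allocationAreaMiB) ++ " MiB"
              ++ " (e.g. -N4 -A16m => 64 MiB)")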

1 Like

Do you use it on BP node too?

Can you use it on BP node too?

Thanks a lot for the clean documentation and the insight @_2072
I’m currently testing your parameters and reading the GHC doco too.
I can see you have the same architecture as mine too :wink:

1 Like

Just to give an update to the community and to @_2072.
This is working great, and your advice is working really well. Reading this doco was also a great help in understanding this part of GHC, which I never did before. Thanks heaps mate !!!

1 Like