Solving the Cardano node’s huge memory usage - done

Sorry, to quote the above:

“I think that GHC 9.2 with the new -Fd , -I0.3 (the default) and -Iw600 will solve this issue completely as it will allow releasing memory to the OS which never happen with GHC 8.10.x.”

I was wondering when GHC 9.2 would be released and, more importantly, how I would get it. Would this be a git pull and rebuild?

Thanks

1 Like

It’s difficult to trigger the problematic behavior on purpose. Some of my relays experienced this “heap leak” and started swapping after just 12 hours, some after 3 days, and some never (like your relay, which was up for 3 weeks with constant RAM usage). It’s true that none of my relays has 16 GB of RAM, so GHC’s RTS might indeed behave differently depending on the amount of free RAM you have, or depending on your GNU/Linux distro.
It may also depend on the kind of requests your node gets from the Internet or locally (ledger dumps, for example), etc. If you can already run your nodes for 3 weeks without increasing RAM usage, then you probably do not need this solution, at least for now. At first only one of my relays was experiencing this, then a few months later a second… At some point I even thought I was the victim of some kind of DoS attack.

You need to check GHC’s website for that. Then IOG will probably have to update their build setup to allow 9.2 to be used (there can be breaking changes between major versions of GHC). Once it’s released, you just need to install it and build the node with it, following the usual instructions with the proper modification to change the compiler version.
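
For reference, once a supported 9.2 release exists, the compiler switch would look roughly like this, assuming a cabal-based build with ghcup managing compilers (the version number is illustrative):

  # Install and activate the new compiler with ghcup
  ghcup install ghc 9.2.1
  ghcup set ghc 9.2.1

  # Point cabal at it, then rebuild the node as usual
  echo "with-compiler: ghc-9.2.1" >> cabal.project.local
  cabal update
  cabal build cardano-node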

2 Likes

Interesting work.

We’ve always found memory management to be fine (except for one leak, which was quickly fixed about 5 node versions ago). Cardano-node will use whatever memory you give it over time (lazy collection), up to about 20 GB. For busy relays keeping track of 60+ inbound and outbound connections we like to allow 16 GB. For block producers, 10-12 GB is fine.

You can get by on less (the minimum spec is 8 GB), but these numbers give us a very stable setup.

1 Like

Hi 2072,

Thanks for posting this. I’ve been very interested in heap size and garbage collection tuning in Java, as in my experience it can indeed make a world of difference. I have no experience with this in Haskell, though.

I was wondering, do you know what the application throughput is with the default GC parameters?

I understand Haskell will inherently produce more garbage since it is a true functional language, but an application throughput of just 64% seems low compared to the numbers I’m used to in Java. That’s why I am wondering. 🙂

Thanks in advance!

I got the following after running a relay with default parameters for an hour.

 22380288   1225752   2285880  0.204  0.209   80.196 4146.292    0    0  (Gen:  1)
     3736                      0.000  0.000

  36,327,280,448 bytes allocated in the heap
   8,365,121,336 bytes copied during GC
   1,552,878,096 bytes maximum residency (20 sample(s))
      31,581,680 bytes maximum slop
            3099 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1197 colls,  1197 par   13.495s   6.840s     0.0057s    0.0697s
  Gen  1        20 colls,    19 par    7.179s   3.752s     0.1876s    1.6108s

  Parallel GC work balance: 33.57% (serial 0%, perfect 100%)

  TASKS: 17 (1 bound, 14 peak workers (16 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   59.519s  (4135.698s elapsed)
  GC      time   20.674s  ( 10.592s elapsed)
  EXIT    time    0.004s  (  0.009s elapsed)
  Total   time   80.199s  (4146.300s elapsed)

  Alloc rate    610,346,326 bytes per MUT second

  Productivity  74.2% of total user, 99.7% of total elapsed
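
For anyone wanting to reproduce this: the per-GC lines and the summary above come from GHC’s -S RTS flag, which streams them to stderr (-s alone prints just the end-of-run summary). A minimal invocation, with placeholder paths, would be:

  cardano-node run --config config.json --topology topology.json \
    --database-path db --socket-path node.socket \
    +RTS -S -RTS

The productivity figure is simply mutator time over total user time: 59.519 / 80.199 ≈ 74.2%.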

2 Likes

What’s the reason that it’s configured to use only 2 cores? Will allocating more be a problem, or is it just unnecessary?

1 Like

This is exactly why Cardano delegators should support small decentralized pools. By supporting a wider range of pools, delegators promote more insight into how to run the Cardano system better. Just chasing perceived better returns, supporting dubious missions, or following the latest influencer is not enough to ensure that Cardano reaches its full potential or even survives.

8 Likes

Thank you so much for this, this is amazing work!

This may be a dumb question, but with the -N parameter: do you have two CPUs in your one relay? If I had one CPU in the relay, would I set it to -N1? Or is this parameter for your setup as a whole, e.g. relay, relay and BP? I am just quite surprised if you have two CPUs in the one relay, haha.

1 Like

If you have a node that uses up to 20 GB, it would be really interesting to try those parameters and tell us the results.

No, I have not run that test, but from my observations this can vary a lot depending on the relay; I’m not sure why (it probably depends on the kind of work and on various hardware specifics).

For example, my most efficient relay is at 80% productivity while the median is 65%; the worst one is a Raspberry Pi at only 34% (probably because memory copies and moves are much slower on this kind of hardware, where GHC still uses LLVM to compile, as far as I know…).

The problem is that each major GC copies all the live data (around 3 GB) every time, even if there is nothing to collect, and with these settings a major GC is forced every 10 minutes when the node is not busy. So this productivity rating mostly shows that CPU is wasted doing GCs while the program is idle.

-N2 is the default setting the node is compiled with. Maybe they don’t know that -N without a number uses all available CPUs, or maybe that is a newer GHC feature, or they simply find -N2 to be enough… Only IOG can answer this question. I have several nodes running with -N4 without problems.
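
If you want to experiment, -N can be overridden at launch like the other RTS flags discussed in this thread (a sketch, with the node’s usual arguments elided):

  # Override the compiled-in -N2; plain -N would use every available core
  cardano-node run ... +RTS -N4 -RTS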

4 Likes

Like the others here I just wanted to thank you for sharing this information that you clearly put a lot of effort into uncovering.

It’s fascinating, and it benefits everyone to better understand the systems we’re using and how they can be optimized. Your attitude about it is also impressive and, I think, says a lot about you and this community. I truly sympathize with the sentiment of your intro, and I’ve already recommended your pool to a few people 🙂

Well done and thank you again!

2 Likes

Wouldn’t it be ‘safer’ to just leave -I0 on for a block producer? Or do you know a reason that -I0 would cause problems in combination with your other tweaks?

How should it adapt to the number of CPUs configured?

2 Likes

Setting -I0 disables the periodic garbage collection and therefore nullifies the effect of the other parameters. With -I0, garbage collection happens only when it’s too late to keep memory usage low and will, as I explained, make the node use more memory each time.

I’m using these parameters on my producer node and I never miss blocks (except at the epoch transition, like everyone else).
Also, performing regular GCs keeps them short and thus prevents longer pauses. If GCs are performed only when the heap is exhausted, you risk missing blocks every time a major GC is triggered (depending on the speed of your CPUs and RAM).

From what I understand, -A16m asks the RTS to allocate 16 MiB per core, so the more cores you use (with -N), the more memory it will use. There is also another parameter, -n, to divide this amount into smaller chunks when running on several cores; it’s supposed to prevent triggering GCs when certain threads allocate faster than others… It’s worth trying, but I haven’t played with it yet.
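
As a rough sizing check of that interaction (my reading of the GHC docs, not anything official):

  +RTS -N2 -A16m -RTS        # nursery ≈ 2 × 16 MiB = 32 MiB in total
  +RTS -N4 -A16m -n4m -RTS   # nursery ≈ 4 × 16 MiB = 64 MiB, handed out in 4 MiB
                             # chunks so a busy core can borrow an idle core's share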

2 Likes

Do you use it on the BP node too?

1 Like

Can you use it on the BP node too?

1 Like

Thanks a lot for the clean documentation and the insight @_2072
I’m currently testing your parameters and reading the GHC doco too.
I can see you have the same architecture as mine too 😉

2 Likes

Just to give an update to the community and to @_2072.
This is working great, and your advice is working really well. Reading this doco was also a great help in understanding a part of GHC that I had never looked into before. Thanks heaps mate!!!

2 Likes

Some pools from the armada-alliance have been using these options and have run some experiments as well.

Details here

This is working well on the Raspberry Pi with 1.29.0 too. Thanks
A shoutout video here

Thanks for the feedback!

A few remarks on this great Google sheet:

  • You should include the version of GHC used to compile the node, as the RTS is part of GHC.
  • Using -I0 effectively disables the periodic GC, so using -Iw600 with -I0 should have no effect (unless there is some undocumented behavior of -Iw).
  • The options you add between the +RTS and -RTS arguments are added to the default ones the node is compiled with, so not using -T, -A16m or --disable-delayed-os-memory-return should have no effect (I’ve added them for consistency).
  • It can take up to 16 hours or even several days to see the effectiveness of these parameters.

Also note that version 1.29 uses on average 500 MB more RAM than 1.27 (check the live data after a major GC to see the base usage, which is around 2500 MB for 1.29), so you should replace -H2500M with -H3G. Having this value too low, or a -F smaller than 1.5, will trigger many more small GCs and slow down the node, as it will spend most of its time garbage collecting instead of doing useful work (you can notice this during startup, where you want the smallest possible number of major and small GCs).
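
To find the live data after a major GC mentioned above: as I understand the -S stream, each collection prints one line under the header below, and the Live column of a (Gen:  1) line is the base heap usage to size -H against:

     Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
     bytes     bytes     bytes   user   elap     user     elap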

One of my relays is a Raspberry Pi (using Raspbian with the 64bit kernel) and I am using the following with 1.29 compiled with GHC 8.10.7:

+RTS -N4 --disable-delayed-os-memory-return -I0.3 -Iw300 -A16m -n4m -F1.5 -H3G -T -S -RTS

The -n4m divides the -A16m allocation area into 4 MB blocks, allowing cores that exhaust their nursery to use other cores’ unused 4 MB blocks before triggering a small GC, which is useful when one thread is doing many allocations while others are idle. The documentation is unclear on whether -n4m is already the default, but from my observations it does seem to decrease the time spent doing GCs a little; more tests would be required to be certain.

I use -Iw300 to limit as much as possible the major GCs caused by heap exhaustion (that is when the RTS allocates more RAM and never releases it).

One could also play with -I and use -I0.1 to increase the number of opportunities to run the major GC governed by the -Iw parameter (-I0.3 -Iw300 tells the RTS to do a major GC if the node has been idle for at least 0.3 seconds and the last major GC happened at least 5 minutes ago).

I see that several people tried the -c parameter. It’s very effective at reducing memory usage, but the CPU cost is huge, and missed blocks and unresponsiveness are inevitable with this parameter (unless you have very fast CPUs).

7 Likes

Thank you for sharing!!

I noticed you’re using ghc-8.10.5. Did you have any issues building with this version?

1 Like

No issues, and I’m now using 8.10.7 without issues either.

1 Like

Cool, 8.10.7 wouldn’t let me compile Wallet just yesterday, but I’ll try going with the bleeding edge on my workers.

1 Like