Solving the Cardano node’s huge memory usage - done

Do you use it on the BP node too?

Thanks a lot for the clean documentation and the insight, @_2072.
I’m currently testing your parameters and reading the GHC doco too.
I can see you have the same architecture as mine too :wink:

Just to give an update to the community and to @_2072:
This is working great, and your advice is working really well. Reading this doco was also a great help in understanding this part of GHC, which I’d never done before. Thanks heaps, mate!!!

Some pools from armada-alliance have been using these options and have run some experiments as well.

Details here

This is working well on the Raspi with 1.29.0 too. Thanks!
A shoutout video here

Thanks for the feedback!

A few remarks on this great Google sheet:

  • You should include the version of GHC used to compile the node, as the RTS is part of GHC (see the command sketch after this list for one way to find it).
  • Using -I0 effectively disables the periodic GC, so using -Iw600 together with -I0 should have no effect (unless there is some undocumented behavior of -Iw).
  • The options you add between the +RTS and -RTS arguments are added to the default ones the node is compiled with, so not using -T, -A16m, or --disable-delayed-os-memory-return should have no effect (I’ve added them for consistency).
  • It can take up to 16 hours, or even several days, to see the effectiveness of these parameters.
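
One way to read the GHC version off an existing binary (a minimal sketch; it relies on the node being built with RTS options enabled, which it is here) is the --info RTS flag that any GHC-compiled program understands:

cardano-node +RTS --info

Among other key/value pairs, it prints the GHC version the binary was built with, which is exactly what the sheet should record.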

Also note that version 1.29 uses on average 500 MB more RAM than 1.27 (check the live data after a major GC to see the base usage, which is around 2500 MB for 1.29), so you should replace -H2500M with -H3G. Having this value too low, or an -F smaller than 1.5, will trigger many more minor GCs and slow down the node, as it will spend most of its time garbage collecting instead of doing useful work (you can notice this during startup, where you want the smallest possible number of major and minor GCs).

One of my relays is a Raspberry Pi (running Raspbian with the 64-bit kernel), and I am using the following with 1.29 compiled with GHC 8.10.7:

+RTS -N4 --disable-delayed-os-memory-return -I0.3 -Iw300 -A16m -n4m -F1.5 -H3G -T -S -RTS

The -n4m option divides the -A16m allocation area into 4 MB chunks, allowing cores that exhaust their nursery to use other cores’ unused chunks before triggering a minor GC, which is useful when one thread is doing many allocations while the others are idle. The documentation is unclear on whether -n4m is already the default, but from my observations it does seem to decrease the time spent doing GCs a little; more tests would be required to be certain.

I use -Iw300 to limit, as much as possible, major GCs caused by heap exhaustion (heap exhaustion is when the RTS allocates more RAM and never releases it).

One could also play with -I and use -I0.1 to increase the number of opportunities to run the major GC governed by the -Iw parameter (-I0.3 -Iw300 tells the RTS to do a major GC if the node has been idle for at least 0.3 seconds and the last major GC was made at least 5 minutes ago).
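
Put together, a full set of options along those lines (a sketch only; the values mirror the ones discussed above rather than a tuned recommendation) would be:

+RTS -N4 -A16m -n4m -F1.5 -H3G -I0.1 -Iw300 -T -S -RTS

With -I0.1, an idle period of only 0.1 seconds is enough to give the RTS an opportunity for a major GC, while -Iw300 still ensures those collections run at most once every 5 minutes.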

I see that several people have tried the -c parameter. It’s very effective at reducing memory usage, but the CPU cost is huge, and missed blocks and unresponsiveness are inevitable with this parameter (unless you have a very fast CPU).

Thank you for sharing!!

I noticed you’re using ghc-8.10.5. Did you have any issues building with this version?

No issues, and I’m now using 8.10.7 without problems either.

Cool. Just yesterday, 8.10.7 wouldn’t let me compile the Wallet, but I’ll try going with the bleeding edge on my workers.

Thanks for sharing valuable insight here.

Not sure if I missed it somewhere in your post, but what are your nodes’ hardware configurations? For example, memory size, number of CPU cores, etc.?

I tried some of the suggested settings and noticed that the idle GC (-I) causes many missed slot leader checks, which is not a good thing. I can also confirm that the “-c” (compacting) GC causes a lot of missed slot leader checks as well, although both seem to be pretty effective at controlling memory usage.

Has anyone tried the --nonmoving-gc option? It seems to be what we want: performing GC while the program keeps running.

My relays have 8 GB + zram swap and 2 or 4 cores.
My producer has 16 GB (8 GB is too dangerous since 1.29) with 4 cores at 2.4 GHz.

There is a current oddity with 1.29: if you restart your node ~2 days or more into the epoch (I think it’s when the reward calculations start, but this is only a guess), it will take a very long time to start and will use 1 GB of supplementary memory (node live data will stay at 4 GB after a major GC). The RAM usage drops again after about 24 hours. When the node is in this state, a full GC takes more than 1 second on my producer and I also miss leader checks.
As long as the live data is below 3.5 GB, a full GC takes less than 1 second and no leader check is missed on my hardware configuration.
So if you are missing blocks when a full GC runs, it means that your CPU/RAM is too slow (you can check the time taken by garbage collections by looking at the stderr output if you use the -S RTS parameter; see the sketch below).
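
Since -S writes one line per GC to stderr, a simple way to capture those statistics without cluttering the main log is to redirect stderr in the startup script (a sketch; the log path is just an example, and <other node arguments> stands for the usual run options):

/usr/local/bin/cardano-node run +RTS -S -RTS <other node arguments> 2>> /var/log/cardano-gc.log

Each line shows bytes allocated, bytes copied, live bytes, GC user/elapsed times, total user/elapsed times, page-fault counts, and the generation collected; the GC elapsed-time column is the pause that matters for leader checks.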

On my producer node, I’m currently experimenting with -Iw3600, so as to force a major GC at most once per hour and lower the chance of missed blocks when the node decides to go above 3 GB of live data…
I’m also trying higher values of -H: if you have plenty of RAM, you can set it to a high value (corresponding to the peak heap size observed when using the -I and -Iw parameters), as it will limit major GCs caused by heap exhaustion during peak RAM usage. See the sketch after this paragraph.
Ideally, you want a major GC to occur once the RAM usage peak is over and the allocated memory is no longer referenced internally, so as to limit the quantity of RAM the GC has to copy. The garbage collector has to “move” all the remaining non-garbage data, and the less data there is to copy, the faster the GC.
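
Concretely, a producer-side variant of the relay line above could look like this (a sketch; -H6G is an assumption standing in for whatever peak heap you actually observe on your own machine):

+RTS -N4 --disable-delayed-os-memory-return -I0.3 -Iw3600 -A16m -n4m -F1.5 -H6G -T -S -RTS

Here -Iw3600 limits idle major GCs to one per hour, and the large -H postpones heap-exhaustion GCs until the heap genuinely outgrows the observed peak.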

Yes, this seems great, but alas it’s not working as advertised (yet):

  • I’ve seen the node sometimes crash with a segfault when using this option.
  • The GC statistics seem wrong (the time reported during execution does not seem accurate compared to the summary printed when the process exits).
  • Memory fragmentation occurs (in one of my tries, almost 2 GB of memory were wasted).
  • Blocks are missed outside of GCs for unknown reasons…

I have run nodes with the --nonmoving-gc option for about 11 hours and so far have observed remarkable results. See the attached picture. In these 11 hours there were no interruptions in the Grafana curves, and the missed slot leader checks are at 0. By contrast, without this option, I had many interruptions in the Grafana curves and 679 (1.6651%) missed slot leader checks over a similar length of run.

I will keep it running and see whether I hit any of the issues reported by @_2072.

My configuration is a 16 GB, 4-core host running 3 containers (core/producer, relay, webserver), with both in-container firewalls and host-based firewalls outside the containers.

I spoke a little too early. Shortly after my last post, I started to see some crazy behavior on the relay: basically, the memory usage would shoot up to a very high number and then cause the cardano-node to crash/restart. See the pic below.

(screenshot: relay memory usage spiking, then the node crashing and restarting)

# create a 1 GB zram device with 2 compression streams using the lz4 algorithm
zramctl --find --size 1G --streams 2 --algorithm lz4

https://www.google.com/search?q=linux+swap+memory+digitalocean

The non-moving GC works well with 8.10.7; no segfaults so far. Yes, the GC statistics are wrong, including slop (which measures the fragmentation). Blocks can be missed outside GCs because the GC competes with the other threads; limit the GC to N-1 cores to avoid that (see the sketch below). The only remaining problem is surviving memory spikes, which tend to be much larger with 1.30.1. For that you should use zswap (as a cache for your swapping device); don’t bother with zram.
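
For the N-1 cores suggestion, the RTS flag that caps the number of threads used by the parallel GC is -qn, so on a 4-core machine that would look something like this (a sketch with illustrative values):

+RTS -N4 --nonmoving-gc -qn3 -RTS

With -N4 -qn3, one core stays free for the node’s own threads while a parallel collection of the young generations runs.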

Thank you so much :slight_smile:

What do I need to set the -A16m parameter to if I want to use 4 cores?

Great analysis and explanation! Many thanks for this contribution.

I would like to know how we can ensure that the modification is effective (other than by monitoring the memory usage). Does it show with the command “cardano-node +RTS --info”, or is that only useful for knowing the default parameters applied when the node was compiled?

I’ve tried adding +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS to my startReayNode1.sh script, which now looks like this:

#!/bin/bash
DIRECTORY=/home/cardcasb/cardano-my-node
PORT=6000
HOSTADDR=0.0.0.0
TOPOLOGY=${DIRECTORY}/mainnet-topology.json
DB_PATH=${DIRECTORY}/db
SOCKET_PATH=${DIRECTORY}/db/socket
CONFIG=${DIRECTORY}/mainnet-config.json
/usr/local/bin/cardano-node run +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS --topology ${TOPOLOGY} --database-path ${DB_PATH} --socket-path ${SOCKET_PATH} --host-addr ${HOSTADDR} --port ${PORT} --config ${CONFIG}

But my relay still consumes all of the 8 GB of RAM and the 4 GB of swap rather quickly (within a couple of hours), until it becomes slow and unresponsive, and eventually the cardano-node process listening on port 6000 crashes and my relay becomes unreachable.

Am I missing something?

How can I disable the RTS output log? After adding the RTS options to my node runner script, it returns lines like this:
Oct 11 14:08:04 cardano cardano-node[7996]: 1007471536 9553312 11641848 0.135 0.029 65.669 63.605 0 0 (Gen: 0)
I want to remove this output because it makes my log unclear.