Solving the Cardano node huge memory usage - done

Thanks for sharing valuable insight here.

Not sure if I missed it somewhere in your post: what is your node's hardware configuration? For example, what are the memory size, number of CPU cores, etc.?


I tried some of the suggested settings and noticed that idle GC (-I) causes many missed slot leader checks, which is not a good thing. I can also confirm that the “-c” (compacting) GC causes a lot of missed slot leader checks as well, although both seem to be pretty good at controlling memory usage.

Has anyone tried the --nonmoving-gc option? This seems to be what we want: performing GC while the program keeps running.


My relays have 8 GB of RAM plus zramswap, with 2 or 4 cores.
My producer has 16 GB (8 GB is too dangerous since 1.29) with 4 cores at 2.4 GHz.

There is currently an oddity with 1.29: if you restart your node about 2 days into the epoch (I think that is when the reward calculations start, but this is only a guess), it will take a very long time to start and will use 1 GB of additional memory (node live data will stay at 4 GB after a major GC). The RAM usage drops again after about 24 hours. While the node is in this state, a full GC takes more than 1 s on my producer and I also miss leader checks.
As long as the live data stays below 3.5 GB, a full GC takes less than 1 s and no leader check is missed on my hardware configuration.
So if you are missing blocks when a full GC runs, it means that your CPU/RAM is too slow (you can check the time taken by garbage collections by looking at the stderr output if you use the -S RTS parameter).
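As a rough sketch of how that check could be automated, the filter below reads `-S` lines and prints the major collections that took longer than a threshold. It assumes the `-S` column layout seen in the log line quoted later in this thread (allocated, copied and live bytes, GC user time, GC elapsed time, total user and elapsed time, two page-fault counters, generation) and counts fields from the end of the line so that a journald prefix does not matter. The name `slow_major_gcs` is made up for this example.

```shell
# Hypothetical helper: list major GCs (Gen: 1) slower than a limit in seconds.
# Assumed -S line layout, counted from the right:
#   ... alloc copied live gc_user gc_elapsed tot_user tot_elapsed pf pf (Gen: N)
slow_major_gcs() {
  awk -v limit="${1:-1.0}" '
    NF >= 11 && $(NF-1) == "(Gen:" {
      gen     = $NF + 0            # the trailing ")" is dropped by the conversion
      elapsed = $(NF-6) + 0        # GC elapsed time in seconds
      live_mb = $(NF-8) / 1048576  # live bytes converted to MB
      if (gen >= 1 && elapsed > limit + 0)
        printf "major GC: %.3fs elapsed, %.0f MB live\n", elapsed, live_mb
    }'
}
```

If the node runs under systemd, something like `journalctl -u cardano-node | slow_major_gcs 1.0` would feed it; the unit name there is an assumption.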

On my producer node, I’m currently experimenting with -Iw3600 so that a major GC is forced at most once an hour, which lowers the chance of missed blocks when the node decides to go above 3 GB of live data…
I’m also trying higher values of -H: if you have plenty of RAM, you can set it to a high value (corresponding to the peak heap size observed when using the -I and -Iw parameters), as it will limit major GCs caused by heap exhaustion during peak RAM usage.
Ideally, you want a major GC to occur once the peak in RAM usage is over and the allocated memory is no longer referenced internally, so as to limit the quantity of RAM the GC has to copy. The garbage collector has to “move” all the remaining non-garbage data (the less data to copy, the faster the GC).
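To make the tuning knobs above easier to vary between runs, here is a minimal sketch that assembles the RTS option string from three values. The flags themselves (-N, -I, -Iw, -A, -H, -T) are the ones used in this thread; the function name `rts_opts` and its defaults are made up for illustration and are not recommendations.

```shell
# Hypothetical helper: assemble an RTS option string from three knobs.
#   $1 = cores for -N
#   $2 = minimum seconds between idle major GCs (-Iw)
#   $3 = suggested heap size in MB (-H), e.g. the observed peak heap
rts_opts() {
  cores="${1:-2}"
  idle_wait="${2:-3600}"
  heap_mb="${3:-2500}"
  printf '+RTS -N%s -I0.3 -Iw%s -A16m -H%sM -T -RTS\n' "$cores" "$idle_wait" "$heap_mb"
}
```

For example, `rts_opts 4 3600 6000` prints `+RTS -N4 -I0.3 -Iw3600 -A16m -H6000M -T -RTS`, which can be pasted into the start script next to the usual `run` arguments.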

Yes this seems great but alas it’s not working as advertised (yet):

  • I’ve seen the node crash with segfault sometimes when using this option.
  • GCs statistics seem wrong (time reported during execution does not seem accurate compared to the summary when the process exits)
  • Memory fragmentation occurs (in one of my tries, almost 2 GB of memory was wasted)
  • Blocks are missed outside of GCs for unknown reasons…

I have run nodes with the --nonmoving-gc option for about 11 hours and so far have observed remarkable results. See the attached picture. In these 11 hours there has been no interruption in the Grafana curves, and the number of missed slot leader checks is 0. By contrast, without this option, I had many interruptions in the Grafana curves and 679 (1.6651%) missed slot leader checks over a similar length of time.

I will keep the nodes running and see whether I run into any of the issues reported by @_2072.

My configuration is 16GB, 4 core host running 3 containers (core/producer, relay, webserver), with both in-container firewalls and outside-container host-based firewalls.


I spoke a little too early. Shortly after my last post I started to see some crazy behavior on the relay. Basically, the memory usage would shoot up to a very high number and then cause the Cardano node to crash/restart. See pic below.

image

zramctl --find --size 1G --streams 2 --algorithm lz4

https://www.google.com/search?q=linux+swap+memory+digitalocean&client=safari&channel=iphone_bm&sxsrf=AOaemvLErvEZwJloCPnC5s0nOt9kXpXkpA%3A1632866997726&ei=tZJTYd3eK5mL9u8P38urcA&oq=linux+swap+memory+digita&gs_lcp=ChNtb2JpbGUtZ3dzLXdpei1zZXJwEAEYATIFCCEQoAEyBAghEBU6BAgAEEc6BQgAEMsBOgYIABAWEB46CAghEBYQHRAeUKVWWJlhYJ9saABwAXgAgAGdAYgBggaSAQMzLjSYAQCgAQHIAQjAAQE&sclient=mobile-gws-wiz-serp


The non-moving GC works well with 8.10.7, no segfaults so far. Yes, the GC statistics are wrong, including slop (which measures the fragmentation). Blocks can be missed outside GCs because the GC is competing with the other threads. Limit the GC to N-1 cores to avoid that. The only problem is surviving memory spikes, which tend to be much larger with 1.30.1. For that you should use zswap (as a cache for your swapping device). Don’t bother with zram.
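One possible way to express "N-1 cores for the GC" is GHC's `-qn` RTS flag, which sets the number of parallel GC threads. Treating `-qn` as the intended mechanism is an assumption, since the post does not name a flag. The sketch below derives the value from the core count; the function name is made up.

```shell
# Hypothetical helper: RTS flags that leave one core free of GC work.
#   $1 = number of cores given to the node (defaults to what nproc reports)
nonmoving_rts_opts() {
  cores="${1:-$(nproc)}"
  if [ "$cores" -gt 1 ]; then
    gc_threads=$((cores - 1))
  else
    gc_threads=1   # -qn needs at least one GC thread
  fi
  printf '+RTS -N%s -qn%s --nonmoving-gc -RTS\n' "$cores" "$gc_threads"
}
```

On a 4-core host, `nonmoving_rts_opts 4` prints `+RTS -N4 -qn3 --nonmoving-gc -RTS`.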


Thank you so much :slight_smile:

What do I need to set the -A16m parameter to if I want to use 4 cores?


Great analysis and explanation! Many thanks for this contribution.

I would like to know how we can make sure that the modification is effective (other than by monitoring the memory usage). Does it show up with the command “cardano-node +RTS --info”, or is that only useful for seeing the default parameters applied when the node was compiled?

I’ve tried adding +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS in my startReayNode1.sh script and it looks like this:

#!/bin/bash
DIRECTORY=/home/cardcasb/cardano-my-node
PORT=6000
HOSTADDR=0.0.0.0
TOPOLOGY=${DIRECTORY}/mainnet-topology.json
DB_PATH=${DIRECTORY}/db
SOCKET_PATH=${DIRECTORY}/db/socket
CONFIG=${DIRECTORY}/mainnet-config.json
/usr/local/bin/cardano-node run +RTS -N2 --disable-delayed-os-memory-return -I0.3 -Iw600 -A16m -F1.5 -H2500M -T -S -RTS --topology ${TOPOLOGY} --database-path ${DB_PATH} --socket-path ${SOCKET_PATH} --host-addr ${HOSTADDR} --port ${PORT} --config ${CONFIG}

But my relay still consumes all of the 8G of RAM and the 4G of swap rather quickly (within a couple of hours). It becomes slow and unresponsive until the cardano-node process listening on port 6000 crashes, and then my relay becomes unreachable.

Am I missing something?


How can I disable the RTS log output? After adding the RTS options to my node runner script, it prints lines like this:
Oct 11 14:08:04 cardano cardano-node[7996]: 1007471536 9553312 11641848 0.135 0.029 65.669 63.605 0 0 (Gen: 0)
I want to remove this output because it clutters my log.


Are you using the -S option between your RTS tags? If so, you can remove it; it only prints garbage collection statistics.


I’m not sure if this is required, but you should put the run argument after the +RTS ... -RTS block. Then you can use top or htop to check that the cardano-node process is indeed running with those parameters.
I have a relay with 8 GB of RAM that has been running just fine for the last 20 days with those parameters, with a max heap of 7.8 GB and 1 GB of swap usage. (I haven’t set up zramswap on that relay.)
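As an alternative to eyeballing htop, a small filter can pull the RTS section out of the running process's command line. This is only a sketch: the function name `rts_args` is made up, and it assumes the options appear as a single `+RTS ... -RTS` block.

```shell
# Hypothetical helper: print the options found between +RTS and -RTS
# on each command line read from stdin.
rts_args() {
  sed -n 's/.*+RTS \(.*\) -RTS.*/\1/p'
}
```

On Linux, `ps -o args= -C cardano-node | rts_args` would then show the flags the node was actually started with.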

Also note that if you restart a node at a certain time in the epoch (I think it’s between +2 days and +3 days), your node will take a long time to start and will use a lot more RAM for a few days…
In any case, with 8 GB of RAM you need a fast swap partition on SSD or NVMe, or the node will block each time the swap has to be used… You can also set up zramswap, but it will degrade the performance and response time of your node if your CPU is not fast enough…


Thanks for your feedback. It has now been stable for the last 2 epochs, and I can confirm that the relay runs stably with the same values you just described (7.8G max heap and around 1.4G of swap). I have to say, though, that when I restarted it 2 epochs ago, at the boundary, it took almost an hour to start, so I don’t want to risk restarting it again. Maybe it was a timing issue and there are indeed bad times to restart.
In your experience, what is the best time to restart the node with minimum impact?
Regarding the position of “run”, it doesn’t seem to make a difference, as the node starts and runs normally in both cases (it shows in htop).
I’m curious to know whether using zram (or zramswap?) is really effective. I increased my swap from 4 to 8G (storage is SSD) along with some other measures while trying to address this issue, such as disabling the TraceMempool parameter and stopping the Grafana/Prometheus processes to save some RAM (I restarted them later with no major noticeable impact, as the node is running fine, as I said).
I’m about to restart it in a few hours. I’ll let you know the outcome.

@Sidou
Regarding restarts, @georgem1976 wrote some interesting info on Telegram.

Basically, when you stop the service, it usually allows 2 seconds for a graceful shutdown; if that doesn’t happen, it hard-kills the service. The Cardano node needs more than 2 seconds, so of course it gets killed. When the node starts up again, it detects that it wasn’t stopped normally and runs a full check before startup, which seems to take time.

To mitigate this, again based on @georgem1976’s info, you need to change those 2 seconds to 120 seconds.

In your service file,
/etc/systemd/system/cardano-node.service
(or whatever the file is called on your system), set:
TimeoutStopSec = 120
The default is 2 seconds, I assume.
After editing it, run:
systemctl daemon-reload


Thanks os11k, that’s very interesting info.
I understand that giving the service more time to stop gracefully will save time during the next startup, but how did he settle on 2 minutes rather than a shorter duration, say 60 seconds? Does the stop wait for the whole 2 minutes, or does the node restart as soon as the termination is successful?
Also, is it worth applying this on the BP node as well?
I’ll give it a try.
Thanks again

You can probably experiment with that value and try 60 or any other value that fits your needs. My understanding is that if stopping takes more than 2 minutes, the service will be killed; otherwise it will not wait the whole 2 minutes, only as long as it needs.

Unfortunately I can’t comment more than that on this topic. I do not use systemctl for my nodes, but rather Docker.


One more thing: it has been implied in other posts that this whole procedure won’t really be effective, because the parameters must be changed at compile time in order to actually apply, especially the multicore option.

Is it true for all parameters or just some of them or not at all?

No, they are wrong, the node is compiled with the multicore option by default (you can check the sources).
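Besides reading the sources, the `+RTS --info` command mentioned earlier in the thread can serve as a quick check. The sketch below assumes that its output contains a line of the form `("RTS way", "rts_thr")` for a threaded (multicore-capable) runtime, which is how GHC normally reports it; the function name is made up.

```shell
# Hypothetical helper: succeed if "+RTS --info" output on stdin reports a
# threaded runtime (an RTS way starting with rts_thr).
is_threaded_rts() {
  grep -q '"RTS way", *"rts_thr'
}
```

Usage would look like `cardano-node +RTS --info | is_threaded_rts && echo threaded`.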

When the node is starting up, at the epoch boundary, or at some other times, it can get busy and will not respond to the kill signal until it has finished what it is doing. The 2-minute timeout tries to prevent killing the node in that condition and avoids a 20-minute startup the next time. Note that it can take much longer than 2 minutes depending on your CPU and server conditions.
Also note that the node must be stopped with a SIGINT and not a SIGTERM.
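For nodes managed by systemd, both points can be captured in a drop-in file. The sketch below only prints the fragment; `KillSignal=` and `TimeoutStopSec=` are standard systemd directives, while the function name, the default of 120 seconds, and the drop-in path in the usage note are assumptions to adapt to your own setup.

```shell
# Hypothetical helper: print a systemd drop-in that stops the node with
# SIGINT and waits up to $1 seconds (default 120) before killing it.
stop_override() {
  printf '[Service]\nKillSignal=SIGINT\nTimeoutStopSec=%s\n' "${1:-120}"
}
```

One way to install it would be `stop_override 120 | sudo tee /etc/systemd/system/cardano-node.service.d/override.conf` (after creating that directory), followed by `systemctl daemon-reload`.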

Just after the epoch boundary is OK until at least a full day into the epoch, then there is a period of 1 to 2 days where you must not restart the node or you will see the problem. This is a rough estimate as this is just something I noticed and I haven’t tried to find the exact interval.


Yes I can see that now and so the other parameters can be modified at run time.

After applying the 120-second modification to both the relay and the BP node, I waited until less than 2 days remained in the epoch (1 day and some hours) to restart first the relay, then the BP, and it took them less than 3 minutes to be up and running and fully connected to peers!!! I don’t know how it was possible! Was it because of the 120 s, the right timing during the epoch, or the fact that I rebooted the node instead of just restarting the cardano service?

As a side note, I just want to mention that before restarting I tried the --nonmoving-gc parameter in the startup script of the BP node in order to reduce the missed slot checks (I wasn’t getting that many, actually). It resulted in higher RAM usage (roughly 3 to 5 G more) and maybe even more missed slots (6 in a couple of hours), so I removed that parameter and the BP went back to its normal heap memory usage (around 6.8 G instead of the 11 it ended up using with that parameter).