Solving the Cardano node's huge memory usage - done

The moment when you reboot the node shouldn't matter … I'm rebooting the relays once every 12 hours and I haven't seen any difference …

I guess you’re not rebooting all of your relays at the same time.
Do you reboot them to apply the topology update or for another reason?
Are you doing it manually or automatically through a script?

Of course not all at the same time. I restart them for the topology; I know it's too often, but this operation is not affecting my pool… and yes, the reboot happens automatically.

Thank you for sharing… I imagined something like that would have to be done to solve this problem.
The real problem here is the use of shitty languages that have a GC… good old C would never have created such an issue!

@AlterX I guarantee we have both seen misuse of malloc :wink:

Every language has trade-offs between its strengths and weaknesses.

1 Like

Quite different… a misuse of malloc only happens when the developer is not experienced, but if they are, the software is solid. On the other hand, with such toys you don't have any choice. In my opinion, a server with real-time characteristics and network-intensive work should never even be imagined being designed with a GC… for a desktop or simple app it is fine, though.

For those using cntools, did you update the ${CPU_RUNTIME} var in cnode.sh? I assume that the “Do NOT modify code below” comment in the file means that anything below will be written over in an update.

Edit: Here’s the line in cnode.sh that I’m referring to:

[[ -n ${CPU_CORES} ]] && CPU_RUNTIME=( "+RTS" "-N${CPU_CORES}" "-RTS" ) || CPU_RUNTIME=()
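For illustration, here is a minimal, standalone sketch of what that line does (CPU_CORES=2 is just an assumed example value; how cnode.sh then passes the array to cardano-node is not shown here):

# Sketch: if CPU_CORES is set, CPU_RUNTIME becomes an RTS flag array, otherwise it stays empty.
CPU_CORES=2
[[ -n ${CPU_CORES} ]] && CPU_RUNTIME=( "+RTS" "-N${CPU_CORES}" "-RTS" ) || CPU_RUNTIME=()
echo "${CPU_RUNTIME[@]}"   # prints: +RTS -N2 -RTS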

Well you can give them a bit of credit and the benefit of the doubt at least :wink:
They have made it that far with the full project, so there might be reasons we are not aware of/exposed to.

Hey guys,

I updated my relays to 1.31.0. Both are VPS: one is a 4-core with 8 GB of RAM from G-Core Labs and the other an 8-core with 16 GB of RAM from Digital Ocean.

The node with 8 GB was already at 99% memory usage with node 1.30.1, and during the update I realized that it was not syncing anymore. It recovered from time to time, but it didn't really work.

The 16 GB node @ 1.31.0 was at 76.7% of memory usage.

So, I found this interesting topic about this problem, and I also checked the linked Google doc.

If I'm not mistaken, QXT pool had the solution; the operator called it the Holy Grail, and I couldn't agree more :blush:

After I used the QXT pool settings, the memory usage of the 8 GB node dropped to 76% and the 16 GB node to 39.7%.

I even used these settings on my bare-metal production server with 64 GB of RAM, and the memory usage of that node dropped from 17.2% to 10.8%.

The 8 GB 4-core node is running with these options:

+RTS -N2 --disable-delayed-os-memory-return -A64m -F1.2 -qg1 -n4m -M8000M -RTS

And the 16 GB 8-core node with these:

+RTS -N4 --disable-delayed-os-memory-return -A64m -F1.2 -qg1 -n4m -M8000M -RTS

I use CNTOOLS on the production node and I modified line 12 in cnode.sh like so:

CPU_CORES=4

And line 28 like so:

[[ -n ${CPU_CORES} ]] && CPU_RUNTIME=( "+RTS" "-N${CPU_CORES}" "--disable-delayed-os-memory-return" "-A64m" "-F1.2" "-qg1" "-n4m" "-M8000M" "-RTS" ) || CPU_RUNTIME=()
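As a quick sanity check (a sketch only; adjust the process name if your binary is launched differently), you can confirm that the running node actually picked up those flags:

# Sketch: print the RTS arguments of the running cardano-node process,
# i.e. everything between +RTS and -RTS on its command line.
ps -o args= -C cardano-node | tr ' ' '\n' | sed -n '/^+RTS$/,/^-RTS$/p'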

For now, everything works like a charm, thank you for your contributions and kudos to QXT pool.
Good luck with your pools and happy minting.

1 Like

Thanks for sharing your parameters. Are you monitoring the number of missed slots on your producer with these settings? -M8000M implicitly sets the -c30 flag, which switches to a compacting GC for the oldest generation once 30% of the 8000 MB limit is reached; that is the slowest way to do garbage collection (the node can freeze for more than 10 s during such a collection).
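For what it's worth, the RTS also accepts an explicit percentage for -c, so one could in principle keep the hard limit while delaying compaction. A hedged sketch only; the 90% threshold is just an illustration, not a tested recommendation:

# Sketch: same 8000 MB hard limit, but only switch to the (slow) compacting
# collector once live data reaches 90% of that limit instead of the default 30%.
+RTS -N2 --disable-delayed-os-memory-return -A64m -F1.2 -qg1 -n4m -M8000M -c90 -RTS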

2 Likes

Thanks to this post by weebl2000 I’ve discovered a new parameter that I missed before: -O.

With this and -F it should be possible to set a soft limit on the maximum memory usage:
-O sets a minimum threshold that needs to be reached before the older generations are collected.
This has two effects:

  • You can limit the amount of useless GC at startup and thus decrease the node's startup time.
  • If set properly, it prevents useless major GCs that would reclaim no memory and only pause the world and waste precious time.

With the -F parameter you can limit the growth of the heap after a major GC, thus preventing the useless heap increases that are the main issue discussed in this topic.

The problem now is to find the proper value to use with -O. Basically, this should be the maximum amount of useful live data needed by cardano-node to run, plus a margin. This value can be found by surveying the maximum amount of live data right after a major GC.
For example, right now this value is around 3100 MB, so -O4G would be a good starting point.

I will do some tests and report here what I find by next week. I want to see the maximum memory usage occurring during the reward computations later in the epoch.

This would result in this kind of parameter set:

-N -I0.1 -Iw7200 -A128m -n4m -F1.2 -H6G -O4500M -T -S

We could even imagine setting -F to less than 1 to reclaim memory… (untested)
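To survey that maximum live data value in practice, something like this could do (a sketch, assuming the node runs with -S and its stderr is redirected to a file; gc-stats.log is an assumed name):

# Sketch: the 3rd column of the -S output is the live bytes after each GC;
# keep the largest value seen on major-GC lines ("Gen:  1").
grep 'Gen:  1' gc-stats.log | awk '{ if ($3+0 > max) max = $3+0 } END { print "max live bytes after major GC:", max }'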

2 Likes

GHC will never return megablocks to the OS.

Hey everyone, thanks for the updates and test permutations so far. I am running 10+ relays and am also seeing variations in memory usage across the fleet. Like others, I am running on 16 GB machines which are approaching >90% usage (including other services). @_2072 I am keen to see the results of your tests and would like to help test some configurations.

Memory usage percentiles for only the cardano-node process (version 1.30.1):
image

Well presently I’m testing with these 2 settings:

on 8GB relays:

-N -I0.1 -Iw7200 -A128m -n4m -F0.7 -H3500M -O4500M -T -S

The lower-than-1 value for -F prevents the heap from increasing and should keep it just under 8 GB. I'm also testing with -F0.9 -O4000M on other 8 GB relays.

on my 16GB producer, I’m testing with these settings:

-N -I0.1 -Iw42000 -A128m -F0.5 -H3500M -O8000M -T -S

The idea is to let the heap grow very large (above 8 GB) and only keep half of the heap (since 8 GB is more than twice the node's actual needs). I've been running the producer like this for almost 5 hours and no major GC has happened yet, so I need to wait longer to see what happens after a major GC (does the RTS try to allocate 16 GB to copy the still-referenced live data, or just what's necessary? Does it manage to relinquish memory to the OS with -F0.5?)

1 Like

I am testing those RTS options on my setup (BP and relay, both with 8 cores and 32 GB RAM); however, during a 24 h window I got:

bp
Missed slot leader checks : 710 (0.8679 %)
GC Minor : 1295
GC Major : 22

relay
GC Minor : 1674
GC Major : 21

Are you using -xn or --disable-delayed-os-memory-return for your setup? How is the test going?
Screen Shot 2021-11-23 at 17.47.55

1 Like

I'm not using -xn as it is bugged (even with GHC 8.10.7). I do use --disable-delayed-os-memory-return, but it's enabled anyway at compile time (so one can't really disable it, as there is no counter-option).

For now the results are mixed. I think I stumbled onto yet another weirdness/bug of the RTS:

  • The relays with -F0.9 -O4000M are performing the best, with their resident memory remaining just below 8 GB at ~7500 MB for 33 hours and their maximum useful live data at 3,634,448,384 bytes, with the GC taking 'only' 20% of the CPU time.

  • The producer node, however (the one with -F0.5 -H3500M -O8000M), behaved very well until the first major GC: no missed slots for 12 hours, only 5% of CPU wasted by the GC, and 2 missed slots on the first major GC (to be expected, unfortunately), but then things got weird:


There are many metrics in this (cropped) screenshot; the ones of interest here are: Up: (node uptime), rM: (resident memory), H: (heap), mH: (max heap size), L: (live data), ML: (max live data), MuL: (maximum useful live data), Missed Slots: (partly cropped in this screenshot), GCs: (#-#-#%) where the first number is the number of major GCs, the second the number of minor GCs, and the percentage is the 'inefficiency' of the GC. Most of these stats are sampled or computed from the EKG every 2 seconds, and there are 10 seconds between each output of these stats.

When the 1st full GC happened, the expected amount of memory was deducted from the live data (down from 7,917,599,368 to 3,365,067,040 bytes) and the heap was increased to 12 GB (from 9 GB), but during the next 20 seconds that memory was "reclaimed" and the live data increased back to 7,741,731,392… Afterwards, major GCs were performed every hour or less, since the live data was always around the limit (with 1 or 2 missed slots at almost every major GC). Each time, the live data would drop to the MuL value but would rapidly come back up, either almost immediately, like in the above screenshot, or after a few minutes, but every time in just 20 seconds…

So after these initial 12 hours of normal and expected behavior (garbage slowly accumulating until it reaches the limit set by -O8000M or the -Iw42000 timeout), the RTS began behaving strangely, negating the gain -O could have provided… Yet another bug of GHC's RTS. But it is to be expected, as the RTS is coded in C and these combinations of parameters are probably mostly untested…

After this failure I made another test removing most of the other parameters (-I, -Iw, -H, ending with just -A128m -F0.3 -O8000M -T -S), but the results were the same after the first major GC.

The GC statistics output at exit revealed nothing special concerning the wasted memory:

image

The only positive point in this is that if you want your node to be GC-free for 12 hours, you can use -O8000M so that the GC will leave your node alone until that limit is reached; you can then restart the node before the limit is hit, in between your block production schedule…
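For example, a scheduled restart could be as simple as a cron entry (a sketch only: the cnode.service unit name is an assumption, and the time must obviously be chosen outside your leader schedule):

# Sketch: restart the node every day at 03:30, i.e. before the -O8000M limit is reached.
30 3 * * * /usr/bin/systemctl restart cnode.service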

So, quite disappointing, not to mention that, as @weebl2000 said, GHC is completely unable to release memory to the OS… (I've started another test with -F0.3 -O8000M to see what happens.) In GHC 9 they introduced a dynamic reduction of the factor set with -F so that GHC can release memory to the OS, so maybe they changed/fixed something in GHC 9, or maybe it doesn't work either.

3 Likes

For those who are interested, the strange behavior of the -O8000M flag looks like this in the stderr output provided by the -S flag:

    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
    bytes     bytes     bytes   user   elap     user     elap

546401512   4195584 7919906032  0.030  0.008 1772.332 41817.107    0    0  (Gen:  0)
566599128   2553464 7919730032  0.023  0.006 1772.670 41817.411    0    0  (Gen:  0)
567433568   1209344 7919723256  0.021  0.005 1772.967 41817.687    0    0  (Gen:  0)
586252456 234519784 8154491520  2.085  0.522 1775.349 41818.499    0    0  (Gen:  0)
569601944 3110288912 3365067040  8.369  2.467 1784.642 41821.225    0    0  (Gen:  1)
583965072 121505184 3487937976  0.912  0.228 1785.933 41821.769    0    0  (Gen:  0)
565142704 241901168 3729586760  1.517  0.383 1787.793 41822.467    0    0  (Gen:  0)
588137112 223099840 3954810320  1.231  0.310 1789.316 41823.054    0    0  (Gen:  0)
545991120 136143472 4096223352  0.827  0.208 1790.549 41823.645    0    0  (Gen:  0)
546306712 136377912 4237857344  0.593  0.158 1791.496 41824.131    0    0  (Gen:  0)
549509520 134826448 4377880856  0.942  0.237 1792.854 41824.729    0    0  (Gen:  0)
550873784 137740712 4520975648  0.938  0.232 1794.177 41825.341    0    0  (Gen:  0)
547948896 137470128 4663772120  0.988  0.248 1795.563 41825.977    0    0  (Gen:  0)
548701672 136285912 4805398376  0.619  0.155 1796.583 41826.520    0    0  (Gen:  0)
545828256 137430056 4948093408  0.779  0.197 1797.800 41827.158    0    0  (Gen:  0)
550208424 136993648 5090488816  0.588  0.149 1798.785 41827.675    0    0  (Gen:  0)
545375232 134978760 5230862040  0.810  0.217 1799.967 41828.250    0    0  (Gen:  0)
543640608 135605056 5371843120  0.911  0.229 1801.207 41828.808    0    0  (Gen:  0)
548709272 138269440 5515398144  0.496  0.126 1802.046 41829.275    0    0  (Gen:  0)
551048856 140144744 5660803424  0.568  0.146 1802.966 41829.754    0    0  (Gen:  0)
544582280 136162448 5802302720  0.647  0.162 1803.952 41830.259    0    0  (Gen:  0)
544042480 137070016 5944710768  0.473  0.119 1804.743 41830.694    0    0  (Gen:  0)
550192016 140055504 6090013632  0.672  0.168 1805.761 41831.197    0    0  (Gen:  0)
553013192 131865824 6227102208  0.595  0.149 1806.785 41831.694    0    0  (Gen:  0)
551344288 141332000 6373395280  0.608  0.156 1807.710 41832.161    0    0  (Gen:  0)
549306136 135529720 6514178424  0.747  0.173 1808.878 41832.716    0    0  (Gen:  0)
547653744 137011528 6656399144  0.657  0.174 1809.886 41833.222    0    0  (Gen:  0)
547552624 133739128 6795501984  0.508  0.127 1810.800 41833.719    0    0  (Gen:  0)
550859832 213176248 7009933192  1.086  0.273 1812.159 41834.265    0    0  (Gen:  0)
560284400 240660968 7250600304  1.560  0.389 1813.975 41834.914    0    0  (Gen:  0)
563213776 240866184 7491532160  1.272  0.321 1815.676 41835.591    0    0  (Gen:  0)
565326416 249643008 7741731392  2.143  0.543 1818.372 41836.646    0    0  (Gen:  0)
552418672  28531440 7772869896  0.113  0.037 1820.075 41866.114    0    0  (Gen:  0)
542119680   3411208 7776548600  0.022  0.012 1821.958 41915.766    0    0  (Gen:  0)
541381936   3822440 7779850624  0.019  0.007 1823.767 41957.435    0    0  (Gen:  0)
541029448   4167272 7784305360  0.017  0.005 1825.714 42018.423    0    0  (Gen:  0)
537969352   4950168 7788617160  0.033  0.010 1827.265 42071.112    0    0  (Gen:  0)
541487272   4154264 7791133680  0.024  0.011 1828.716 42118.538    0    0  (Gen:  0)

You can see that the first major GC gets back down to 3365067040 bytes of live data and then, in less than 15 s (from 41821 s to 41836 s), it reallocates all the freed memory in chunks of around 130 MB. I don't know what it means; 130 MB is probably linked to -A128m… An edge case of the RTS tripping over this combination of parameters?
Or is there a change of behavior in cardano-node after about 12 hours of uptime that creates those periodic very large memory allocations? It would be a strange coincidence but it’s possible…

It would be interesting if someone with a lot of memory could set -O20G, for example (that should give about 24 hours without a major GC), and see if these large allocations happen before the first major GC (about 3.5 GB allocated in 15 s or less).

This behavior happens either right after a major GC or later, such as 1 hour later.
I’ve also tried to use a very high -O value and a -Iw timeout so that the timeout would trigger before the -O limit but the same thing happened.

My other attempts with a very low -F flag did not yield any conclusive results (no freed memory).

The only useful takeaway for now, for producer nodes, is that you can set the -O flag to a very high value so that the GC leaves you alone, and then periodically restart the node, for example one hour before a scheduled block, to make sure you will not be bothered (or use more complex strategies tied to EKG telemetries).
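For the EKG-based variant, something along these lines could be a starting point (a sketch under assumptions: the default EKG port 12788 and the standard ekg rts.gc metric names; check your node's actual endpoint and metric paths before relying on it):

# Sketch: read the current live-data gauge from the node's EKG endpoint;
# a restart script could compare this against a threshold.
curl -s -H 'Accept: application/json' http://127.0.0.1:12788/ | jq '.rts.gc.current_bytes_used.val'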

I'm doing other tests with relays and other combinations of flags involving -O; I'll post the results once I have something interesting.

1 Like

After 12h of uptime, the node begins to perform incremental ledger snapshot calculations every 72 mins. This is why you see large spikes in memory utilization (and normally a few major garbage collections if you don’t use a ridiculously high -O value). See here for more details:

1 Like

My pleasure. Yes, I am monitoring, but I found out after 2 days that my VPS relays had availability issues due to these parameters, so I rolled back the settings to

+RTS -N4 -RTS

only just before I had to mint a block, to make sure not to miss one. The memory usage improvement is huge, though. I need to do further testing.

Do you have a reference for this 12 h delay?
In your particular case you might also have storage I/O latency issues, because for my producer the only cause of missed slots is major GCs causing a stop-the-world pause longer than 1 s (you can see the pause duration in the GC elap column of the -S output). This happens every time the old generation becomes too large and the amount of memory to copy is bigger than a certain threshold (determined by the hardware: CPU/RAM speed).
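To spot those pauses in the -S output (a sketch, assuming the node's stderr goes to a gc-stats.log file; the file name is an assumption):

# Sketch: the 5th column of the -S output is the elapsed (wall-clock) GC time;
# print every collection that paused the node for more than 1 second.
awk '$5+0 > 1.0 { print "GC pause of", $5, "s at TOT elap", $7, "s" }' gc-stats.log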

I’m going to try the following experiments:

  • Running a producer with a very large -O, almost as big as memory + swap, and a large nursery allocation area (-A256m or more) to limit GC promotion as much as possible, plus -I0 (no periodic major GC). I want to see how things behave when the garbage is put into swap instead of being collected…
  • Still using -O, increasing the number of generations (to limit the amount of RAM to analyze in each generational garbage pass) with an -F close to 1, and still without periodic GC, which would trigger an undue analysis of all the generations…
1 Like