OK, new discovery: the ~130MB allocations I observed above are indeed related to the -A flag and are in fact explained by -AL ⟨size⟩: it corresponds to the large object allocation area (which is global and independent of the number of CPUs set with -N, contrary to -A). It defaults to the -A value.
I could indeed confirm that, as @orpheus-ant explained, every 72 minutes the node will allocate about 2.5GB of memory, triggering many minor GCs and potentially a major GC if it manages to exhaust the older generation.
There is no RTS malfunction with the -O flag, just a coincidence (as explained above by @orpheus-ant )
So I am now trying with -AL set to N*A. In my case (since -N4) this should divide the number of minor GCs during these big memory allocations by 4, remove useless generation promotions, and enhance the performance of the process during these snapshots!
I’ve just restarted my producer so I will report in 15 hours… (I’m using -N4 --disable-delayed-os-memory-return -A256M -AL1G -n16m -F1 -O18G -M20G -Mgrace=1G -C0 -c99 -T -S in this test - I could confirm that -Mgrace allows the node to exit gracefully in case of heap exhaustion set with -M)
I’m not sure about the 12 hours, never heard that before. But the 72 mins is accurate. Apart from that a snapshot is taken at least every 50,000 blocks.
4320 seconds is actually the security parameter times two. (k=2160)
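That arithmetic, spelled out (nothing node-specific here, just the numbers):

```shell
# The 72-minute snapshot interval is 2 * k seconds, with k = 2160.
k=2160
interval=$((2 * k))                               # seconds
echo "$interval seconds = $((interval / 60)) minutes"
```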
The 12h was just a reference to your post. Ledger snapshot calculations are not happening immediately after node startup, there is a delay. Not sure exactly how this delay from startup is calculated though.
Regarding missed slot leader checks, major GCs are a major contributor but not the only one. At least in my case, during a snapshot event, I have more missed slot leader checks than seconds spent doing GC. You can check whether that is true in your case too (it only started since epoch 304). If it is, you are suffering from the same issue, even though GC is not the only reason. If it is not, then your CPU may be fast enough to handle everything fine on a single core. Mine is not; I am running on embedded bare-metal energy-efficient systems (with fast SSDs, so I/O is not causing any delay).
Regardless, the node could use some improvement regarding multi-core efficiency, now that CPU utilization is non-trivial, hence my GitHub issue.
@_2072: sounds good, give us an update on how this goes. Not sure why you set -F1 though. Is this your way of saying “ignore -F and trigger major GCs only based on -O”?
@weebl2000: the 12h mark was a reference to @_2072’s post. I just know there is a delay before snapshot events start. If you have any info on how that is calculated precisely, please share.
Yes, normally -F is ignored until -O is reached; afterward, F × (size of live data at last garbage collection) is used as the trigger if it is bigger than -O. Basically, with -F1 I'm telling the RTS that once the -O limit has been reached there is no need to allocate more than we already have to delay the next GC…
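My reading of that rule, as a quick sketch (sizes in megabytes; the live-data figure is made up for illustration):

```shell
# Once -O has been reached, the next major-GC trigger is roughly
# max(O, F * live_data_at_last_GC). With -F1 the factor adds nothing
# on top of the live data itself.
O=18432                      # -O18G expressed in MB
F_times_live=19000           # hypothetical: 1 * live data after a snapshot
trigger=$(( F_times_live > O ? F_times_live : O ))
echo "next major GC around ${trigger}M"
```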
Anyway, I need to make another try without the -M option because it seems that something else is triggering a GC well below the -O limit (at exactly half of -M). The documentation says that action is taken when the -M limit is "approaching"…
I could confirm that the snapshot does not trigger any missed slots as long as no major GC is performed on all the 'garbage' it creates.
The -AL1G flag does seem to improve things (fewer minor GCs, less promotion), but the main issue remains… (the 2.5GB of data generated by each snapshot still has to be collected eventually)
You can get the specifics in the DiskPolicy that I linked in my previous post.
It boils down to:
On first start, a snapshot is taken as soon as the node sees 2160 (k) “blocksSinceLast”.
This means the number of immutable blocks that had to be applied to get to the chain tip.
After that, if we have made a snapshot already since last node startup it is either:
the time since the last snapshot (since startup) has passed (default 72 min), OR enough blocks have passed since the last snapshot (hard-coded 50,000) AND those 50,000 blocks have not passed within 6 minutes (i.e. initial sync is happening; don't snapshot more often than once every 6 minutes).
So, hypothetically: ledger snapshot + immutable DB are in perfect sync on startup, it will take k blocks before a snapshot happens. Or around 12 hours with current settings.
After a first snapshot is made (a maximum of about 10 hrs in), it defaults to snapshotting every 72 minutes. If you have decent CPU/memory to replay blocks, it might be worth trying to set SnapshotInterval to 43200, which would mean it always takes around 12 hrs between snapshots.
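The decision logic above, restated as a sketch (the function and variable names are mine, not the actual DiskPolicy code; times in seconds):

```shell
# Hypothetical restatement of the snapshot rules described above;
# not the actual Haskell DiskPolicy implementation.
should_snapshot() {
  local secs_since_last=$1 blocks_since_last=$2
  # Time-based rule: snapshot every 72 minutes (4320 s) by default.
  if [ "$secs_since_last" -ge 4320 ]; then echo yes; return; fi
  # Block-based rule: 50,000 blocks since the last snapshot, rate-limited
  # to at most one snapshot per 6 minutes (360 s) during initial sync.
  if [ "$blocks_since_last" -ge 50000 ] && [ "$secs_since_last" -ge 360 ]; then
    echo yes; return
  fi
  echo no
}
should_snapshot 4000 60000   # yes (block rule)
should_snapshot 100 60000    # no  (rate-limited during sync)
should_snapshot 5000 10      # yes (time rule)
```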
There is another thing to try if someone else has time: add a high -H together with -O so that the allocation area can be very big and even contain the full snapshot when it is created. This would avoid promoting temporary snapshot data only for it to be collected during the next major GC. Something like this:
-N4 -I0 -A256m -AL1G -n16M -F0.5 -H8000M -O8000M -T -S (for a server with 4 cores and 16GB of RAM). -F0.5 will prevent the heap from growing too much.
I’ll try this myself in a few days but for now I’m testing with -A256M -AL1G -n16m -G3 -F1.5 -H4G -O4G -M14G -Mgrace=1G -C0 -c99
Here is the content of that answer re-posted in the hope that contributing will help the search for a fix to this problem:
The things I have found:
-I0
(Default garbage collector with idle garbage collection off):
With this set I don’t get missed slots at all for many hours and then eventually the garbage collector has to run and copy a lot of memory and then I get many missed slots all at once.
-I0.3 -Iw600
(Default garbage collector with regular collector sweeps with minimum 10mins between sweeps when processor idle > 0.3 secs)
This results in few missed slots, which occur when the garbage collector runs. However, because collection happens more regularly, less memory is copied each time and the pauses are shorter, so there are fewer missed slots on each run. But for my setup there were still a couple of missed slots every 10 mins when the GC ran.
--nonmoving-gc
I still get occasional missed slots. Memory use seems to gradually increase over time and the node eventually uses all available RAM (16GB) and starts using swap. After running 20hrs, rts_gc_peak_megabytes_allocated is 24G, which seems excessive. The node appears to be running OK with almost no missed slots (4 in 20hrs).
If you run a small pool with limited processor power/speed then one option is to do the following:
Work out when your allocated slots are by running cncli-leaderlog ahead of time. Then plan to restart your block producer cardano-node service 20 minutes before you are due to produce your block. This way you can be sure your memory is fresh and garbage collection won’t be necessary for a while just at the time when you are due to produce your block.
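A minimal sketch of that timing calculation, assuming GNU date; where the slot time comes from (a cncli leaderlog run) and how you trigger the restart are left as placeholders:

```shell
# Compute a restart time 20 minutes before an assigned slot.
slot_time_utc="2021-12-25 14:30:00"        # example slot time from a leaderlog
slot_epoch=$(date -u -d "$slot_time_utc" +%s)
restart_epoch=$((slot_epoch - 20 * 60))
date -u -d "@$restart_epoch" '+restart cardano-node at %Y-%m-%d %H:%M:%S UTC'
# One could then schedule e.g. `systemctl restart cardano-node`
# at that moment via `at` or a systemd timer.
```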
Overall, throwing faster hardware at the problem is an easy fix. Faster processor means memory copying is quicker and garbage collection takes less time so missed slots are eliminated.
Update 2021-12-21:
After lots of testing I have concluded that the only RTS option which stops my node from missing slot leader checks is --nonmoving-gc. However, the problem with this setting is that memory is never released back to the OS with GHC version 8.10.7. Unfortunately, we need to wait until cardano-node can be compiled with GHC 9.2.x or later.
What I currently use is:
+RTS -N --nonmoving-gc -RTS
As I understand things, the other settings compiled in (set by IOG in the source code) are also enabled. These include: “-I0 -A16m”. You can check what your binary has with:
cardano-node +RTS --info
The output for my version currently is:
Then I restart cardano-node every few days at a convenient time between block production.
Or check your memory usage periodically with: ‘free’
And restart cardano-node when you get close to exhausting your RAM, or after this but before too much swap is used.
If you just let cardano-node keep running using the nonmoving-gc then eventually all the RAM and swap is used and the OS kills the process using its out of memory reaper. I believe this is because GHC version 8.10.7 doesn’t return the collected memory to the OS.
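A small helper along those lines, assuming Linux (/proc/meminfo); the 2GB threshold and the restart action are arbitrary placeholders:

```shell
# Warn when available memory drops below a threshold, as a cue to
# restart cardano-node at a convenient moment before swap fills up.
threshold_kb=$((2 * 1024 * 1024))   # 2 GB, arbitrary example value
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$threshold_kb" ]; then
  echo "low memory: ${avail_kb} kB available, consider restarting cardano-node"
else
  echo "ok: ${avail_kb} kB available"
fi
```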
Addition 2021-12-21:
OK now I am not sure about the last bit. Memory is eventually returned to the OS when using nonmoving-gc. Here are some outputs of ‘free’ while running continuously with:
‘+RTS -C0 -N -I0 --nonmoving-gc -RTS’
Tue 21 Dec 2021 10:04:57
total used free shared buff/cache available
Mem: 16393504 15851756 161420 20 380328 262232
Swap: 17039352 13271360 3767992
Tue 21 Dec 2021 22:02:44
total used free shared buff/cache available
Mem: 16393504 15325260 648972 20 419272 786832
Swap: 17039352 11186532 5852820
Note that 12hrs has elapsed and RAM and Swap usage has reduced a little.
Addition: I am running tests on two different machines:
First is:
Intel(R) Xeon(R) E-2276ML CPU @ 2.00GHz
With 6 cores and 2 threads per core = 12 CPUs
64Gb total RAM
Second is:
ARM Cortex-A72 @ 2.00GHz
With 16 cores and 1 thread per core = 16 CPUs
32Gb total RAM (Will upgrade to 64Gb soon)
In both instances, I am running cardano-node in kvm virtual machines with 16Gb RAM and 16Gb swap. The ARM VMs have 4 vCPUs whereas the Intel Xeon VMs have 2 vCPUs.
From all my tests, the best settings for 8GB relays are: -N2 -I0.1 -Iw3600 -A64M -AL128M -n4m -F1.1 -H3500M -O3500M -S
-I0.1 -Iw3600 turns on the idle GC, triggered at most once per hour as soon as the process has been idle for 0.1s. This is useful because it prevents major GCs from being caused only by heap exhaustion, which is guaranteed to increase the heap size (total memory usage) every time it happens (depending on the -F setting being used)
-A64M -AL128M sets the total allocation area to 128M (for N=2; if N=4 it would be -A32M -AL128M). You want this allocation area to be small so that most of the memory is available for generation 1. -AL is useful to optimize the use of the allocation area when big allocations are made; it should be set to A times N (it defaults to A, which is suboptimal for parallel processes)
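The arithmetic behind that "A times N" rule of thumb:

```shell
# -AL should be N * A so the shared large-object area matches the
# combined size of the per-capability nurseries.
N=2; A=64                       # sizes in MB
echo "-A${A}M -AL$((N * A))M for -N${N}"
N=4; A=32
echo "-A${A}M -AL$((N * A))M for -N${N}"
```

Both configurations end up with the same 128M total, just split across a different number of capabilities.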
-n4m divides the allocation area into smaller 4 megabytes blocks (this setting does not have much impact with N=2 and could probably be left out)
-F1.1 limits the growth of the memory when a GC is caused by heap exhaustion; this is key to limiting the increase of memory usage. This setting works because of the high -O, where 10% of 3500M is already 350M. Using values lower than 1 with -F is not effective and can lead to "GC loops" where the node spends more time doing garbage collection than executing code.
-H3500M enables a dynamic allocation area until the amount of live data reaches 3500M. This helps limit the number of GCs (minor and major), especially at startup, because less data will have to be promoted from generation 0 to generation 1 (since generation 0 GCs will be rarer).
-O3500M delays a GC of generation 1 until it holds this amount of data. This setting, combined with -F1.1, limits the growth of the total heap size. Memory will keep increasing, but very slowly (you can have a relay run for more than 30 days without problems)
With the above settings, the memory footprint stays very close to 8GB with a minimum amount of swapping. GC pauses are usually between 1s and 2s which is acceptable for a relay.
Note that, currently, if you restart a relay within the reward computation period it will not behave well and use a bigger memory footprint compared to a relay started before…
To further limit the memory usage you can increase the ledger snapshot interval as @weebl2000 suggested (12 hours is OK). This is important for very cheap relays such as Raspberry Pis, because they are really slow and may end up becoming unresponsive for tens of seconds while they swap to execute their GC…
The important thing to monitor, with the -S setting, is the time spent doing GCs (the 5th column). Ideally these pauses should be less than a second (mandatory for a producer, or you will miss a leader check every time one is >1s)
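One way to watch that from a captured -S log (the sample lines below are made up for illustration; this assumes the usual -S layout, where column 4 is GC CPU time and column 5 is GC elapsed time):

```shell
# Print GC pauses longer than one second from an RTS -S log.
cat > rts-s.log <<'EOF'
268435456  12345678  987654321  0.420  1.350  100.0  120.0
268435456   2345678  987654321  0.050  0.080  101.0  121.5
EOF
# Skip header/summary lines by requiring a numeric 5th field.
awk '$5 ~ /^[0-9.]+$/ && $5 + 0 > 1.0 {print "long GC pause: " $5 "s"}' rts-s.log
```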
Here I use a very big -O (and so I can use -F < 1); since the machine has 16GB of RAM, I use a bigger allocation area. I've also set a maximum heap (-M23G -c99 -Mgrace=1G), not so much because I fear the heap will reach this limit, but because one of the side effects of setting -M is that it triggers a major GC as soon as M/2 of live data is reached (11.5GB in this case), which makes for a predictable major GC and limits the potential growth of the heap.
Anyway, those settings allow the node to run without doing a single major GC until the first ledger snapshot is made: approximately 11 hours, or approximately 12 hours if the node is restarted after it has made at least one ledger snapshot (snapshots are the files located in the ledger sub-directory of the database directory).
I've made it so that my management software monitors the snapshots, knows when a block is due, and restarts the node accordingly, to make sure that assigned blocks fall within this 11-hour window of tranquility.
The GHC 9.2.1 release notes say that many improvements were made to the parallel GC, which means this missed-slot problem may go away once the node can be built with GHC 9. IOG also says that they made big improvements to the ledger memory footprint in version 1.33 of cardano-node. So things are going to improve anyway.
I’ve made some other experiments with the non-moving GC (-xn) but, as I said before, this setting is not reliable and may crash the node or use an enormous amount of memory (especially when the parallel GC cannot keep up with the amount of garbage being produced).
In my previous posts I mentioned other experiments with more than 2 generations, but these were not successful. The problem remains the same: the periodic ledger snapshots generate several gigabytes of garbage each time, which will have to be collected eventually. There is no way around that; we can either delay the issue or mitigate it as much as possible, but this is really an optimization issue for the cardano-node developers to work on (which they may have done in version 1.33, but since it's not released I haven't tested it yet).
My missed slot leader checks have been very high. I made the basic "+RTS -N -RTS" change on my BP and will monitor during this epoch 310…
In the last 15 mins of epoch 307 my missed leader checks went from 164 to 467. Could not figure out what was going on. My BP was just upgraded to an i9 with 16 cores and 64GB RAM…
Everything you have been posting should help my situation…
Wait for the next release (1.33.0) it will use less RAM/CPU
From discord
The keen-eyed among you may have spotted a new node version (1.33.0) which we tagged today. I’d like to thank everyone in the community involved in testing the release candidates on the public testnet.
Our SRE team is slowly rolling this out to our own mainnet nodes.
HOWEVER, please note this is NOT being recommended for general usage just yet, so we are not officially releasing this on GitHub for now. We are confident in the code, and benchmarking suggests significantly improved speed and performance (including a reduction in memory and much more consistent CPU usage during the reward calculation window).
However, issues can arise and since we won’t have the full DevOps /engineering teams working over the holiday period, we plan to make this an official release - and support it officially – from early January. Thanks for your understanding and patience on this! And have a very Merry Christmas and a Happy New Year!
The results of these settings and the research behind it are both extraordinary, and my explicit thanks is long overdue for all the work you’ve done on this issue this year
From what I see with testing on my hardware, the nonmoving-gc seems to not cause missed slot checks at all, until memory usage gets high. Whereas the copying gc results in 1-2 missed slots often, but not always, whenever it runs.
I can just let the block producer node run with the nonmoving-gc for a couple of days with no missed slot checks. However the system starts running slower presumably because the nonmoving-gc doesn’t manage the memory as well (memory use higher and maybe more fragmented?).
For example, I have been letting my Intel Xeon E-2276ML node with 2 vCPUs, 16Gb RAM, 16GB swap run for 3 days, without restart, since before the epoch transition. It only got missed slots during the epoch transition and none since.
However, it now seems to be running slowly because I just re-ran a leaderlog for current epoch and this caused 15 missed slot checks. Whereas running a leaderlog does not normally result in missed slot checks on my setup.
The node is running with “+RTS -C0 -N -I0 --nonmoving-gc -RTS”
Memory usage is high but stable:
Tue 21 Dec 2021 10:04:57 (0.5 days after start)
cardano-node +RTS -C0 -N -I0 --nonmoving-gc -RTS
total used free shared buff/cache available
Mem: 16393504 15851756 161420 20 380328 262232
Swap: 17039352 13271360 3767992
Thu 23 Dec 2021 12:44:11 (2.5 days after start)
total used free shared buff/cache available
Mem: 16393504 13827724 173360 20 2392420 2276652
Swap: 17039352 13417816 3621536
In summary: I think that running with nonmoving-gc means that ledger snapshots and haskell garbage collections don’t cause missed slots. However, the trade off is that memory usage is higher (and possibly more fragmented?) and this eventually can result in the node running slower and missing slot checks later if put under additional load.
By the way, running with nonmoving-gc does not result in crashes on my servers.