Thanks for the extraordinary research.
From my testing on my hardware, the nonmoving-gc does not seem to cause any missed slot checks until memory usage gets high, whereas the copying gc often, but not always, results in 1-2 missed slots whenever it runs.
I can let the block producer node run with the nonmoving-gc for a couple of days with no missed slot checks. However, the system eventually starts running slower, presumably because the nonmoving-gc doesn’t manage the memory as well (memory use higher and maybe more fragmented?).
For example, I have been letting my Intel Xeon E-2276ML node with 2 vCPUs, 16GB RAM and 16GB swap run for 3 days, without restart, since before the epoch transition. It only got missed slots during the epoch transition and none since.
However, it now seems to be running slowly: I just re-ran a leaderlog for the current epoch and this caused 15 missed slot checks, whereas running a leaderlog does not normally result in missed slot checks on my setup.
The node is running with “+RTS -C0 -N -I0 --nonmoving-gc -RTS”
Memory usage is high but stable:
Tue 21 Dec 2021 10:04:57 (0.5 days after start)
cardano-node +RTS -C0 -N -I0 --nonmoving-gc -RTS
              total        used        free      shared  buff/cache   available
Mem:       16393504    15851756      161420          20      380328      262232
Swap:      17039352    13271360     3767992
Thu 23 Dec 2021 12:44:11 (2.5 days after start)
              total        used        free      shared  buff/cache   available
Mem:       16393504    13827724      173360          20     2392420     2276652
Swap:      17039352    13417816     3621536
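To put numbers on "high but stable", here is a minimal Python sketch that turns the two `free` snapshots above into used-percentages for RAM and swap. The helper names (`parse_free`, `used_percent`) are made up for illustration, and it assumes the figures are in KiB, which is the default unit for `free`.

```python
def parse_free(text):
    """Parse the Mem: and Swap: rows of `free` output into dicts of ints."""
    rows = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        label = label.strip()
        if label in ("Mem", "Swap") and rest.strip():
            fields = [int(x) for x in rest.split()]
            # The first three columns are total, used, free for both rows.
            rows[label] = {"total": fields[0], "used": fields[1], "free": fields[2]}
    return rows


def used_percent(row):
    """Share of the total that is in use, as a percentage."""
    return 100.0 * row["used"] / row["total"]


# The two snapshots from this post (values copied verbatim).
SNAPSHOTS = {
    "0.5 days": """\
total used free shared buff/cache available
Mem: 16393504 15851756 161420 20 380328 262232
Swap: 17039352 13271360 3767992
""",
    "2.5 days": """\
total used free shared buff/cache available
Mem: 16393504 13827724 173360 20 2392420 2276652
Swap: 17039352 13417816 3621536
""",
}

if __name__ == "__main__":
    for when, text in SNAPSHOTS.items():
        rows = parse_free(text)
        print(
            f"{when}: RAM used {used_percent(rows['Mem']):.1f}%, "
            f"swap used {used_percent(rows['Swap']):.1f}%"
        )
```

On these two snapshots this gives roughly 97% and 84% of RAM in use, with swap at about 78% to 79% both times, so most of the swap has been in use since the first half day.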
In summary: I think that running with nonmoving-gc means that ledger snapshots and Haskell garbage collections don’t cause missed slots. However, the trade-off is that memory usage is higher (and possibly more fragmented?), and this can eventually result in the node running slower and missing slot checks later if put under additional load.
By the way, running with nonmoving-gc does not result in crashes on my servers.