Honestly I was just trying it out. The documentation states:
However, it is sometimes beneficial to enable the parallel GC for a single-threaded sequential program too, especially if the program has a large amount of heap data and GC is a significant fraction of runtime. To use the parallel GC in a sequential program, enable the parallel runtime with a suitable `-N ⟨x⟩` option (see "Using Concurrent Haskell" in the GHC 9.8.1 User's Guide), and additionally it might be beneficial to restrict parallel GC to the old generation with `-qg1`.
Even though I don’t know the internals, I was under the impression that due to its sequential nature, cardano-node is not heavily threaded (yet?), which is why I thought maybe -qg1 could be beneficial here. If you have contradicting information, you’re welcome to share it.
Some parts of the node are not multithreaded because the computations are indeed sequential, but it does use several threads. It would be interesting to have a diagram of those threads. What I could observe, though, is that the computation happening at the epoch transition always uses a single core at 100% and is therefore not multithreaded. On the contrary, there are periods in the epoch (I guess it’s when the rewards are computed) when you can see that the node is using all the configured cores at 100%.
It would be interesting to observe the GC’s behavior during the epoch transition.
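One low-effort way to observe that (a sketch; the `run` arguments are placeholders, not the full node command line) is to let the GHC runtime stream per-GC statistics to a log and correlate them with the epoch-transition timestamp:

```
# +RTS -S writes one line of statistics per garbage collection to stderr
cardano-node run ... +RTS -S -RTS 2> gc.log

# -s instead prints only a summary at exit; adding -T additionally makes
# GC statistics available at runtime (e.g. via the node's EKG metrics)
```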
I observed that the Prometheus metric cardano_node_metrics_slotsMissedNum_int is not always correct, so I have written my own (dirty) script. It might help to find the real count of missed slots -
I get an error
./slotmissed.sh: line 7: ((: num = : syntax error: operand expected (error token is "= ")
./slotmissed.sh: line 8: ((: last = : syntax error: operand expected (error token is "= ")
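That error is typical of bash arithmetic evaluation when an expanded variable is empty, e.g. when a grep in the script matched nothing. A hypothetical minimal reproduction and defensive fix (the actual script is not shown here):

```shell
#!/usr/bin/env bash
missed=""                    # e.g. a grep/awk that matched nothing
# (( num = $missed ))        # would fail: "((: num = : syntax error: operand expected"

# Defensive fixes: default empty values to 0 with ${var:-0}, or use the
# bare variable name, which arithmetic evaluation treats as 0 when empty.
num=$(( ${missed:-0} ))
last=$(( missed ))
echo "num=$num last=$last"
```

Guarding every arithmetic expansion this way keeps the script running even when the log lines it parses are temporarily absent.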
The average number of missed slots we observed is ~250 slots/epoch, which increases the risk by 2.5x.
Hi there, do you mean ~250 during the epoch? Or can ~250 at the epoch transition also be normal? I’m a new SPO with very low stake. I have no missed slots during the epoch, but from yesterday to today I had ~200 missed slots in the same timeframe, exactly at the epoch transition… is that normal? Thanks!
I might have another factor/theory that could also cause missed slots on hosted VPSes.
Did you check if you have multiple snapshots running? Snapshots are bad for performance and might cause missed slots as well. I can’t test this, since I don’t have snapshots, but I thought maybe it could be tested by others and we could learn something new. Cheers.
The missed slots are mostly caused by the garbage collector. You can check that by playing with the RTS options to configure the settings of the garbage collector and/or by limiting the memory available to cardano-node to below 8GB (e.g. -M7G).
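For reference, a sketch of how such a limit could be passed (the `run` arguments are placeholders; this assumes the binary was linked with `-threaded` and `-rtsopts`):

```
cardano-node run ... +RTS -M7G -RTS

# or, without touching the command line, via the environment:
export GHCRTS='-M7G'
```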
The defaults work pretty well in 1.29.0, but making all cores available and switching to the newer non-moving mark-and-sweep GC algorithm reduces the missed slots significantly:
export GHCRTS='-N -T -I0 -A16m --disable-delayed-os-memory-return --nonmoving-gc'
Other things that help moderately:
Disable the mempool trace
Raise the minimum CPU frequency if you are using the ondemand governor (which normally is the default)
Forward the socket of cardano-node via ssh to a different machine and run cardano-cli stake snapshot + cncli leaderlog there
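The last point can be sketched like this (hostnames and paths are hypothetical; OpenSSH 6.7 or newer is needed for Unix-socket forwarding):

```
# On the analysis machine: forward a local socket to the producer's node socket
ssh -nNT -L /tmp/node.socket:/opt/cardano/cnode/sockets/node0.socket user@producer &

export CARDANO_NODE_SOCKET_PATH=/tmp/node.socket
cardano-cli query stake-snapshot --stake-pool-id <pool-id> --mainnet
# then run cncli leaderlog here instead of on the block producer
```

This keeps the CPU and memory spike of the snapshot/leaderlog computation off the block-producing machine.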
Is there a way to tell if the -threaded option is set already? I followed the CoinCashew guide. I’ve already added the +RTS -N -RTS options into my startup script, but I want to make sure it’s actually doing something. I did see my number of missed slots go down quite a bit when I moved from 2 vCPU to 4 vCPU, so that makes me think that the settings are working, but I’d like to verify for sure since it seems that many people have gone to 0 missed slot leader checks after making these changes.
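One way to verify (hedged; exact output varies by GHC version): `+RTS --info` prints the RTS build details, and a binary that was not linked with `-threaded` also aborts immediately when given `-N`:

```
# If the output contains ("RTS way","rts_thr_v"), the binary is threaded.
# A non-threaded binary would instead reject your startup flags with:
#   "the flag -N requires the program to be built with -threaded"
cardano-node +RTS --info -RTS
```

So the fact that your node starts at all with `+RTS -N -RTS` already suggests `-threaded` is in effect.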
Please clarify something in my mind. Your script only displays the missed slot checks or does it list the actual slot assigned to the node and missed the minting opportunity for some reason?
I used your script today and I got this:
It only checks the missed slots. To find out if you actually missed a leader slot, cross check the timestamp.
The script helps to find issues upfront so that slots are not missed. In your case, do check why 73 slots were missed… probably you restarted the node.
Thanks for sharing this. The most important parameter in your settings is probably the -xn one, which makes the RTS use the new non-moving garbage collector. However, every time I tried this the node would crash after about 12h to 2 days of running (various kinds of crashes: segmentation faults or RTS internal errors resulting in an abort).
Is your node stable?
The -O parameter is interesting: it would reduce the number of major GCs at startup until the live data reaches this size. This probably improves the node’s startup time and may also prevent the -Iw parameter from triggering useless GCs when not necessary!
I’m not sure about this one; have you tried without it and seen a major difference in the missed slot counter? (Note that you’d want to do your test at the worst possible time for the node: while the rewards are computed.)