No more missed slots (during epoch) after changing one setting

Honestly I was just trying it out. The documentation states:

However, it is sometimes beneficial to enable the parallel GC for a single-threaded sequential program too, especially if the program has a large amount of heap data and GC is a significant fraction of runtime. To use the parallel GC in a sequential program, enable the parallel runtime with a suitable -N ⟨x⟩ option (see 5.4. Using Concurrent Haskell in the Glasgow Haskell Compiler 9.8.1 User's Guide), and additionally it might be beneficial to restrict parallel GC to the old generation with -qg1.

Even though I don't know the internals, I was under the impression that, due to its sequential nature, cardano-node is not heavily threaded (yet?) - which is why I thought maybe -qg1 could be beneficial here. If you have contradicting information, you're welcome to share it.
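For reference, setting it could look like this (a hedged sketch; -N4 is only an example, match it to your core count, and the node has to accept RTS options, which the stock build does):

export GHCRTS='-N4 -qg1'
# -qg1: use the parallel GC only in the old generation; the young generation is still collected sequentially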

In any case, the result works as expected :slight_smile:

Some parts of the node are not multithreaded because the computations are indeed sequential, but it does use several threads. It would be interesting to have a diagram of those threads. What I could observe, though, is that the computation happening at the epoch transition always uses a single core at 100% and is therefore not multithreaded. On the contrary, there are periods in the epoch (I guess it's when the rewards are computed) when you can see that the node is using all the configured cores at 100%.
It would be interesting to observe the GC’s behavior during the epoch transition.
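One way to do that (a sketch, assuming the node accepts RTS options, which the stock build does): the -S flag makes the runtime print one line of statistics per collection to stderr, and the elapsed-time column can then be correlated with the epoch transition:

cardano-node run <your usual options> +RTS -N -S -RTS 2> /tmp/gc-stats.log
# each collection prints bytes allocated/copied/live, GC pause time and total elapsed time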

I observed that the prometheus metric cardano_node_metrics_slotsMissedNum_int is not always correct, so I have written my own (dirty) script. It might help to find the real count of missed slots -

missedSlot
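For those who cannot open the link, the idea behind such a script is roughly this (a minimal sketch, not the original; the log path and the exact JSON field layout are assumptions, adjust them to your setup):

#!/bin/bash
# Count gaps between consecutive leadership-check slots in the node's JSON log.
# Assumes the default trace settings, so that every slot logs a
# TraceStartLeadershipCheck line containing a "slot":<number> field.
LOG=/opt/cardano/logs/node.json   # placeholder path
last=0
missed=0
grep -h TraceStartLeadershipCheck "$LOG" | grep -o '"slot":[0-9]*' | cut -d: -f2 |
while read -r slot; do
  if [ "$last" -ne 0 ] && [ "$slot" -gt $((last + 1)) ]; then
    missed=$((missed + slot - last - 1))
    echo "missed $((slot - last - 1)) slot(s) before slot $slot (running total: $missed)"
  fi
  last=$slot
done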


I get an error
./slotmissed.sh: line 7: ((: num = : syntax error: operand expected (error token is "= ")
./slotmissed.sh: line 8: ((: last = : syntax error: operand expected (error token is "= ")

Check whether missedSlots & missedSlot1 are empty. If they are, then probably trace is not enabled, or you are running the script on a relay.

Tracemempool is OFF at my BP

missedSlot is empty
and no “Trace” found in the log file

The script parses the log file, so trace needs to be enabled. Use the default trace settings in mainnet-config… then it should work.
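A quick way to check that against the config (key names as found in the stock mainnet-config.json; as far as I know the leadership-check lines come from the forge tracer):

grep -E '"TraceForge"|"TraceMempool"' mainnet-config.json
# TraceForge is the one the leadership-check lines need and should be true;
# TraceMempool is unrelated to the script and is often turned off for performance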

The average missed slots we observed is ~250 slots/epoch, which increases the risk by 2.5x

Hi there, do you mean ~250 during the epoch? Or can ~250 at the epoch transition also be normal? I'm a new SPO with very low stake. I have no missed slots during the epoch, but from yesterday to today I had ~200 missed slots in the same timeframe, exactly at the epoch transition… is that normal? Thanks!

This (missing slots during epoch changes) is normal with current node software.

Hi, looking at the cardano-node cabal config, those multicore related options seem to be already set?

Yes correct.

I might have another factor/theory that could also cause missed slots on hosted VPSs.

Did you check if you have multiple snapshots running? Snapshots are bad for performance and might cause missed slots as well. I can't test this, since I don't have snapshots, but I thought maybe it could be tested by others and we could learn something new :slight_smile: Cheers.


The missed slots are mostly caused by the garbage collector. You can check that by playing with the RTS options to configure the garbage collector and/or by limiting the memory available to cardano-node to below 8GB (e.g. -M7G).

The defaults work pretty well in 1.29.0, but making all cores available and switching to the newer non-moving mark-and-sweep GC algorithm reduces the missed slots significantly:
export GHCRTS='-N -T -I0 -A16m --disable-delayed-os-memory-return --nonmoving-gc'
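If the node runs as a systemd service, the same variable can go into the unit file (a hedged sketch; the service name is a placeholder for whatever yours is called):

sudo systemctl edit cardano-node.service
# add the following lines to the override file:
# [Service]
# Environment="GHCRTS=-N -T -I0 -A16m --disable-delayed-os-memory-return --nonmoving-gc"
sudo systemctl daemon-reload && sudo systemctl restart cardano-node.service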

Other things that help moderately:

  1. Disable the mempool trace
  2. Raise the minimum CPU frequency if you are using the ondemand governor (which normally is the default)
  3. Forward the socket of cardano-node via ssh to a different machine and run cardano-cli stake snapshot + cncli leaderlog there (a sketch for 2. and 3. follows below)
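A hedged sketch for 2. and 3. (frequency, paths and host name are placeholders; UNIX-socket forwarding needs OpenSSH 6.7 or newer):

# 2) keep the ondemand governor from dropping below a given frequency
sudo cpupower frequency-set --min 2.0GHz

# 3) forward the node socket to another machine and run the heavy queries there
ssh -nNT -L /tmp/node.socket:/opt/cardano/db/node.socket user@block-producer &
export CARDANO_NODE_SOCKET_PATH=/tmp/node.socket
cardano-cli query stake-snapshot --stake-pool-id <POOL_ID> --mainnet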

Is there a way to tell if the -threaded option is set already? I followed the CoinCashew guide. I’ve already added the +RTS -N -RTS options into my startup script, but I want to make sure it’s actually doing something. I did see my number of missed slots go down quite a bit when I moved from 2 vCPU to 4 vCPU, so that makes me think that the settings are working, but I’d like to verify for sure since it seems that many people have gone to 0 missed slot leader checks after making these changes.

You can check the source code:
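Another quick check on the binary itself (assuming it accepts RTS options, which the stock build does since it is compiled with -rtsopts):

cardano-node +RTS --info -RTS | grep '"RTS way"'
# a threaded build reports something like ("RTS way","rts_thr"); a non-threaded
# build would also refuse the -N flag with an error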

Please clarify something in my mind. Does your script only display the missed slot checks, or does it list the actual slots assigned to the node that missed the minting opportunity for some reason?
I used your script today and I got this:
[screenshot of the script's output showing 73 missed slots]
What does it mean?

It only checks the missed slots. To find out if you actually missed a leader slot, cross-check the timestamp.
The script helps to find issues upfront so that slots are not missed, and in your case do check why 73 slots were missed… probably you restarted the node

Yes indeed, that’s the exact time when I restarted the node.
Thanks a lot for your great contributions to the community.


Hey guys,

I've been playing around with RTS params A LOT. I think the following should prevent any missed slots if your HW is good enough:

+RTS -N(num of PHYSICAL CPUs) -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS

I think the magic is in -C0 - this tells GHC to do as MANY context switches as it can and not wait the default 20ms before trying to context switch.

-O4000m sets a minimum size for the old generation, so a major GC is not triggered before the live data reaches 4000 MB. -n1m divides the allocation area into 1 MB chunks.
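Put together in a start script it could look like this (a hedged sketch; paths, port and the -N value are placeholders, with 4 physical cores taken as an example):

#!/bin/bash
# example start script using the RTS flags from above
exec cardano-node run \
  --config /opt/cardano/config/mainnet-config.json \
  --topology /opt/cardano/config/mainnet-topology.json \
  --database-path /opt/cardano/db \
  --socket-path /opt/cardano/db/node.socket \
  --host-addr 0.0.0.0 \
  --port 3001 \
  +RTS -N4 -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS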

Good luck.


Thanks for sharing this. The most important parameter in your settings is probably -xn, which makes the RTS use the new non-moving garbage collector. However, every time I tried this, the node would crash after about 12 hours to 2 days of running (various kinds of crashes: segmentation faults or RTS internal errors resulting in an abort).
Is your node stable?

The -O parameter is interesting; it would reduce the number of major GCs at startup until the live data reaches this size. This probably improves the node's startup time and may also prevent the -Iw parameter from triggering useless GCs when they are not necessary!

About the -C parameter:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/using-concurrent.html

I'm not sure about this one; have you tried without it and seen a major difference on the missed slot counter? (Note that you'd want to do your test at the worst possible time for the node: while the rewards are computed.)