No more missed slots (during epoch) after changing one setting

Honestly I was just trying it out. The documentation states:

However, it is sometimes beneficial to enable the parallel GC for a single-threaded sequential program too, especially if the program has a large amount of heap data and GC is a significant fraction of runtime. To use the parallel GC in a sequential program, enable the parallel runtime with a suitable -N ⟨x⟩ option (see 5.4. Using Concurrent Haskell in the Glasgow Haskell Compiler 9.8.1 User's Guide), and additionally it might be beneficial to restrict parallel GC to the old generation with -qg1.

Even though I don't know the internals, I was under the impression that, due to its sequential nature, cardano-node is not heavily threaded (yet?) - which is why I thought maybe -qg1 could be beneficial here. If you have contradicting information, you're welcome to share it.
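For reference, setting it could look like this (a hedged sketch; -N4 is only an example, match it to your core count, and the node has to accept RTS options, which the stock build does):

export GHCRTS='-N4 -qg1'
# -qg1: use the parallel GC only in the old generation; the young generation is still collected sequentially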

In any case, the result works as expected :slight_smile:

Some parts of the node are not multithreaded because the computations are indeed sequential, but it does use several threads. It would be interesting to have a diagram of those threads. What I could observe, though, is that the computation happening at the epoch transition always uses a single core at 100% and is therefore not multithreaded. On the contrary, there are periods in the epoch (I guess it's when the rewards are computed) when you can see that the node is using all the configured cores at 100%.
It would be interesting to observe the GC’s behavior during the epoch transition.
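One way to do that (a sketch, assuming the node accepts RTS options, which the stock build does): the -S flag makes the runtime print one line of statistics per collection to stderr, and the elapsed-time column can then be correlated with the epoch transition:

cardano-node run <your usual options> +RTS -N -S -RTS 2> /tmp/gc-stats.log
# each collection prints bytes allocated/copied/live, GC pause time and total elapsed time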

I observed that the prometheus metric cardano_node_metrics_slotsMissedNum_int is not always correct, so I have written my own (dirty) script. It might help to find the real count of missed slots -

missedSlot
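For those who cannot open the link, the idea behind such a script is roughly this (a minimal sketch, not the original; the log path and the exact JSON field layout are assumptions, adjust them to your setup):

#!/bin/bash
# Count gaps between consecutive leadership-check slots in the node's JSON log.
# Assumes the default trace settings, so that every slot logs a
# TraceStartLeadershipCheck line containing a "slot":<number> field.
LOG=/opt/cardano/logs/node.json   # placeholder path
last=0
missed=0
grep -h TraceStartLeadershipCheck "$LOG" | grep -o '"slot":[0-9]*' | cut -d: -f2 |
while read -r slot; do
  if [ "$last" -ne 0 ] && [ "$slot" -gt $((last + 1)) ]; then
    missed=$((missed + slot - last - 1))
    echo "missed $((slot - last - 1)) slot(s) before slot $slot (running total: $missed)"
  fi
  last=$slot
done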


I get an error
./slotmissed.sh: line 7: ((: num = : syntax error: operand expected (error token is "= ")
./slotmissed.sh: line 8: ((: last = : syntax error: operand expected (error token is "= ")

Check whether missedSlots & missedSlot1 are empty. If they are, then probably trace is not enabled, or you are running the script on a relay.

Tracemempool is OFF at my BP

missedSlot is empty
and no “Trace” found in the log file

The script parses the log file, so trace needs to be enabled. Use the default trace settings in mainnet-config… then it should work.
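A quick way to check that against the config (key names as found in the stock mainnet-config.json; as far as I know the leadership-check lines come from the forge tracer):

grep -E '"TraceForge"|"TraceMempool"' mainnet-config.json
# TraceForge is the one the leadership-check lines need and should be true;
# TraceMempool is unrelated to the script and is often turned off for performance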

The average missed slots we observed is ~250 slots/epoch, which increases the risk by 2.5x

Hi there, do you mean ~250 during the epoch? Or can ~250 at the epoch transition also be normal? I'm a new SPO with very low stake. I have no missed slots during the epoch, but from yesterday to today I had ~200 missed slots in the same timeframe, exactly at the epoch transition… is that normal? Thanks!

This (missing slots during epoch changes) is normal with current node software.

Hi, looking at the cardano-node cabal config, those multicore related options seem to be already set?

Yes correct.

I might have another factor/theory that could also cause missed slots on hosted VPSs.

Did you check if you have multiple snapshots running? Snapshots are bad for performance and might cause missed slots as well. I can't test this, since I don't have snapshots, but I thought maybe it could be tested by others and we could learn something new :slight_smile: Cheers.


The missed slots are mostly caused by the garbage collector. You can check that by playing with the RTS options to configure the garbage collector and/or by limiting the memory available to cardano-node to below 8GB (e.g. -M7G).

The defaults work pretty well in 1.29.0, but making all cores available and switching to the newer non-moving mark-and-sweep GC algorithm reduces the missed slots significantly:
export GHCRTS='-N -T -I0 -A16m --disable-delayed-os-memory-return --nonmoving-gc'
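If the node runs as a systemd service, the same variable can go into the unit file (a hedged sketch; the service name is a placeholder for whatever yours is called):

sudo systemctl edit cardano-node.service
# add the following lines to the override file:
# [Service]
# Environment="GHCRTS=-N -T -I0 -A16m --disable-delayed-os-memory-return --nonmoving-gc"
sudo systemctl daemon-reload && sudo systemctl restart cardano-node.service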

Other things that help moderately:

  1. Disable the mempool trace
  2. Raise the minimum CPU frequency if you are using the ondemand governor (which normally is the default)
  3. Forward the socket of cardano-node via ssh to a different machine and run cardano-cli stake snapshot + cncli leaderlog there (a sketch for 2. and 3. follows below)
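A hedged sketch for 2. and 3. (frequency, paths and host name are placeholders; UNIX-socket forwarding needs OpenSSH 6.7 or newer):

# 2) keep the ondemand governor from dropping below a given frequency
sudo cpupower frequency-set --min 2.0GHz

# 3) forward the node socket to another machine and run the heavy queries there
ssh -nNT -L /tmp/node.socket:/opt/cardano/db/node.socket user@block-producer &
export CARDANO_NODE_SOCKET_PATH=/tmp/node.socket
cardano-cli query stake-snapshot --stake-pool-id <POOL_ID> --mainnet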

Is there a way to tell if the -threaded option is set already? I followed the CoinCashew guide. I’ve already added the +RTS -N -RTS options into my startup script, but I want to make sure it’s actually doing something. I did see my number of missed slots go down quite a bit when I moved from 2 vCPU to 4 vCPU, so that makes me think that the settings are working, but I’d like to verify for sure since it seems that many people have gone to 0 missed slot leader checks after making these changes.

You can check the source code:
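Another quick check on the binary itself (assuming it accepts RTS options, which the stock build does since it is compiled with -rtsopts):

cardano-node +RTS --info -RTS | grep '"RTS way"'
# a threaded build reports something like ("RTS way","rts_thr"); a non-threaded
# build would also refuse the -N flag with an error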

Please clarify something in my mind. Does your script only display the missed slot checks, or does it list the actual slots assigned to the node that missed the minting opportunity for some reason?
I used your script today and I got this:
[screenshot of the script's output showing 73 missed slots]
What does it mean?

It only checks the missed slots. To find out if you actually missed a leader slot, cross-check the timestamp.
The script helps to find issues upfront so that slots are not missed, and in your case do check why 73 slots were missed… probably you restarted the node

Yes indeed, that’s the exact time when I restarted the node.
Thanks a lot for your great contributions to the community.


Hey guys,

I've been playing around with RTS params A LOT. I think the following should prevent any missed slots if your HW is good enough:

+RTS -N(num of PHYSICAL CPUs) -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS

I think the magic is in -C0 - this tells GHC to do as MANY context switches as it can and not wait the default 20ms before trying to context switch.

-O4000m sets a minimum size for the old generation, so a major GC is not triggered before the live data reaches 4000 MB. -n1m divides the allocation area into 1 MB chunks.
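Put together in a start script it could look like this (a hedged sketch; paths, port and the -N value are placeholders, with 4 physical cores taken as an example):

#!/bin/bash
# example start script using the RTS flags from above
exec cardano-node run \
  --config /opt/cardano/config/mainnet-config.json \
  --topology /opt/cardano/config/mainnet-topology.json \
  --database-path /opt/cardano/db \
  --socket-path /opt/cardano/db/node.socket \
  --host-addr 0.0.0.0 \
  --port 3001 \
  +RTS -N4 -xn -G2 -T -S -A128m -n1m -C0 -I0.3 -Iw3600 -O4000m -RTS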

Good luck.


Thanks for sharing this. The most important parameter in your settings is probably -xn, which makes the RTS use the new non-moving garbage collector. However, every time I tried this, the node would crash after about 12 hours to 2 days of running (various kinds of crashes: segmentation faults or RTS internal errors resulting in an abort).
Is your node stable?

The -O parameter is interesting; it would reduce the number of major GCs at startup until the live data reaches this size. This probably improves the node's startup time and may also prevent the -Iw parameter from triggering useless GCs when they are not necessary!

About the -C parameter:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/using-concurrent.html

I'm not sure about this one; have you tried without it and seen a major difference on the missed slot counter? (Note that you'd want to do your test at the worst possible time for the node: while the rewards are computed.)