Last week I started seeing my missed-slot count increase in Grafana, and I’m trying to figure out why. Right now my core is missing a slot every 3-4 minutes, at what seems to be a constant rate.
The core has 4 relays connected (in and out): 2 within the same LAN and 2 outside.
Each of the 4 relays has about 20 peers (in and out).
All machines (relays & core) are at ~5-6% CPU and 40% RAM.
Nothing has changed in the setup/configuration for weeks.
The core logs:
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:39.10 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171506e7)]))]
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.LeadershipCheck:Info:118] [2021-03-03 02:16:39.10 UTC] {"credentials":"Cardano","kind":"TraceStartLeadershipCheck","delegMapSize":299892,"slot":23171508,"chainDensity":4.7476634e-2,"utxoSize":1423622}
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
----- SLOT NUMBER 2.3171507e7 SHOULD BE SOMEWHERE HERE BUT IT'S NOT -----
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:39.10 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171508e7)]))]
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.LeadershipCheck:Info:118] [2021-03-03 02:16:40.00 UTC] {"credentials":"Cardano","kind":"TraceStartLeadershipCheck","delegMapSize":299892,"slot":23171509,"chainDensity":4.7476634e-2,"utxoSize":1423622}
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:40.00 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171509e7)]))]
I’m trying to figure out what causes these missing slots.
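A rough sketch for spotting these gaps automatically, assuming the same JSON leadership-check log format as in the excerpt above (adjust the journalctl invocation to however your node is actually run):

```python
# Rough sketch (not part of the node): scan leadership-check log lines for
# gaps in the slot numbers, e.g. the missing 23171507 annotated above.
# Example usage, assuming the node logs to the journal:
#   journalctl -u cardano-node --no-pager | python3 find_slot_gaps.py
import re
import sys

SLOT_RE = re.compile(r'"kind":"TraceStartLeadershipCheck".*?"slot":(\d+)')

previous = None
for line in sys.stdin:
    match = SLOT_RE.search(line)
    if not match:
        continue
    slot = int(match.group(1))
    if previous is not None and slot > previous + 1:
        print(f"gap: no leadership check for slots {previous + 1}..{slot - 1}")
    previous = slot
```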
Actually, I looked it up, but there is no cardano_node_metrics_slotsMissedNum_int metric on my BP. Did you change anything in the config file other than the TraceFetchBlockDecision parameter? Best wishes
I think the metric won’t show until you have at least 1 missed slot. I currently have none, and I don’t see it when I curl localhost:12798/metrics (EKG metrics).
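If it helps, here is a minimal sketch of that check, assuming the node’s metrics endpoint is on the default localhost:12798 as above (adjust host/port to your own config):

```python
# Minimal sketch: check whether the missed-slot counter is exposed yet.
# Assumes the metrics endpoint at localhost:12798 mentioned above;
# the counter only appears once the node has actually missed a slot.
import urllib.request

METRICS_URL = "http://localhost:12798/metrics"
METRIC_NAME = "cardano_node_metrics_slotsMissedNum_int"

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith(METRIC_NAME):
        print("found:", line)
        break
else:
    print(f"{METRIC_NAME} not present yet (no missed slots so far?)")
```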
Hi, can you share how many relays you’re allowing now?
I’ve got the same issue with slotsMissed (my current setting is to allow 20 peers). Cheers!
Missed slots only occur on a block-producing node. You shouldn’t have 20 peers on a block producer.
I was using a Raspberry Pi as my block producer, and reducing to only 1 peer helped, but after 20-24 hours I had missed slots again. I ended up moving to a more powerful server (dedicated 4 cores + 16 GB RAM); I currently have 3 peers connected to my block producer and haven’t missed a slot since.
Thanks for sharing your config.
Sorry … I meant 20 peers to the relay nodes.
I already had the peers to my block producer limited to 3 (since we run 3 relay servers).
My block producer was running on 4 CPU / 16 GB; I changed it to 8 CPU / 32 GB, but the results are the same.
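For comparison, this is roughly what a block producer topology limited to just its own relays looks like (a sketch only; the hostnames and port are placeholders, not our real ones):

```python
# Sketch of a BP topology.json restricted to three relays (legacy topology
# format with a "Producers" list). Hostnames and port are placeholders.
import json

bp_topology = {
    "Producers": [
        {"addr": "relay1.example.internal", "port": 3001, "valency": 1},
        {"addr": "relay2.example.internal", "port": 3001, "valency": 1},
        {"addr": "relay3.example.internal", "port": 3001, "valency": 1},
    ]
}

with open("topology.json", "w") as f:
    json.dump(bp_topology, f, indent=2)
```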
Good people out there: right now I have 80 missed slots over a 19 h 35 min runtime. My gut feeling is that’s way too much, but I’d love to see how I’m doing compared to others. Could you share your missed-slot numbers for reference?
IMO, it’s not so much the number as the rate at which it increases over time once the first blocks are missed. It’s like the node runs into some critical state, after which it can no longer go more than about 30 minutes without missing slots.
ATM, the problem goes away with a restart of the BP node, which resets it out of that critical state. As I said, I’ll add another (second) relay later this month. If that does not mitigate the situation, then IMO it makes sense to file an issue on GitHub for cardano-node.
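To make the rate (rather than the raw total) easier to compare, something like this rough sketch could poll the counter periodically and log only the increments (same assumptions about the metrics endpoint as earlier in the thread):

```python
# Rough sketch: poll the missed-slot counter every 10 minutes and print how
# much it moved, to watch the rate rather than the absolute number.
# Assumes the Prometheus endpoint on localhost:12798 mentioned above.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:12798/metrics"
PATTERN = re.compile(r"^cardano_node_metrics_slotsMissedNum_int\s+(\d+)", re.MULTILINE)

def missed_slots() -> int:
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    match = PATTERN.search(body)
    return int(match.group(1)) if match else 0  # counter absent until the first miss

previous = missed_slots()
while True:
    time.sleep(600)  # check every 10 minutes
    current = missed_slots()
    if current != previous:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp}  missed slots: {current} (+{current - previous})")
        previous = current
```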
And for 48 hours of runtime I had 0 missed slots, until this morning. Then I started getting 2 missed slots about every 2 or 3 hours. Two more missed slots a couple of minutes ago, but load averages are at 0 and memory usage looks normal (35%). Cloud metrics don’t show any spikes that coincide with the missed blocks. So it’s not an increasing rate of missed slots. Two relay nodes, with the BP running at 2 CPU / 16 GB RAM. Is that enough, or are the true minimum system requirements higher than that?
Thanks @QCPOLstakepool, I’ll give that a try. I’m not sure I want to pay $120 per month for the cloud service, but it will be interesting to see if it improves things. I did turn TraceMempool off and restarted; in the 16 hours since that change and the restart, 6 missed slots.
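For anyone trying the same thing, the change is just the TraceMempool flag in the node’s config file (a sketch; the file name here is a placeholder for whatever config your BP actually loads, and the node needs a restart to pick it up):

```python
# Sketch: flip TraceMempool off in the node config. The path is a
# placeholder; point it at the config file your block producer uses.
import json

CONFIG_PATH = "mainnet-config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["TraceMempool"] = False  # mempool tracing is chatty; this is the change discussed above

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```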
Thanks, I’ll look into Contabo. I’m currently running dedicated servers on DigitalOcean. I was influenced by https://docs.cardano.org/getting-started/guidelines-for-large-spos even though I’m a very small SPO, so I thought my nodes should not be on a VPS. However, if a VPS at 8 vCPU / 30 GB RAM is just as good as, if not better than, a dedicated 4 vCPU / 16 GB RAM, then it’s time to switch.
Except for a 3-minute anomaly on a relay node following a topology updater run (140 missed slots), there have been no missed slots since the 4 CPU resize. Over 24 hours have passed; I’ll keep monitoring. @Alexd1985, does your core node run on a Contabo VPS? Any issues with missed slots/blocks?