Investigating missed slots

Hello everyone,

Last week I started seeing my missed slots count increase in Grafana and I’m trying to figure out why. Right now, my core is missing a slot every 3-4 minutes. This seems to be a constant rate.

  1. The core has 4 relays connected to it (in and out): 2 within the same LAN and 2 outside.
  2. Each of the 4 relays has ~20 peers (in and out).
  3. All machines (relays & core) are at ~5-6% CPU and ~40% RAM.
  4. Nothing has changed in the setup/configuration for weeks.

The core logs:

Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:39.10 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171506e7)]))]
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.LeadershipCheck:Info:118] [2021-03-03 02:16:39.10 UTC] {"credentials":"Cardano","kind":"TraceStartLeadershipCheck","delegMapSize":299892,"slot":23171508,"chainDensity":4.7476634e-2,"utxoSize":1423622}
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []

----- SLOT NUMBER 2.3171507e7 SHOULD BE SOMEWHERE HERE BUT IT'S NOT -----

Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:39.10 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171508e7)]))]
Mar 03 02:16:39 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:39.10 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.LeadershipCheck:Info:118] [2021-03-03 02:16:40.00 UTC] {"credentials":"Cardano","kind":"TraceStartLeadershipCheck","delegMapSize":299892,"slot":23171509,"chainDensity":4.7476634e-2,"utxoSize":1423622}
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.ForgeTime:Info:118] [2021-03-03 02:16:40.00 UTC] fromList []
Mar 03 02:16:40 cardano-mainnet-s01-c01 bash[166580]: [cardano-:cardano.node.Forge:Info:118] [2021-03-03 02:16:40.00 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 2.3171509e7)]))]

I’m trying to figure out what causes these missing slots.
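In case it helps anyone reproduce this, here’s a rough sketch of how the gap can be spotted programmatically: it reads journalctl output and reports non-consecutive slot numbers in the TraceStartLeadershipCheck lines. The regex assumes the log format shown above, so adapt it (and the service name) to your own setup.

import re
import sys

# Slot numbers are taken from the TraceStartLeadershipCheck lines,
# as in the journal excerpt above.
SLOT_RE = re.compile(r'"kind":"TraceStartLeadershipCheck".*?"slot":(\d+)')

def find_slot_gaps(lines):
    # Yield (previous_slot, next_slot) pairs whenever slot numbers are not consecutive.
    prev = None
    for line in lines:
        m = SLOT_RE.search(line)
        if m is None:
            continue
        slot = int(m.group(1))
        if prev is not None and slot != prev + 1:
            yield prev, slot
        prev = slot

if __name__ == "__main__":
    # Usage (adjust the unit name): journalctl -u cardano-node | python3 find_slot_gaps.py
    for prev, cur in find_slot_gaps(sys.stdin):
        print(f"{cur - prev - 1} slot(s) missing between {prev} and {cur}")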

Anyone experienced this?

If it can help, here’s a screenshot of my Grafana dashboard… All systems seem stable, but the missed slots count increases…

s01-r01: 4 cores / 6 GB
s01-r02: 8 cores / 8 GB
s01-c01: 4 cores / 4 GB

Sorry, I can’t really help, but may I ask how you calculate the missed slots in Grafana? Thanks

It’s the cardano_node_metrics_slotsMissedNum_int metric (only available when your node is running as a core).

By the way, I fixed my problem. I reduced the number of relays connected to my block producer and I’m no longer missing slots!
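For anyone wondering what that change looks like in practice, here’s a purely illustrative sketch (addresses, ports and relay choices are placeholders, not my actual hosts): the block producer’s topology ends up listing only its own relays, written here with a short Python snippet.

import json

# Hypothetical example: a block producer that only connects out to its two LAN relays.
bp_topology = {
    "Producers": [
        {"addr": "10.0.0.11", "port": 3001, "valency": 1},  # relay s01-r01 (LAN)
        {"addr": "10.0.0.12", "port": 3001, "valency": 1},  # relay s01-r02 (LAN)
    ]
}

with open("mainnet-topology.json", "w") as f:
    json.dump(bp_topology, f, indent=2)

How much inbound traffic you still allow to the producer is a separate firewall/relay-side decision.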

Glad to hear!

Actually, I looked it up, but there is no cardano_node_metrics_slotsMissedNum_int metric on my BP. Did you change anything in the config file other than the TraceFetchBlockDecision parameter? Best wishes

I think the metric won’t show until you have at least 1 missed slot. I currently have none and I don’t see it when I curl localhost:12798/metrics (the Prometheus metrics endpoint).

That’s true - the metric only shows up if at least one slot is missed.
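So if you script around it, it’s safer to treat an absent metric as zero. A minimal sketch (port 12798 is the Prometheus port mentioned above; adjust to your own config):

import urllib.request

METRICS_URL = "http://localhost:12798/metrics"  # Prometheus endpoint; change if yours differs
METRIC = "cardano_node_metrics_slotsMissedNum_int"

def read_missed_slots(url=METRICS_URL):
    # Return the missed-slot counter, or 0 if the metric has not appeared yet.
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    for line in body.splitlines():
        if line.startswith(METRIC):
            return int(float(line.split()[-1]))
    return 0

if __name__ == "__main__":
    print("missed slots:", read_missed_slots())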

Hi, can you share how many relays you are allowing now?
I’ve got the same issue with slotsMissed (my current setting is to allow 20 peers). Cheers!

Missed slots only occur on a block-producing node. You shouldn’t have 20 peers on a block producer.

I was using a Raspberry Pi as my block producer and reducing to only 1 peer helped, but after 20-24 hours I had missed slots again. I ended up moving to a more powerful server (dedicated, 4 cores + 16 GB RAM); I currently have 3 peers connected to my block producer and I never miss a slot anymore.

Thanks for sharing your config.
Sorry … I meant 20 peers on the relay nodes.
I already had the peers to my block producer limited to 3 (since we run 3 relay servers).

My block producer was running on 4 CPU / 16 GB; I changed it to 8 CPU / 32 GB, but the results are the same.

I guess I’ll have to dig some more :wink:

Good people out there: right now I have 80 missedSlots after 19h35min of runtime. My gut feeling is that’s way too much, but I’d love to see how I’m doing compared to others. Could you share your missed-slot numbers for reference?

0 missed slots on almost 44h runtime. Restarted with the epoch switch. Before that 0 missed slots for 5 days.

8 cores / 32 GB ECC RAM / 512 GB NVMe

IMO, it’s not so much the number as the rate at which it increases once the first slots are missed. It’s as if the node runs into some critical state in which it can no longer go more than about 30 minutes without missing slots.

ATM, the problem goes away with a restart of the BP node, which resets it out of that critical state. As I said, I’ll add another (second) relay later this month. If that does not mitigate the situation, then IMO it makes sense to file an issue on GitHub for cardano-node.
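One way to catch that critical state early is to sample the counter periodically and watch the delta rather than the total. A rough sketch along those lines, again assuming the Prometheus port mentioned earlier in the thread:

import time
import urllib.request

URL = "http://localhost:12798/metrics"  # Prometheus endpoint; adjust to your config
METRIC = "cardano_node_metrics_slotsMissedNum_int"

def missed_slots():
    body = urllib.request.urlopen(URL, timeout=5).read().decode("utf-8")
    for line in body.splitlines():
        if line.startswith(METRIC):
            return int(float(line.split()[-1]))
    return 0  # the metric is absent until the first miss

# Print how many new slots were missed in each 30-minute window; a delta that
# keeps growing after a long flat period suggests the node has entered that state.
prev = missed_slots()
while True:
    time.sleep(1800)
    cur = missed_slots()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp}  +{cur - prev} missed in the last 30 min (total {cur})")
    prev = cur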

And I had 0 missed slots for 48 hours of runtime, until this morning. Then I started getting 2 missed slots about every 2 or 3 hours. Two more missed slots a couple of minutes ago, but load averages are at 0 and memory usage looks normal (35%). Cloud metrics don’t show any spikes that coincide with the missed slots. So, not an increasing rate of missed slots. Two relay nodes, BP running at 2 CPU / 16 GB RAM. Is that enough, or are the true minimum system requirements higher than that?

I believe the recommendation is still 16 GB RAM, but I would definitely try 4 cores / 16 GB to see if you still get missed slots.

Thanks @QCPOLstakepool. I’ll give that a try, although I’m not sure I want to pay $120 per month for the cloud service; it will be interesting to see if it improves, though. I did turn TraceMempool off and restarted. In the 16 hours since doing that and the restart: 6 missed slots.

Has anyone seen “Network delay calculation based on logs”? Interesting script for getting delay metrics.
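I haven’t tried that script, but the basic idea (how late a log line arrives relative to its slot’s nominal start) can be sketched in a few lines. This assumes mainnet after the Shelley hard fork, i.e. absolute slot 4492800 starting at 2020-07-29 21:44:51 UTC with 1-second slots, and the same journal format as the excerpts earlier in the thread:

import re
import sys
from datetime import datetime, timedelta, timezone

# Mainnet reference point (assumed): absolute slot 4492800 began at
# 2020-07-29 21:44:51 UTC, with 1-second slots from then on.
SHELLEY_START = datetime(2020, 7, 29, 21, 44, 51, tzinfo=timezone.utc)
SHELLEY_SLOT = 4492800

LINE_RE = re.compile(r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC\].*?"slot":(\d+)')

def slot_start(slot):
    # Nominal wall-clock start of an absolute slot (mainnet, post-Shelley).
    return SHELLEY_START + timedelta(seconds=slot - SHELLEY_SLOT)

def delays(lines):
    # Yield (slot, seconds between the slot's nominal start and the log timestamp).
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        logged = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=timezone.utc)
        slot = int(m.group(2))
        yield slot, (logged - slot_start(slot)).total_seconds()

if __name__ == "__main__":
    # Usage (adjust the unit name): journalctl -u cardano-node | python3 slot_delay.py
    for slot, d in delays(sys.stdin):
        print(f"slot {slot}: logged {d:+.2f}s after its nominal start")

Against the leadership-check lines in the log excerpt at the top of the thread this comes out to roughly +0.1 s, so those checks themselves are not running late there.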

Thanks. I’ll look into Contabo. I’m currently running dedicated servers on DigitalOcean. I was influenced by https://docs.cardano.org/getting-started/guidelines-for-large-spos even though I’m a very small SPO, so I thought my nodes should not be on a VPS. However, if a VPS at 8 vCPU / 30 GB RAM is just as good as (if not better than) a dedicated 4 vCPU / 16 GB RAM, then it’s time to switch.

Should be enough for one server; use it if you have it!

Except for a 3-minute anomaly with a relay node following a topology updater run (140 missed slots), no missed slots since the 4 CPU resize. Over 24 hours have passed; will keep monitoring. @Alexd1985, does your core node run on a Contabo VPS? Any issues with missed slots/blocks?