Investigating missed slots

It’s the cardano_node_metrics_slotsMissedNum_int metric (only available when your node is running as core).

By the way, I fixed my problem. I reduced the number of relays connected to my block producer and I’m not missing slots!

Glad to hear!

Actually, I looked it up, but there is no cardano_node_metrics_slotsMissedNum_int metric on my BP. Did you change anything in the config file other than the TraceBlockFetchDecisions parameter? Best wishes

I think the metric won’t show until you have at least 1 missed slot. I currently have none, and I don’t see it when I curl localhost:12798/metrics (EKG metrics).

That’s true - the metric only shows up if at least one slot is missed.
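A quick way to verify this is to grep the metrics endpoint and treat an empty result as "no slots missed yet". The sketch below runs against a saved snapshot so it is self-contained; the file name and counter value are made up, and on a live node you would pipe curl -s localhost:12798/metrics (the port mentioned above) into the same grep:

```shell
# Hypothetical metrics snapshot - on a real node this would come from
#   curl -s localhost:12798/metrics
cat > metrics.sample <<'EOF'
cardano_node_metrics_slotNum_int 33000000
cardano_node_metrics_slotsMissedNum_int 3
EOF

# The counter is only present once at least one slot has been missed,
# so an empty grep result is itself meaningful.
grep slotsMissedNum metrics.sample || echo "no slots missed yet"
```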

Hi, can you share how many relays you are allowing now?
I’ve got the same issue with slotsMissed (my current setting is to allow 20 peers). Cheers!

Missed slots only occur on a block-producing node. You shouldn’t have 20 peers on a block producer.

I was using a raspberry pi as my block producer and reducing to only 1 peer helped, but after 20-24 hours I had missed slots again. I ended up using a more powerful server (dedicated 4 cores + 16GB RAM) and I currently have 3 peers connected to my block producer and I never miss a slot anymore.

Thanks for sharing your config.
Sorry … I meant 20 peers to the relay nodes.
I already had the peers to my block producer limited to 3 (since we run 3 relay servers).

My Block Producer was running on 4 CPU / 16GB; I changed it to 8 CPU / 32GB but the results are the same.

I guess I’ll have to dig some more :wink:

Good people out there. Just now I have 80 missed slots on 19h35min runtime. My gut feeling is that’s way too much, but I’d love to see how I’m doing compared to others. Could you share your missed slot numbers for reference?
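For comparison purposes it helps to normalize the raw count to a rate; 80 missed slots over 19 h 35 min works out to roughly 4 per hour (a quick arithmetic check, nothing node-specific):

```shell
# 19 h 35 min = 1175 minutes; divide missed slots by runtime in hours.
awk 'BEGIN { printf "%.2f missed slots per hour\n", 80 / (1175 / 60) }'
# prints: 4.09 missed slots per hour
```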

0 missed slots on almost 44h runtime. Restarted with the epoch switch. Before that 0 missed slots for 5 days.

8 cores / 32 GB ECC RAM / 512 GB NVMe

IMO, it’s not so much the number, but the rate at which it increases over time once the first slots are missed. It’s like the node runs into some critical state where it can no longer go more than about 30 minutes without missing slots.

ATM, the problem goes away with a restart of the BP node, thus resetting from that critical state. As I said, I’ll add another (second) relay later this month. If that does not mitigate the situation, then IMO it makes sense to file an issue on GitHub for cardano-node.

And for 48 hours of runtime, I had 0 missed slots, until this morning. Then I started getting 2 missed slots about every 2 or 3 hours. Two more missed slots a couple of minutes ago, but load averages are at 0 and memory usage looks normal (35%). Cloud metrics don’t show any spikes that coincide with the missed slots. So, not an increasing rate of missed slots. Two relay nodes, BP running at 2 CPU / 16GB RAM. Is that enough, or are the true minimum system requirements higher than that?

I believe the recommendation is still 16GB of RAM, but I would definitely try 4 cores / 16GB to see if you still have missed slots.

Thanks @QCPOLstakepool. I’ll give that a try. However, I’m not sure if I want to pay $120 per month for the cloud service, but it will be interesting to see if it improves. I did turn TraceMempool off and restarted. In the 16 hours since doing that, 6 missed slots.
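For anyone wanting to try the same thing: TraceMempool is a boolean in the node’s configuration file (commonly named mainnet-config.json, though the filename depends on your setup). Turning it off reduces logging overhead, but as far as I know it also stops mempool-related metrics (e.g. pending transaction counts) from being reported:

```json
{
  "TraceMempool": false
}
```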

Has anyone seen Network delay calculation based on logs? Interesting script for getting delay metrics.

Thanks. I’ll look into Contabo. I’m currently running dedicated servers on DigitalOcean. I was influenced by https://docs.cardano.org/getting-started/guidelines-for-large-spos even though I’m a very small SPO, so I thought my nodes should not be on a VPS. However, if a VPS at 8 vCPU / 30 GB RAM is just as good, if not better, than a dedicated 4 CPU / 16 GB RAM server, then it’s time to switch.

Should be enough for one server; use it if you have it!

Except for a 3-minute anomaly with a relay node following a topology updater run (140 missed slots), no missed slots since the 4 CPU resize. Over 24 hours have passed. Will keep monitoring. @Alexd1985, does your core node run on a Contabo VPS? Any issues with missed slots/blocks?

The rest of Sunday and Monday went fine, no missed slots, but then today four missed slots. Still running on 4 CPU / 16GB RAM. Found these log entries at about the two times the slots were missed.

[bp-node:cardano.node.IpSubscription:Error:2086] [2021-06-22 13:22:57.58 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Application Exception: 143.110.219.143:8001 ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply)
[bp-node:cardano.node.ErrorPolicy:Notice:83] [2021-06-22 13:22:57.58 UTC] IP 143.110.219.143:8001 ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply))) 20s
[bp-node:cardano.node.IpSubscription:Notice:87] [2021-06-22 13:22:58.58 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Waiting 0.025s before attempting a new connection
[bp-node:cardano.node.IpSubscription:Notice:12641] [2021-06-22 13:22:58.59 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Connection Attempt End, destination 143.110.219.143:8001 outcome: ConnectSuccessLast
[bp-node:cardano.node.ErrorPolicy:Warning:91] [2021-06-22 13:22:58.63 UTC] IP 143.110.219.143:44883 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError MuxBearerClosed "<socket: 43> closed when reading data, waiting on next header True"))) 20s 20s
[bp-node:cardano.node.ErrorPolicy:Warning:91] [2021-06-22 13:22:58.80 UTC] IP 165.232.132.156:42661 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError MuxBearerClosed "<socket: 27> closed when reading data, waiting on next header True"))) 20s 20s
[bp-node:cardano.node.IpSubscription:Error:4185] [2021-06-22 13:23:42.62 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Application Exception: 165.232.132.156:8001 ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply)
[bp-node:cardano.node.ErrorPolicy:Notice:83] [2021-06-22 13:23:42.62 UTC] IP 165.232.132.156:8001 ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply))) 20s
[bp-node:cardano.node.IpSubscription:Error:87] [2021-06-22 13:23:43.62 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Failed to start all required subscriptions
[bp-node:cardano.node.IpSubscription:Notice:87] [2021-06-22 13:23:53.62 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Waiting 0.025s before attempting a new connection
[bp-node:cardano.node.IpSubscription:Notice:12685] [2021-06-22 13:23:53.70 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Connection Attempt End, destination 165.232.132.156:8001 outcome: ConnectSuccessLast

[bp-node:cardano.node.IpSubscription:Error:12685] [2021-06-22 14:15:36.50 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Application Exception: 165.232.132.156:8001 ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply)
[bp-node:cardano.node.ErrorPolicy:Notice:83] [2021-06-22 14:15:36.50 UTC] IP 165.232.132.156:8001 ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokMustReply))) 20s
[bp-node:cardano.node.IpSubscription:Notice:87] [2021-06-22 14:15:37.50 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Waiting 0.025s before attempting a new connection
[bp-node:cardano.node.IpSubscription:Notice:12894] [2021-06-22 14:15:37.58 UTC] IPs: 0.0.0.0:0 [165.232.132.156:8001,143.110.219.143:8001] Connection Attempt End, destination 165.232.132.156:8001 outcome: ConnectSuccessLast
[bp-node:cardano.node.ErrorPolicy:Warning:91] [2021-06-22 14:16:22.46 UTC] IP 143.110.219.143:33441 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError MuxBearerClosed "<socket: 29> closed when reading data, waiting on next header True"))) 20s 20s

Anything in the relay logs I need to look at?
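As a first pass, it may be worth grepping the relay side for the same error families that appear in the producer's log above. A sketch against a fabricated sample log (the file name is an assumption; on a real box you might pull the log from journalctl instead):

```shell
# Fabricated sample entries mimicking the producer's log above.
cat > relay.log <<'EOF'
[relay:cardano.node.ErrorPolicy:Notice:83] ... ExceededTimeLimit (ChainSync ...)
[relay:cardano.node.ErrorPolicy:Warning:91] ... MuxError MuxBearerClosed ...
[relay:cardano.node.ChainDB:Notice:25] ... Chain extended, new tip ...
EOF

# Count occurrences of the suspicious error families.
grep -cE 'ExceededTimeLimit|MuxBearerClosed|ErrorPolicySuspend' relay.log
# prints: 2
```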

Just to share what I have discovered over the past month of monitoring this problem. I have also been having missed slots showing up in Grafana ever since I started using Grafana reports about 2 months back. I was hoping the issue was a configuration problem, because if it’s not, that would mean a hardware upgrade, which is additional cost. I was running 2 relays and 1 BP, each on 4 vCores and 8GB RAM.

About a month back, with the increasing load on the Cardano network and the upcoming 1.29.0 Alonzo hard fork, I decided to try a hardware upgrade, since none of the config changes made a difference. Note that I was getting 2 to 6 missed slots every 2 to 4 hours. It was fairly consistent and somewhat unacceptable to me. I started by upgrading one of my relay servers to 8 cores and 16GB RAM. That already gave a dramatic improvement, down to 2 missed slots every 4 to 8 hours. But it wasn’t good enough. I next upgraded my BP to the same specs and got 2 missed slots every 24 to 48 hours. Now I have just upgraded all my servers and also migrated to 1.29.0.

Still monitoring, but I’m expecting zero missed slots from now on. It’s unfortunate that the specs and associated costs of running a stake pool are about double what I had originally planned, but I guess it’s worth it. I don’t want to miss a block when the time comes to be assigned one. If any SPO out there is still on a Raspberry Pi setup, it might be a good time to consider a hardware and/or network upgrade. I don’t think the casual setup is going to work for much longer; otherwise, it’s you and your delegators that lose out.

Hope this very limited anecdotal evidence helps.

[Ticker: SGCO]

My 5 cents:

  1. Also, do not allow more than one network switch between one relay and a BP.
  2. If possible - abandon the virtual infrastructure.
  3. Use fast SSD (NVMe) drives.
  4. Use at least 16 GB of RAM per server.
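Point 3 is easy to sanity-check. A crude sequential-write smoke test (fio gives far more realistic numbers; this only catches grossly slow disks) could look like:

```shell
# Write 64 MB and fsync at the end (conv=fdatasync) so the reported rate
# reflects the disk rather than the page cache; run from the node's DB dir.
dd if=/dev/zero of=dd.test bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f dd.test
```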