Further update on the slot issue. Thanks to the folks at the xSPO alliance I was able to gather more information and we might have been able to pinpoint CPU spikes and logs that corollate to either a dropped slot or high CPU usage
using a cronjob and mpstat I was able to log each of my 8 cores every second. Then thanks to some scripts written I can pinpoint the second I lost a slot. Here is the Raw data from a slot that I missed
Timestamp 12:05:24 is when the leadershipCheck did not send
Aug 02 00:05:22 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:22.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296431,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
Aug 02 00:05:23 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:23.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296432,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:25.11 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296434,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
Aug 02 00:05:26 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:26.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296435,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
So here is where it gets interesting with the cores again same time stamp
12:05:21 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:22 AM all 1.37 0.00 0.87 0.00 0.00 0.25 0.00 0.00 0.00 97.51
12:05:22 AM 0 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
12:05:22 AM 1 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
12:05:22 AM 2 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:22 AM 3 0.97 0.00 1.94 0.00 0.00 0.00 0.00 0.00 0.00 97.09
12:05:22 AM 4 9.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 90.00
12:05:22 AM 5 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:22 AM 6 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
12:05:22 AM 7 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99
12:05:22 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:23 AM all 2.63 0.00 0.50 0.00 0.00 0.63 0.00 0.00 0.00 96.25
12:05:23 AM 0 0.00 0.00 0.97 0.00 0.00 3.88 0.00 0.00 0.00 95.15
12:05:23 AM 1 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
12:05:23 AM 2 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:23 AM 3 8.16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 91.84
12:05:23 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:23 AM 5 9.90 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 89.11
12:05:23 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:23 AM 7 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
12:05:23 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:24 AM all 34.74 0.00 0.50 0.00 0.00 0.37 0.00 0.00 0.00 64.38
12:05:24 AM 0 68.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 32.00
12:05:24 AM 1 64.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35.35
12:05:24 AM 2 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 97.06
12:05:24 AM 3 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:24 AM 4 11.88 0.00 1.98 0.00 0.00 0.00 0.00 0.00 0.00 86.14
12:05:24 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:24 AM 6 66.67 0.00 0.98 0.00 0.00 0.00 0.00 0.00 0.00 32.35
12:05:24 AM 7 67.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33.00
12:05:24 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:25 AM all 25.94 0.00 0.25 0.00 0.00 0.50 0.00 0.00 0.00 73.32
12:05:25 AM 0 51.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 48.00
12:05:25 AM 1 51.96 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.00 47.06
12:05:25 AM 2 0.00 0.00 0.00 0.00 0.00 2.97 0.00 0.00 0.00 97.03
12:05:25 AM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 6 51.52 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 47.47
12:05:25 AM 7 53.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 47.00
12:05:25 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:26 AM all 2.50 0.00 0.87 0.00 0.00 0.25 0.00 0.00 0.00 96.38
12:05:26 AM 0 7.07 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 91.92
12:05:26 AM 1 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
12:05:26 AM 2 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:26 AM 3 0.00 0.00 1.98 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:26 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:26 AM 5 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
12:05:26 AM 6 8.91 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 90.10
Second 23 and previous is low CPU usage and for 2 seconds we have high usages across the cores the second 26 and onward we are back to low usage.
I then look at the cardano node logs and come across something interesting
Aug 02 00:05:23 prod01 cardano-node[525026]: [prod01:cardano.node.Forge:Info:70] [2021-08-02 00:05:23.00 UTC] fromList [(“val”,Object (fromList [(“kind”,String “TraceNodeNotLeader”),(“slot”,Number 3.6296432e7)])),(“credentials”,String “Cardano”)]
Aug 02 00:05:25 prod01 cardano-node[525026]: 94174488 2040688456 2100863184 4.646 1.194 6023.167 104493.225 0 0 (Gen: 1)
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:25.11 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296434,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.Forge:Info:70] [2021-08-02 00:05:25.12 UTC] fromList [(“val”,Object (fromList [(“kind”,String “TraceNodeNotLeader”),(“slot”,Number 3.6296434e7)])),(“credentials”,String “Cardano”)]
Aug 02 00:05:26 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:26.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.926974e-2,“slot”:36296435,“delegMapSize”:722782,“utxoSize”:2765692,“credentials”:“Cardano”}
I then run a | grep on the log for this GEN: 1
journalctl --unit=cardano-node --since=‘2021-08-01 00:00:23’ --until=‘2021-08-02 00:05:27’ | grep “(Gen: 1)”
each of my missed slots is at the same time as this log entry, but some of them I didn’t miss the slot.
I take for example for me 23.43.40 where I did send a leadership script
Aug 01 23:43:39 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:39.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.9470693e-2,“slot”:36295128,“delegMapSize”:722762,“utxoSize”:2765484,“credentials”:“Cardano”}
Aug 01 23:43:40 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:40.85 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.9470693e-2,“slot”:36295129,“delegMapSize”:722762,“utxoSize”:2765484,“credentials”:“Cardano”}
Aug 01 23:43:41 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:41.00 UTC] {“kind”:“TraceStartLeadershipCheck”,“chainDensity”:4.9470693e-2,“slot”:36295130,“delegMapSize”:722762,“utxoSize”:2765484,“credentials”:“Cardano”}
There is gen: 1 in the logs
Aug 01 23:43:40 prod01 cardano-node[525026]: 119090344 2045655472 2104467496 4.912 1.266 5973.130 103188.966 0 0 (Gen: 1)
And sure enough there is the 2 second CPU spike
11:43:38 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:39 PM all 2.37 0.00 0.75 0.00 0.00 0.62 0.00 0.00 0.00 96.26
11:43:39 PM 0 0.98 0.00 0.00 0.00 0.00 1.96 0.00 0.00 0.00 97.06
11:43:39 PM 1 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:39 PM 2 1.98 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 97.03
11:43:39 PM 3 6.86 0.00 1.96 0.00 0.00 1.96 0.00 0.00 0.00 89.22
11:43:39 PM 4 9.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 90.00
11:43:39 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:43:39 PM 6 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
11:43:39 PM 7 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99
11:43:39 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:40 PM all 38.81 0.00 0.25 0.00 0.00 0.75 0.00 0.00 0.00 60.20
11:43:40 PM 0 74.75 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.25
11:43:40 PM 1 3.81 0.00 0.95 0.00 0.00 5.71 0.00 0.00 0.00 89.52
11:43:40 PM 2 74.75 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.25
11:43:40 PM 3 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:40 PM 4 71.72 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 28.28
11:43:40 PM 5 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.01
11:43:40 PM 6 74.26 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 24.75
11:43:40 PM 7 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 88.00
11:43:40 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:41 PM all 26.00 0.00 0.62 0.00 0.00 0.37 0.00 0.00 0.00 73.01
11:43:41 PM 0 51.49 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 47.52
11:43:41 PM 1 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 97.06
11:43:41 PM 2 51.49 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 47.52
11:43:41 PM 3 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:41 PM 4 52.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 48.00
11:43:41 PM 5 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:41 PM 6 52.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 47.00
11:43:41 PM 7 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:41 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:42 PM all 2.88 0.00 0.75 0.00 0.00 0.13 0.00 0.00 0.00 96.24
11:43:42 PM 0 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
11:43:42 PM 1 0.00 0.00 1.98 0.00 0.00 0.00 0.00 0.00 0.00 98.02
11:43:42 PM 2 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99
11:43:42 PM 3 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
11:43:42 PM 4 2.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 97.98
11:43:42 PM 5 9.09 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 89.90
11:43:42 PM 6 9.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 90.82
11:43:42 PM 7 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:42 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:43 PM all 2.63 0.00 0.38 0.00 0.00 0.50 0.00 0.00 0.00 96.50
11:43:43 PM 0 0.97 0.00 0.97 0.00 0.00 1.94 0.00 0.00 0.00 96.12
11:43:43 PM 1 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 0.00 98.99
11:43:43 PM 2 4.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 95.96
11:43:43 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:43:43 PM 4 4.95 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 93.07
11:43:43 PM 5 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
11:43:43 PM 6 9.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 90.91
For what it is worth there is also a Gen: 0 that is much more frequent in the logs and there is no spike in CPU usage during those times.
We will update this chain once we have more data and findings