No more missed slots (during epoch) after changing one setting

ADA4Good · 31 July 2021 22:31

Hello everybody,

A couple of SPOs in the xSPO alliance have been working to understand what causes the regular missed slots most of us seem to see. We are not far enough along in our investigation to report results, but I can already share a finding that might help other SPOs who see regular missed slots.

We want to eliminate missed slots because a leader slot that happens to be one of the missed slots means a missed block. Obviously we don’t want that, no matter how small the chance of it happening.

My BP with 16GB mem and 6 cores was showing about 1 or 2 missed slots per hour on average. This excludes the missed slots everybody seems to have during the epoch switch.

After bumping the number of cores used by the node from 2 to 6, I have not had a single missed slot for almost 4 days. During this period everything worked fine and I minted 3 blocks.
Again, this excludes the epoch switch, where I still saw missed slots.

I use the cntools setup and changed this setting in the cnode.sh file (located in the scripts folder). Then simply restarted the node.

cardsfan7189 · 1 August 2021 01:32

Thanks @ADA4Good for sharing. I will give it a try. I’m too small to mint blocks at this time but still have been trying to find ways to avoid the missed slots (20 - 40 per epoch). My last tweak was stopping logging according to one of the posts (can’t find it now) but that didn’t help. Will see how changing this setting works.

ADA4Good · 1 August 2021 01:44

By the way, I have the TraceMempool setting active (true).

Ruslan_Sendecky · 1 August 2021 10:44

Hi,

You asked me to check for missed slots on the 1.28 version in a different thread. I’ve been monitoring it and there is an interesting pattern.
I am running a small script that checks for missed slots every 5 minutes and logs the number, but it only creates a log entry if the number of missed slots has increased since the previous check. It increments only by 2 for some reason (epoch change aside), roughly every 1.5 hours.
Look, something more sinister is going on there. It doesn’t look random.

21-07-31_21:38:28 20
21-07-31_21:50:10 22
21-07-31_22:20:10 24
21-07-31_23:35:10 26
21-08-01_01:55:10 28
21-08-01_02:10:10 30
21-08-01_03:10:10 32
21-08-01_04:20:10 34
21-08-01_05:35:10 36
21-08-01_07:15:10 38
21-08-01_09:10:10 40
21-08-01_10:15:10 42
21-08-01_11:35:10 44
21-08-01_12:45:10 48
21-08-01_14:00:10 50
21-08-01_15:10:10 52
21-08-01_16:25:10 54
21-08-01_17:35:10 56
21-08-01_18:50:10 59
21-08-01_20:00:10 61
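
A quick way to check the pattern in a log like the one above is to compute the step size and the time between entries. The sketch below is only an illustration: it assumes the two-column `timestamp count` format shown above, and the function names are made up for this post.

```python
from datetime import datetime

def parse_missed_slot_log(text):
    """Parse lines like '21-07-31_21:38:28 20' into (datetime, count) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        stamp = datetime.strptime(parts[0], "%y-%m-%d_%H:%M:%S")
        entries.append((stamp, int(parts[1])))
    return entries

def increments(entries):
    """Return (minutes since previous entry, increase in missed slots) per step."""
    steps = []
    for (t0, n0), (t1, n1) in zip(entries, entries[1:]):
        steps.append(((t1 - t0).total_seconds() / 60.0, n1 - n0))
    return steps

# First four entries of the log above.
sample = """\
21-07-31_21:38:28 20
21-07-31_21:50:10 22
21-07-31_22:20:10 24
21-07-31_23:35:10 26
"""

for minutes, delta in increments(parse_missed_slot_log(sample)):
    print(f"+{delta} missed slots after {minutes:.1f} min")
```

Running it over the full log makes it easy to see whether every step really is +2 and how regular the spacing is.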

I have plenty of CPU power:

CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 2
Core(s) per socket: 3
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Cascadelake)
Stepping: 6
CPU MHz: 2992.968

Markus-VITAL · 1 August 2021 19:19

Our current understanding of the situation:

The slot check runs every second in the Tick function.
If the check is (for whatever reason) not able to run within a specific second (slot), that slot is missed.
If a block was assigned to that slot, the block is missed as well.
The risk of this happening is rather small for a pool with low saturation. But with 100 missed slots per epoch, a fully saturated pool would have a ~1.4% risk of missing at least one block per epoch.
The average number of missed slots we observed is ~250 slots per epoch, which increases the risk by about 2.5x.
All pools miss most of these slots at the epoch boundary.
Some pools also miss slots continuously during the epoch.
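
For anyone who wants to redo the risk estimate with their own numbers, here is a rough sketch of the arithmetic. It treats missed slots as spread uniformly over the 432,000 slots of an epoch, which is a simplification since most of them cluster at the epoch boundary. The figure of 60 leader slots per epoch for a saturated pool is an assumption of this sketch, not a number from our measurements; it was picked because it reproduces roughly the ~1.4% and ~2.5x figures above.

```python
SLOTS_PER_EPOCH = 432_000  # Cardano mainnet: 5 days of 1-second slots

def risk_of_missed_block(missed_slots, leader_slots, slots_per_epoch=SLOTS_PER_EPOCH):
    """Probability that at least one leader slot falls on a missed slot.

    Assumes missed slots are uniformly and independently spread over the
    epoch (a simplification, see the note above).
    """
    p_slot_missed = missed_slots / slots_per_epoch
    return 1.0 - (1.0 - p_slot_missed) ** leader_slots

# ASSUMPTION: about 60 leader slots per epoch for a saturated pool.
LEADER_SLOTS = 60

low = risk_of_missed_block(100, LEADER_SLOTS)
high = risk_of_missed_block(250, LEADER_SLOTS)
print(f"100 missed slots per epoch: {low:.2%}")
print(f"250 missed slots per epoch: {high:.2%} ({high / low:.1f}x)")
```

Plug in your own missed-slot count and expected number of blocks to see what it means for your pool.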

Current working assumption:

The slots are missed because the CPU is fully blocked with other tasks.
Increasing the cores assigned to CNODE (default is 2) to 4 or 6 helped 2 affected pools avoid the slots that were missed during an epoch.
We assume that this is caused by other CPU-consuming tasks (e.g. a garbage collection run). By increasing the cores for CNODE while keeping the cores for garbage collection at 2, we assume enough capacity stays free for the Tick function to execute successfully every second (and therefore not miss a slot) → This was successfully verified on 2 pools.
However, this does not help during the epoch transition.
Most nodes miss ~250 slots during the epoch transition. This comes with 4-5 minutes of increased CPU load.
The CPU load drives only 1 core to 100%. The other cores only temporarily jump to >50% and are at much lower load (<25%) most of the time.
We assume the CPU tasks during the epoch transition are not multithreaded.
At the same time this task blocks the Tick function from executing. Therefore increasing the cores does not avoid missed slots during the epoch transition.

The epoch transition issue does not seem to be addressable through additional cores. We assume that a CPU running at a higher clock speed would reduce the number of missed slots, as it would shorten the processing time during the transition. However, this would not eliminate the issue, just reduce its impact.

We assume that the issue can only be resolved by optimizing the CNODE code (or rather the Ouroboros code) to run the Tick function in an independent thread. Alternatively, letting the tick consider future slots could also be a solution, though not as clean as a multi-threading approach.

On a more general level, we could consider the Tick function real-time software, as it needs to execute every second without any latency. In my view, running software with a real-time requirement conflicts with a software stack that relies on unplannable loads like garbage collection. However, this does not currently seem to be a pressing issue, as the continuously missed slots seem to be addressable by the approach described above.

We’ll let you know as soon as we find out more.
Any data you can contribute about your systems helps us gain more insight, especially if the impact differs across hosting configurations. We will therefore soon provide a link to a Google Sheet where you can share your data.

IRF160 · 2 August 2021 02:48

Further update on the slot issue. Thanks to the folks at the xSPO alliance I was able to gather more information, and we might have been able to pinpoint CPU spikes and log entries that correlate with either a dropped slot or high CPU usage.

Using a cronjob and mpstat I was able to log each of my 8 cores every second. Then, thanks to some scripts, I can pinpoint the second in which I lost a slot. Here is the raw data from a slot that I missed.

Timestamp 12:05:24 AM (00:05:24 UTC, slot 36296433) is when the leadership check did not run:

Aug 02 00:05:22 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:22.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296431,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}
Aug 02 00:05:23 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:23.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296432,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:25.11 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296434,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}
Aug 02 00:05:26 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:26.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296435,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}
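
For anyone who wants to do the same kind of check, the idea is simply to look for gaps in the slot numbers of the TraceStartLeadershipCheck entries. This is a minimal sketch of that idea and not the actual scripts used here; the function name is made up, and it assumes the log lines look like the ones above.

```python
import re

# Matches the slot number whether the log uses straight or curly quotes.
SLOT_RE = re.compile(r"slot\W+(\d+)")

def find_missed_slots(log_lines):
    """Return slot numbers that have no TraceStartLeadershipCheck entry,
    looking only between the first and last slot seen in the input."""
    seen = []
    for line in log_lines:
        if "TraceStartLeadershipCheck" not in line:
            continue
        match = SLOT_RE.search(line)
        if match:
            seen.append(int(match.group(1)))
    missed = []
    for prev, cur in zip(seen, seen[1:]):
        missed.extend(range(prev + 1, cur))  # empty when slots are consecutive
    return missed

# Abbreviated versions of the four log lines above.
sample = [
    '[2021-08-02 00:05:22.00 UTC] {"kind":"TraceStartLeadershipCheck","slot":36296431}',
    '[2021-08-02 00:05:23.00 UTC] {"kind":"TraceStartLeadershipCheck","slot":36296432}',
    '[2021-08-02 00:05:25.11 UTC] {"kind":"TraceStartLeadershipCheck","slot":36296434}',
    '[2021-08-02 00:05:26.00 UTC] {"kind":"TraceStartLeadershipCheck","slot":36296435}',
]

print(find_missed_slots(sample))
```

Feeding it the output of journalctl for the node unit, line by line, should list every slot for which no leadership check was logged.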

So here is where it gets interesting: the per-core usage for the same timestamps.

12:05:21 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:22 AM all 1.37 0.00 0.87 0.00 0.00 0.25 0.00 0.00 0.00 97.51
12:05:22 AM 0 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
12:05:22 AM 1 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
12:05:22 AM 2 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:22 AM 3 0.97 0.00 1.94 0.00 0.00 0.00 0.00 0.00 0.00 97.09
12:05:22 AM 4 9.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 90.00
12:05:22 AM 5 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:22 AM 6 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
12:05:22 AM 7 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99

12:05:22 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:23 AM all 2.63 0.00 0.50 0.00 0.00 0.63 0.00 0.00 0.00 96.25
12:05:23 AM 0 0.00 0.00 0.97 0.00 0.00 3.88 0.00 0.00 0.00 95.15
12:05:23 AM 1 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
12:05:23 AM 2 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:23 AM 3 8.16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 91.84
12:05:23 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:23 AM 5 9.90 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 89.11
12:05:23 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:23 AM 7 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99

12:05:23 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:24 AM all 34.74 0.00 0.50 0.00 0.00 0.37 0.00 0.00 0.00 64.38
12:05:24 AM 0 68.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 32.00
12:05:24 AM 1 64.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 35.35
12:05:24 AM 2 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 97.06
12:05:24 AM 3 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
12:05:24 AM 4 11.88 0.00 1.98 0.00 0.00 0.00 0.00 0.00 0.00 86.14
12:05:24 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:24 AM 6 66.67 0.00 0.98 0.00 0.00 0.00 0.00 0.00 0.00 32.35
12:05:24 AM 7 67.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33.00

12:05:24 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:25 AM all 25.94 0.00 0.25 0.00 0.00 0.50 0.00 0.00 0.00 73.32
12:05:25 AM 0 51.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 48.00
12:05:25 AM 1 51.96 0.00 0.00 0.00 0.00 0.98 0.00 0.00 0.00 47.06
12:05:25 AM 2 0.00 0.00 0.00 0.00 0.00 2.97 0.00 0.00 0.00 97.03
12:05:25 AM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:25 AM 6 51.52 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 47.47
12:05:25 AM 7 53.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 47.00

12:05:25 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:05:26 AM all 2.50 0.00 0.87 0.00 0.00 0.25 0.00 0.00 0.00 96.38
12:05:26 AM 0 7.07 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 91.92
12:05:26 AM 1 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
12:05:26 AM 2 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:26 AM 3 0.00 0.00 1.98 0.00 0.00 0.99 0.00 0.00 0.00 97.03
12:05:26 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:05:26 AM 5 1.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
12:05:26 AM 6 8.91 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 90.10

Up to second 23, CPU usage is low. For the next 2 seconds (24 and 25) usage is high on four of the cores (0, 1, 6 and 7), and from second 26 onward we are back to low usage.

I then look at the cardano node logs and come across something interesting

Aug 02 00:05:23 prod01 cardano-node[525026]: [prod01:cardano.node.Forge:Info:70] [2021-08-02 00:05:23.00 UTC] fromList [("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 3.6296432e7)])),("credentials",String "Cardano")]
Aug 02 00:05:25 prod01 cardano-node[525026]: 94174488 2040688456 2100863184 4.646 1.194 6023.167 104493.225 0 0 (Gen: 1)
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:25.11 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296434,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}
Aug 02 00:05:25 prod01 cardano-node[525026]: [prod01:cardano.node.Forge:Info:70] [2021-08-02 00:05:25.12 UTC] fromList [("val",Object (fromList [("kind",String "TraceNodeNotLeader"),("slot",Number 3.6296434e7)])),("credentials",String "Cardano")]
Aug 02 00:05:26 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-02 00:05:26.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.926974e-2,"slot":36296435,"delegMapSize":722782,"utxoSize":2765692,"credentials":"Cardano"}

I then grep the log for this (Gen: 1) entry:

journalctl --unit=cardano-node --since='2021-08-01 00:00:23' --until='2021-08-02 00:05:27' | grep "(Gen: 1)"

Each of my missed slots coincides with one of these log entries, but for some of these entries I didn’t miss a slot.
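
To compare these entries more systematically, the numbers in front of (Gen: 1) can be pulled apart. The sketch below assumes these lines are the GHC runtime's `-S` statistics output, where the columns are bytes allocated, bytes copied, live bytes, GC user time, GC elapsed time, total user time, total elapsed time, two page-fault counters and the generation. That reading is an assumption on my side, and the names `GcSample` and `parse_gc_line` are made up for illustration.

```python
from typing import NamedTuple, Optional

class GcSample(NamedTuple):
    live_bytes: int
    gc_elapsed_s: float
    generation: int

def parse_gc_line(line: str) -> Optional[GcSample]:
    """Parse one RTS statistics line; return None if it does not look like one.

    ASSUMPTION: column order follows the GHC '+RTS -S' output format.
    """
    if "(Gen:" not in line:
        return None
    # Drop the journald prefix, if present, then split off the generation.
    payload = line.strip().split("]: ", 1)[-1]
    stats, _, gen_part = payload.partition("(Gen:")
    cols = stats.split()
    if len(cols) < 7:
        return None
    try:
        return GcSample(
            live_bytes=int(cols[2]),
            gc_elapsed_s=float(cols[4]),
            generation=int(gen_part.strip(" )")),
        )
    except ValueError:
        return None

line = ("Aug 02 00:05:25 prod01 cardano-node[525026]: 94174488 2040688456 "
        "2100863184 4.646 1.194 6023.167 104493.225 0 0 (Gen: 1)")
sample = parse_gc_line(line)
print(sample)
```

Under that reading, the 00:05:25 entry reports about 1.19 s of elapsed GC time, which is longer than one slot. The 23:43:40 entry further down reports about 1.27 s without a missed slot, so whether a pause of this length swallows a whole slot presumably depends on where within the second it starts.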

Take, for example, 23:43:40, where the leadership check did run:

Aug 01 23:43:39 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:39.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.9470693e-2,"slot":36295128,"delegMapSize":722762,"utxoSize":2765484,"credentials":"Cardano"}
Aug 01 23:43:40 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:40.85 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.9470693e-2,"slot":36295129,"delegMapSize":722762,"utxoSize":2765484,"credentials":"Cardano"}
Aug 01 23:43:41 prod01 cardano-node[525026]: [prod01:cardano.node.LeadershipCheck:Info:70] [2021-08-01 23:43:41.00 UTC] {"kind":"TraceStartLeadershipCheck","chainDensity":4.9470693e-2,"slot":36295130,"delegMapSize":722762,"utxoSize":2765484,"credentials":"Cardano"}

There is a (Gen: 1) entry in the logs:
Aug 01 23:43:40 prod01 cardano-node[525026]: 119090344 2045655472 2104467496 4.912 1.266 5973.130 103188.966 0 0 (Gen: 1)

And sure enough, there is the 2-second CPU spike:
11:43:38 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:39 PM all 2.37 0.00 0.75 0.00 0.00 0.62 0.00 0.00 0.00 96.26
11:43:39 PM 0 0.98 0.00 0.00 0.00 0.00 1.96 0.00 0.00 0.00 97.06
11:43:39 PM 1 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:39 PM 2 1.98 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 97.03
11:43:39 PM 3 6.86 0.00 1.96 0.00 0.00 1.96 0.00 0.00 0.00 89.22
11:43:39 PM 4 9.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 90.00
11:43:39 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:43:39 PM 6 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
11:43:39 PM 7 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99

11:43:39 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:40 PM all 38.81 0.00 0.25 0.00 0.00 0.75 0.00 0.00 0.00 60.20
11:43:40 PM 0 74.75 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.25
11:43:40 PM 1 3.81 0.00 0.95 0.00 0.00 5.71 0.00 0.00 0.00 89.52
11:43:40 PM 2 74.75 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.25
11:43:40 PM 3 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:40 PM 4 71.72 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 28.28
11:43:40 PM 5 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.01
11:43:40 PM 6 74.26 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 24.75
11:43:40 PM 7 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 88.00

11:43:40 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:41 PM all 26.00 0.00 0.62 0.00 0.00 0.37 0.00 0.00 0.00 73.01
11:43:41 PM 0 51.49 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 47.52
11:43:41 PM 1 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 97.06
11:43:41 PM 2 51.49 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 47.52
11:43:41 PM 3 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:41 PM 4 52.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 48.00
11:43:41 PM 5 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
11:43:41 PM 6 52.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 47.00
11:43:41 PM 7 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00

11:43:41 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:42 PM all 2.88 0.00 0.75 0.00 0.00 0.13 0.00 0.00 0.00 96.24
11:43:42 PM 0 0.99 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 97.03
11:43:42 PM 1 0.00 0.00 1.98 0.00 0.00 0.00 0.00 0.00 0.00 98.02
11:43:42 PM 2 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 98.99
11:43:42 PM 3 0.99 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 98.02
11:43:42 PM 4 2.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 97.98
11:43:42 PM 5 9.09 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 89.90
11:43:42 PM 6 9.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 90.82
11:43:42 PM 7 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00

11:43:42 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:43:43 PM all 2.63 0.00 0.38 0.00 0.00 0.50 0.00 0.00 0.00 96.50
11:43:43 PM 0 0.97 0.00 0.97 0.00 0.00 1.94 0.00 0.00 0.00 96.12
11:43:43 PM 1 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 0.00 98.99
11:43:43 PM 2 4.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 95.96
11:43:43 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
11:43:43 PM 4 4.95 0.00 0.99 0.00 0.00 0.99 0.00 0.00 0.00 93.07
11:43:43 PM 5 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
11:43:43 PM 6 9.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 90.91

For what it is worth there is also a Gen: 0 that is much more frequent in the logs and there is no spike in CPU usage during those times.

We will update this chain once we have more data and findings

Ruslan_Sendecky · 2 August 2021 06:42

Thanks for the info.
As far as I remember, Haskell uses “sparks” for concurrency, which are not OS threads. You can have millions of sparks per real physical thread. They will execute on schedule. This effectively means that this is not parallel execution. That’s why you observe only one CPU core spiking.
Concurrency != parallelism
If this ticker is being blocked and waits for its turn in the scheduler, then yes, your explanation makes sense. This needs to run in a separate physical OS thread.
Has this been raised on GitHub?
I guess, even if you raise it, it won’t get high priority because it doesn’t affect overall network health. Who cares about a single individual pool missing a block if the block gets reassigned to another pool anyway. The beauty of decentralisation.
P.S. /rant/
Haskell is brilliant in many ways but for high load concurrent networking a modern language like Go with built in parallel execution would make more sense.

Markus-VITAL · 2 August 2021 06:42

@IRF160 Thanks for your detailed observations.
This means you found an example where a Big Garbage Collection (GC1) caused a CPU spike which was long enough to not successfully run the Leadership Validation (=missed slot).
Some details about the GC mechanism: https://www.channable.com/tech/lessons-in-managing-haskell-memory

Now there is a chance to optimize this scenario and tune the GC parameters.
But before that: Have you already tried increasing the Assigned Cores of the Cardano Node?

Markus-VITAL · 2 August 2021 06:49

Thanks for your reply! Very interesting, and it gives a better understanding of what is happening. So theoretically, at the code level, it could also help if the “big” calculations in the epoch transition were split into several “smaller” calculations, to give the Tick function a chance to be scheduled.

We have not raised anything on GitHub so far. But I have seen discussions around that, and IOG also published an analysis of the epoch change missed slots some time ago. So they are surely aware of the issue.

ADA4Good · 2 August 2021 07:18

It is my understanding that this is not the case. If the leader of a slot does not mint that slot, that block is never minted, since nobody else is allowed to mint that same block. The transactions will just move to the next leader (which is predetermined, as the stake snapshot and nonce were available before the epoch started).

Correct me if I am wrong

Ruslan_Sendecky · 2 August 2021 07:30

Yes, you are mostly correct. It’s been a while since I played with Haskell so I might’ve forgotten things. Anyway…
You want to run the ticker in a non-blocking way. I don’t know what’s happening in the code as I haven’t read it.
Not all calculations can be “split” effectively, because then you have to manage the communication between the threads if, for instance, one piece of calculation depends on the other. But you are correct, time-sensitive functions should stay away from heavy calculation threads. If that ticker function is very time sensitive, then it should get a dedicated OS thread.
In Haskell, in Control.Concurrent you have two functions “forkIO” and “forkOS”. The first one creates an unbound green thread that can run on any OS thread. The second one is “bound” which means the thread gets assigned to a real OS thread and stays there. Haskell threads are managed by the runtime. OS threads are managed by well… the OS. It is more complicated than that though… Because an OS thread is scheduled as well and not guaranteed to have a dedicated physical CPU core (threads are much older than multi-core CPUs)
I am trying to attach an image here to help you to understand how Haskell threads work. Don’t know if it is going to work though.

Ruslan_Sendecky · 2 August 2021 07:51

You know what, you could be right. I am still discovering/learning the protocol
I am reading now and apparently the transactions get delayed and just wait for another block. But I am also finding a few mentions that it is possible for another SPO to produce the same block, though it is rare. Not sure if it is true though… Need to read a bit more.

On the general subject of missed slots. We could be looking at it the wrong way. It is possible that it has nothing to do with CPU blocking or scheduling etc.
Just found an interesting explanation of “missed slots”.
A missed slot is a slot for which CN cannot determine whether it is the leader or not!
This definition, if correct, changes the picture completely, and it has nothing to do with CPU. There could be many reasons why a node cannot determine if it is a leader or not. Interesting… Just an option. Food for thought.

ADA4Good · 2 August 2021 07:59

Yes, that is called “slot battles”. This is only possible when 2 or more SPOs are elected to mint THE SAME slot. Then the lowest randomly generated number wins. The other way around is not possible: you have to be elected for a slot to mint a block in it. I have written a short blog about this process if you are interested:

How is a stake pool selected to mint a block? Slot Lottery Explained - Staking & Delegation / Operate a Stake Pool - Cardano Forum

Oh, and I have experienced it already. It looks like this:
Stolen block

If this were true, everybody should see the same missed slots, right? Which is not the case. We know for sure now that slower CPUs get more missed slots.

Ruslan_Sendecky · 2 August 2021 08:06

Thanks. I’ll read it in a sec.
So if 2 or more SPOs are chosen to be a slot leader and one misses the block, then we don’t wait for the next block, right? The block just gets reassigned to another leader (among initially chosen)? Or …
The whole block and all initially chosen leaders are discarded and the transactions wait for a completely new block?

ADA4Good · 2 August 2021 08:09

In the not that common case where 2 leaders are assigned for the same slot, the one with the lowest random number gets to mint that block. If that “real” leader is not online or misses that slot I don’t actually know if the other leader is allowed to mint or the block is skipped. But one thing is for sure, after the nonce and stake snapshot are available all leaders are predetermined and if you are not a leader you will never mint a block. Moreover, the network checks all blocks to see if the VRF key matches otherwise the whole block is not even valid.

icq2006 · 3 August 2021 01:31

Since I changed the CPU cores in cnode.sh, I haven’t missed a slot.

IPIB_Pool · 3 August 2021 08:13

Does anyone know how to change the assigned number of CPU cores for a setup based on the Coincashew guide?
Thx

Ruslan_Sendecky · 3 August 2021 08:48

Hi,

The Haskell executable is the same regardless of which guide you used to set up your pool.
First, multicore support must be enabled at compile time. I believe you have to pass these options to cabal. This should be enough. I think…

-rtsopts
-threaded

Then you should be able to pass this to your runtime: +RTS -N4
This is untested and off the top of my head. So please investigate and test

Cheers

Ruslan_Sendecky · 3 August 2021 10:00

Yep, just tested it myself. Works without a problem.
So basically, what you need to do is set the runtime option for the CPU cores.
Execute your cardano node binary like this:

cardano-node run +RTS -N -RTS --the-rest-of-the-options

If you don’t pass a number of cores to -N, it will use all available cores.

I am gonna set the missed slots logging again to see if this solves the problem.

Cheers

IPIB_Pool · 3 August 2021 11:33

Thanks a lot.
