No blocks during the epoch for almost 5 minutes?!

Hey guys,
I just lost a block during a strange period this epoch (Dec 7th). It was at around 8 am CET, so not during the epoch switch.
There were no blocks minted for almost 5 minutes.
Mempool was big.
Many relays lost many connections.
No big processor usage increase.
I had a block scheduled during this period which was minted but was probably ghosted.
[monitoring screenshots]

Feels like an epoch transition, where many nodes are busy doing some calculations?
I looked at the same time 5 days ago and didn't find a similar situation.
Any idea what's happening? Maybe due to some changes in 1.35.4?

Update: also no slots were missed by the BP:
[screenshot: missed slots metric]

3 Likes

I think I have seen some discussion (though not in the forum, IOG Discord maybe?) that those five-minute breaks happen occasionally, with no conclusion as to exactly why.

3 Likes

OK, thanks for the feedback; hopefully we will find out soon. Some blocks were definitely lost during that period.

2 Likes

Yeah, it was IOG Discord:

@georgem1976 wrote:

4 Likes

This sounds like expected behaviour.

Fundamentally there will be block production gaps - the number of blocks in a given time window follows a Poisson distribution, so the intervals between blocks are approximately exponentially distributed (a consequence of the random mechanism for establishing leadership). This comes with having a hidden (until used) leadership schedule.

We would expect such gaps about once or twice a year (on average).
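
As a quick sanity check, here is a back-of-the-envelope calculation (a sketch of mine in Python, assuming mainnet's 1-second slots, active slot coefficient f = 0.05, and 432,000-slot epochs):

```python
# Back-of-the-envelope estimate of long block gaps, assuming per-slot
# Bernoulli leadership with mainnet parameters (1 s slots, f = 0.05).
f = 0.05                                   # active slot coefficient
slots_per_epoch = 432_000                  # 5 days of 1-second slots
blocks_per_epoch = f * slots_per_epoch     # ~21,600 expected blocks
epochs_per_year = 365 / 5                  # ~73

for gap_seconds in (120, 180, 240, 300, 360):
    # P(no slot leader for `gap_seconds` consecutive slots)
    p_gap = (1 - f) ** gap_seconds
    # Each of the ~21,600 inter-block gaps independently has this chance
    # of reaching the given length, so the expected count per epoch is:
    per_epoch = blocks_per_epoch * p_gap
    print(f">= {gap_seconds:3d}s: ~{per_epoch:8.4f} per epoch, "
          f"~{per_epoch * epochs_per_year:7.2f} per year")
```

With these numbers, gaps of >= 2 minutes occur dozens of times per epoch, >= 3 minutes roughly twice per epoch, >= 4 minutes a handful of times per year, and >= 5 minutes more like once every few years - so how often "such gaps" happen depends heavily on where you draw the line.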

4 Likes

What evidence do you have for this statement? The evidence above doesn't seem to support it.

1 Like

I think you are right. I only have evidence for my own block 🙂
So then the question is, why did my block not get adopted.
Also, what would explain the number of relay connections falling off during that period?

2 Likes

But @ADA4Good wrote:

So, it is not a consequence of leadership randomness, is it?

3 Likes

For CIP-Leios I'm proposing time-deterministic, periodic RB, EB, and IB production with a random, unknown leader. It is not necessary to have both block-timing randomness and leader randomness, IMHO. My idea would prevent long gaps.

However, this seems like a parallel-programming problem in the node. Network connections should not be dropped because the node is busy with a single task that blocks everything else. We should not have things fall apart at epoch boundaries. Leios doesn't matter until the node as it is today runs smoothly.

What does the internal parallelism look like? Does anybody have a document to share?

4 Likes

Do you (normally) have good connectivity to other peers? Although the network connections do "time out" (they can't tell the difference between such a 'long' gap and the remote peer failing), those timeouts are staggered so that overall connectivity is re-established.

1 Like

That is not what happens; there is a range of timeouts (set at different times to track different probabilities of false positives). The timeout is about lack of communication, as you can't measure remotely how busy a node is, only that it didn't communicate.
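
To illustrate the idea, here is a toy sketch (mine, not the actual ouroboros-network code; the names and values are made up):

```python
import random

BASE_TIMEOUT = 120.0   # hypothetical base inactivity timeout (seconds)
JITTER = 60.0          # hypothetical stagger range (seconds)

class PeerConn:
    """One remote peer connection with its own staggered timeout."""
    def __init__(self, name, now):
        self.name = name
        self.last_msg = now
        # Stagger: each connection gets a slightly different deadline
        # (tracking a different false-positive probability), so a quiet
        # period on the chain doesn't drop every peer at the same moment.
        self.timeout = BASE_TIMEOUT + random.uniform(0.0, JITTER)

    def on_message(self, now):
        self.last_msg = now  # any communication resets the clock

    def expired(self, now):
        # The only remote signal is silence: we can't see how busy the
        # peer is, only that it hasn't communicated.
        return now - self.last_msg > self.timeout
```

Under a 4-5 minute chain-wide pause, some connections cross their (shorter) deadlines and get dropped while others survive - which would be consistent with only a fraction of peers disconnecting in the graphs above.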

2 Likes

The problem with determinism (in this context) is that it exposes the identity of the peer performing the action; that, in turn, exposes it to a targeted denial-of-service attack. That would not be an acceptable trade.

This is why the random schedule is used - there is no a priori information as to which node is going to perform the action.

2 Likes

You know this problem exists because the ~21,600 leaders are distributed randomly within 432,000 slots, with the potential for large gaps. You aren't just picking random secret leaders; you are also picking a random block cadence.

Fine, for Leios I'm recommending a random secret leader with 21,600 slots/epoch for RBs, a random secret leader with 42,300? slots/epoch for EBs, and a random secret leader with 432,000 slots per epoch for IBs. It seems I will have to continue to improve my communication to get this idea across.

Random secret leader with deterministic cadence.
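
To make the contrast concrete, here is a toy simulation (my own sketch, not the Leios spec) of the worst block gap in one epoch under the two schemes, with mainnet-like parameters:

```python
import random

SLOTS = 432_000   # slots per epoch (mainnet, 1 s each)
F = 0.05          # active slot coefficient

def max_gap_random_slots():
    """Praos-style: every slot is independently a leader slot with
    probability F, so both the leader and the cadence are random."""
    last, worst = 0, 0
    for slot in range(1, SLOTS + 1):
        if random.random() < F:
            worst = max(worst, slot - last)
            last = slot
    return worst

def max_gap_fixed_cadence():
    """Sketched alternative: a block is due every 1/F = 20 slots at a
    deterministic time; only who produces it stays secret and random."""
    return int(1 / F)   # the gap is constant by construction

print("random slots, worst gap this epoch:", max_gap_random_slots(), "s")
print("fixed cadence, every gap          :", max_gap_fixed_cadence(), "s")
```

The random-slot scheme typically yields a worst gap of a few minutes per epoch, while the fixed cadence pins every gap at 20 seconds - at the cost of revealing when (though not by whom) the next block is due.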

3 Likes

Yes, I normally have good connectivity with other peers on all 7 of my relays. I have an alert for whenever the number of connections falls below a threshold, and it did not trigger. Let me know if you need any other specific info to help with this.

1 Like

We missed a block at around the same time - definitely over 4 minutes of no blocks.
Here is my log from the BP:

2 Likes

Just became aware of this topic.
In my opinion, the drop in remote peer connections was not the cause of the block pause but the other way around: (p2p) nodes will drop remote connections if a remote peer does not announce a new block height for a certain time.
In a situation where no new blocks are produced, there is nothing to pipeline as a new header, and so the connections drop, as visible in the posted graphs with some 1-2 minutes of delay.

By counting the slot gaps between all blocks of an epoch, we get a distribution like this:

The light orange bars are slot battles (the ~5% of random collisions for the same slot).
The blue bars are the actually counted gaps between blocks.
The dark orange bars were not counted but interpolated to fill up the expected distribution curve; this amount is what is lost in density through height battles.

As we can see, block interruptions of more than 100 seconds are quite rare. However, a total of 65 interruptions of more than 120 seconds were counted in epoch 376.
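
For reference, the counting itself is simple. A sketch (assuming you can export the slot numbers of all blocks in an epoch, e.g. from db-sync; the input format is my assumption):

```python
from collections import Counter

def gap_distribution(block_slots):
    """Histogram of slot gaps between consecutive on-chain blocks."""
    slots = sorted(set(block_slots))   # at most one block per slot on-chain
    gaps = [b - a for a, b in zip(slots, slots[1:])]
    return Counter(gaps), max(gaps)

# Example with made-up slot numbers:
hist, longest = gap_distribution([10, 12, 40, 41, 170, 430])
print(hist)                    # gap length -> how often it occurred
print("longest gap:", longest, "slots (= seconds on mainnet)")
```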

Now let's look at another graph, showing the max block delay for every epoch since genesis:

Let's ignore the Byron epochs 0-206. They used BFT consensus and different slot settings in a federated network. Only with Shelley did the randomness of Ouroboros PoS kick in.

There are two Y-axes showing the same time, for convenience: on the left in seconds and on the right in minutes.

This graph shows two values for each epoch:
The blue line looks explicitly at the epoch boundary gap.
The orange bar shows the longest block gap in the whole epoch, excluding the boundary gap.

We can see that practically every epoch has at least one 3-4 minute gap, and 5-6 minutes are also possible.

3 Likes

But: We have two pools in this thread that would have had a block assigned during this pause. The drop in remote connections would explain why those blocks didn't get adopted, but not the other way round.

1 Like

Yes, definitely interesting to look into.
May I ask about their topology?

A possible explanation would be: they connect only to IOG (default topology) and aren't subscribed to TU. If they are small in active stake and had only a few incoming connections from p2p relays, those may all have dropped their connections after a 2-minute phase without blocks. (<<< just guessing)

1 Like

Both of my relays are using Topology Updater.

By looking at one of my relays' block height and remote peer connections:


[screenshot: block height and remote peer connections]

I see the block pause at 7:16 - 7:21 GMT.
8 of 18 remote peers dropped their connections but quickly re-established them.

At 08:00 GMT there was another such block pause.

Also, I can't see any extraordinary spike in CPU, memory, disk, or network I/O at those times.

Can I have a look at the logs from those minutes? (DM)

3 Likes