No blocks during the epoch for almost 5 minutes?!

Hey guys,
I just lost a block during a strange period this epoch (Dec 7th). It was at around 8 am CET, so not during the epoch switch.
There were no blocks minted for almost 5 minutes.
Mempool was big.
Many relays lost many connections.
No big processor usage increase.
I had a block scheduled during this period which was minted but was probably ghosted.
[monitoring screenshots]

Feels like an epoch transition, where many nodes are busy doing some calculations?
I looked at the same time 5 days ago and didn't find a similar situation.
Any idea what's happening? Maybe due to some changes in 1.35.4?

Update: also no slots were missed by the BP:
[screenshot: missed slots metric]

3 Likes

I think I have seen some discussion (though not in the forum, IOG Discord maybe?) that those five-minute breaks happen occasionally, with no conclusion as to exactly why.

3 Likes

OK, thanks for the feedback; hopefully we will find out soon. Some blocks were definitely lost during that period.

2 Likes

Yeah, it was IOG Discord:

@georgem1976 wrote:

4 Likes

This sounds like expected behaviour.

Fundamentally there will be block production gaps - the number of blocks in a given time window follows a Poisson distribution, so the intervals between blocks are approximately exponentially distributed (a consequence of the random mechanism for establishing leadership). This comes with having a hidden (until used) leadership schedule.

We would expect such gaps about once or twice a year (on average).
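
As a quick sanity check, here is a back-of-the-envelope calculation (a sketch of mine in Python, assuming mainnet's 1-second slots, active slot coefficient f = 0.05, and 432,000-slot epochs):

```python
# Back-of-the-envelope estimate of long block gaps, assuming per-slot
# Bernoulli leadership with mainnet parameters (1 s slots, f = 0.05).
f = 0.05                                   # active slot coefficient
slots_per_epoch = 432_000                  # 5 days of 1-second slots
blocks_per_epoch = f * slots_per_epoch     # ~21,600 expected blocks
epochs_per_year = 365 / 5                  # ~73

for gap_seconds in (120, 180, 240, 300, 360):
    # P(no slot leader for `gap_seconds` consecutive slots)
    p_gap = (1 - f) ** gap_seconds
    # Each of the ~21,600 inter-block gaps independently has this chance
    # of reaching the given length, so the expected count per epoch is:
    per_epoch = blocks_per_epoch * p_gap
    print(f">= {gap_seconds:3d}s: ~{per_epoch:8.4f} per epoch, "
          f"~{per_epoch * epochs_per_year:7.2f} per year")
```

With these numbers, gaps of >= 2 minutes occur dozens of times per epoch, >= 3 minutes roughly twice per epoch, >= 4 minutes a handful of times per year, and >= 5 minutes more like once every few years - so how often "such gaps" happen depends heavily on where you draw the line.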

4 Likes

What evidence do you have for this statement? The evidence above doesn't seem to support it.

1 Like

I think you are right. I only have evidence for my own block 🙂
So then the question is, why did my block not get adopted.
Also, what would explain the number of relay connections falling off during that period?

2 Likes

But @ADA4Good wrote:

So, it is not a consequence of leadership randomness, is it?

3 Likes

For CIP-Leios I'm proposing time-deterministic, periodic RB, EB, and IB production with a random, unknown leader. It is not necessary to have both block-timing randomness and leader randomness, IMHO. My idea would prevent long gaps.

However, this seems like a parallel-programming problem in the node. Network connections should not be dropped because the node is busy with a single task that blocks everything else. We should not have things fall apart at epoch boundaries. Leios doesn't matter until the node as it is today runs smoothly.

What does the internal parallelism look like? Does anybody have a document to share?

4 Likes

Do you (normally) have good connectivity to other peers? Although the network connections do "time out" (they can't tell the difference between such a 'long' gap and the remote peer failing), those timeouts are staggered so that overall connectivity is re-established.

1 Like

That is not what happens; there is a range of timeouts (set at different times to track different probabilities of false positives). The timeout is about lack of communication, as you can't measure remotely how busy a node is, only that it didn't communicate.
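
To illustrate the idea, here is a toy sketch (mine, not the actual ouroboros-network code; the names and values are made up):

```python
import random

BASE_TIMEOUT = 120.0   # hypothetical base inactivity timeout (seconds)
JITTER = 60.0          # hypothetical stagger range (seconds)

class PeerConn:
    """One remote peer connection with its own staggered timeout."""
    def __init__(self, name, now):
        self.name = name
        self.last_msg = now
        # Stagger: each connection gets a slightly different deadline
        # (tracking a different false-positive probability), so a quiet
        # period on the chain doesn't drop every peer at the same moment.
        self.timeout = BASE_TIMEOUT + random.uniform(0.0, JITTER)

    def on_message(self, now):
        self.last_msg = now  # any communication resets the clock

    def expired(self, now):
        # The only remote signal is silence: we can't see how busy the
        # peer is, only that it hasn't communicated.
        return now - self.last_msg > self.timeout
```

Under a 4-5 minute chain-wide pause, some connections cross their (shorter) deadlines and get dropped while others survive - which would be consistent with only a fraction of peers disconnecting in the graphs above.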

2 Likes

The problem with determinism (in this context) is that it exposes the identity of the peer performing the action; that, in turn, exposes it to a targeted denial-of-service attack. That would not be an acceptable trade.

This is why the random schedule is used - there is no a priori information as to which node is going to perform the action.

2 Likes

You know this problem exists because the ~21,600 leaders are distributed randomly within 432,000 slots, with the potential for large gaps. You aren't just picking random secret leaders; you are also picking a random block cadence.

Fine, for Leios I'm recommending a random secret leader with 21,600 slots/epoch for RBs, a random secret leader with 42,300? slots/epoch for EBs, and a random secret leader with 432,000 slots per epoch for IBs. It seems I will have to continue to improve my communication to get this idea across.

Random secret leader with deterministic cadence.
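
To make the contrast concrete, here is a toy simulation (my own sketch, not the Leios spec) of the worst block gap in one epoch under the two schemes, with mainnet-like parameters:

```python
import random

SLOTS = 432_000   # slots per epoch (mainnet, 1 s each)
F = 0.05          # active slot coefficient

def max_gap_random_slots():
    """Praos-style: every slot is independently a leader slot with
    probability F, so both the leader and the cadence are random."""
    last, worst = 0, 0
    for slot in range(1, SLOTS + 1):
        if random.random() < F:
            worst = max(worst, slot - last)
            last = slot
    return worst

def max_gap_fixed_cadence():
    """Sketched alternative: a block is due every 1/F = 20 slots at a
    deterministic time; only who produces it stays secret and random."""
    return int(1 / F)   # the gap is constant by construction

print("random slots, worst gap this epoch:", max_gap_random_slots(), "s")
print("fixed cadence, every gap          :", max_gap_fixed_cadence(), "s")
```

The random-slot scheme typically yields a worst gap of a few minutes per epoch, while the fixed cadence pins every gap at 20 seconds - at the cost of revealing when (though not by whom) the next block is due.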

3 Likes

Yes, I normally have good connectivity with other peers on all 7 of my relays. I have an alert for whenever the number of connections falls below a threshold, and it did not trigger. Let me know if you need any other specific info to help with this.

1 Like

We missed a block at around the same time - definitely over 4 minutes of no blocks.
Here is my log from the BP:

2 Likes

Just became aware of this topic.
In my opinion, the drop in remote peer connections was not the cause of the block pause but the other way around: (p2p) nodes will drop remote connections if a remote peer does not announce a new block height for a certain time.
In a situation where no new blocks are produced, there is nothing to pipeline as a new header, and so the connections drop, as visible in the posted graphs with some 1-2 minutes of delay.

By counting the slot gaps between all blocks of an epoch, we get a distribution like this:

The light orange bars are slot battles (the ~5% of random collisions for the same slot).
The blue bars are the actually counted gaps between blocks.
The dark orange bars were not counted but interpolated to fill up the expected distribution curve; this amount is what is lost in density through height battles.

As we can see, block interruptions of more than 100 seconds are quite rare. However, a total of 65 interruptions of more than 120 seconds were counted in epoch 376.
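
For reference, the counting itself is simple. A sketch (assuming you can export the slot numbers of all blocks in an epoch, e.g. from db-sync; the input format is my assumption):

```python
from collections import Counter

def gap_distribution(block_slots):
    """Histogram of slot gaps between consecutive on-chain blocks."""
    slots = sorted(set(block_slots))   # at most one block per slot on-chain
    gaps = [b - a for a, b in zip(slots, slots[1:])]
    return Counter(gaps), max(gaps)

# Example with made-up slot numbers:
hist, longest = gap_distribution([10, 12, 40, 41, 170, 430])
print(hist)                    # gap length -> how often it occurred
print("longest gap:", longest, "slots (= seconds on mainnet)")
```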

Now let's look at another graph, showing the max block delay for every epoch since genesis:

Let's ignore the Byron epochs 0-206. They used BFT consensus and different slot settings in a federated network. Only with Shelley did the randomness of Ouroboros PoS kick in.

There are two Y-axes showing the same time, for convenience: on the left in seconds and on the right in minutes.

This graph shows two values for each epoch:
The blue line looks explicitly at the epoch boundary gap.
The orange bar shows the longest block gap in the whole epoch, excluding the boundary gap.

We can see that practically every epoch has at least one 3-4 minute gap, and 5-6 minutes are also possible.

3 Likes

But: We have two pools in this thread that would have had a block assigned during this pause. The drop in remote connections would explain why those blocks didn't get adopted, but not the other way round.

1 Like

Yes, definitely interesting to look into.
May I ask about their topology?

A possible explanation would be: they connect only to IOG (default topology) and aren't subscribed to TU. If they are small in active stake and had only a few incoming connections from p2p relays, those may all have dropped their connections after a 2-minute phase without blocks. (<<< just guessing)

1 Like

Both of my relays are using Topology Updater.

By looking at one of my relays' block height and remote peer connections:


[screenshot: block height and remote peer connections]

I see the block pause at 7:16 - 7:21 GMT.
8 of 18 remote peers dropped their connections but quickly re-established them.

At 08:00 GMT there was another such block pause.

Also, I can't see any extraordinary spike in CPU, memory, disk, or network I/O at those times.

Can I have a look at the logs from those minutes? (DM)

3 Likes