Hey guys,
I just lost a block during a strange period this epoch (Dec 7th). It was at around 8am CET, so not during the epoch switch.
There were no blocks minted for almost 5 minutes.
Mempool was big.
Many relays lost many connections.
No big processor usage increase.
I had a block scheduled during this period which was minted but was probably ghosted.
Feels like an epoch change where many nodes are busy doing some calculations?
I looked at the same time 5 days ago and didn't see this situation.
Any idea what's happening? Maybe due to some changes in 1.35.4?
I think I have seen some discussion (though not in the forum - the IOG Discord, maybe?) that those 5-minute breaks happen occasionally, with no conclusion as to why exactly.
Fundamentally there will be block-production gaps - block arrivals follow a Poisson process, so the intervals between blocks are (approximately) exponentially distributed. This is a consequence of the random mechanism for establishing leadership, and it comes with having a hidden (until used) leadership schedule.
We would expect such gaps about once or twice a year (on average).
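For intuition, here is a back-of-the-envelope sketch (my own illustration, not a measurement): assuming mainnet's active slot coefficient f = 0.05 and 1-second slots, and modelling leadership as independent per-slot coin flips, the chance of a long empty stretch after any given block is tiny, but with roughly 1.58 million blocks a year the long tail shows up:

```python
F = 0.05                            # active slot coefficient: ~5% of slots have a leader
SLOTS_PER_YEAR = 365 * 24 * 3600    # 1-second slots

def p_gap_at_least(seconds: int) -> float:
    """Probability that a given block is followed by at least `seconds` empty slots."""
    return (1 - F) ** seconds

def expected_gaps_per_year(seconds: int) -> float:
    """Expected number of gaps of at least `seconds` in a year of blocks."""
    blocks_per_year = SLOTS_PER_YEAR * F    # roughly 1.58 million
    return blocks_per_year * p_gap_at_least(seconds)

for s in (120, 240, 300):
    print(f"gap >= {s}s: p = {p_gap_at_least(s):.2e}, "
          f"expected per year ~ {expected_gaps_per_year(s):.2f}")
```

Under these assumptions, ~2-minute gaps happen many times a day, ~4-minute gaps several times a year, and ~5-minute gaps somewhat less than once a year - so a multi-minute pause is expected behaviour of the lottery, not by itself a fault.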
I think you are right. I only have evidence for my own block.
So then the question is: why did my block not get adopted?
Also, what would explain the number of relay connections falling off during that period?
For CIP-Leios I'm proposing time-deterministic, periodic RB, EB, and IB production with a random, unknown leader. It is not necessary to have both block-timing randomness and leader randomness, IMHO. My idea would prevent long time gaps.
However, this seems like a parallel-programming problem in the node. Network connections should not be dropped because the node is busy with a single task that blocks everything. We should not have things fall apart at epoch boundaries. Leios doesn't matter until the node as it is now runs smoothly.
What does the internal parallelism look like? Does anybody have a document to share?
Do you (normally) have good connectivity to other peers? Although the network connections do "timeout" (they can't tell the difference between such a "long" gap and the remote peer failing), those timeouts are staggered so that overall connectivity is re-established.
That is not what happens; there is a range of timeouts (set at different times to track different probabilities of false positives). The timeout is about lack of communication, as you can't measure - remotely - how busy a node is, only that it didn't communicate.
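As a toy illustration of that staggering (made-up simulation code, not the actual ouroboros-network implementation; the peer count and timeout range are invented values), each connection draws its own inactivity timeout, so a chain-wide silence trips the timeouts gradually rather than dropping every connection at the same instant:

```python
import random

random.seed(42)

N_PEERS = 8
TIMEOUT_RANGE = (60.0, 180.0)   # seconds; illustrative values, not protocol constants

# Each connection gets its own inactivity timeout drawn from the range.
timeouts = [random.uniform(*TIMEOUT_RANGE) for _ in range(N_PEERS)]

def still_connected(silence_seconds: float) -> int:
    """Number of connections surviving a given period with no messages."""
    return sum(1 for t in timeouts if t > silence_seconds)

# During a long block gap, connectivity degrades step by step, not all at once.
for silence in (30, 90, 150):
    print(f"after {silence}s of silence: {still_connected(silence)} peers left")
```

The point is that a single long gap thins out connections progressively, and once blocks flow again the remaining (and re-established) connections recover overall connectivity.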
The problem with determinism (in this context) is that it exposes the identity of the peer performing the action; that, in turn, exposes it to a targeted denial-of-service attack. That would not be an acceptable trade.
This is why the random schedule is used - there is no a priori information as to which node is going to perform the action
You know this problem exists because the roughly 21,600 leaders are distributed randomly within 432,000 slots, with the potential for large gaps. You aren't just picking random secret leaders; you are also picking a random block cadence.
Fine; for Leios I'm recommending a random secret leader with 21,600 slots/epoch for RBs, a random secret leader with 42,300? slots/epoch for EBs, and a random secret leader with 432,000 slots/epoch for IBs. It seems I will have to continue to improve my communication to get this idea across.
Yes, I normally have good connectivity with other peers on all 7 of my relays. I have an alert for whenever the number of connections falls below a threshold, and it does not trigger. Let me know if you need any other specific info to help with this.
Just became aware of this topic.
In my opinion, the drop in remote peer connections was not the cause of the block pause but the other way around: (p2p) nodes will drop remote connections if a remote peer does not announce a new block height for a certain time.
In a situation where no new blocks are produced, there are no new headers to pipeline, and so the connections drop, as visible in the posted graphs with a delay of some 1-2 minutes.
Counting the slot gaps between all blocks of an epoch gives such a distribution:
The light orange bars are slot battles (the ~5% of random collisions for the same slot).
The blue bars are the actually counted gaps between blocks.
The dark orange bars were not counted but interpolated to fill out the expected distribution curve. This amount is what is lost in density through height battles.
As we can see, block interruptions of more than 100 seconds are quite rare. However, a total of 65 interruptions of more than 120 seconds were counted in epoch 376
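That count is roughly what a naive simulation of the slot lottery predicts (a sketch assuming f = 0.05 and 432,000 one-second slots; it counts gaps in the schedule itself, so blocks lost to height battles - the interpolated bars above - would push the real on-chain count somewhat higher):

```python
import random

F = 0.05                  # active slot coefficient
SLOTS_PER_EPOCH = 432_000 # 5 days of 1-second slots
THRESHOLD = 120           # count gaps longer than 120 seconds

random.seed(2024)
gap = 0
long_gaps = 0
for _ in range(SLOTS_PER_EPOCH):
    if random.random() < F:       # this slot has a leader -> a block is minted
        if gap > THRESHOLD:
            long_gaps += 1
        gap = 0
    else:
        gap += 1

print(f"gaps > {THRESHOLD}s in one simulated epoch: {long_gaps}")
```

A typical run lands somewhere in the 30-60 range, the same ballpark as the 65 counted on-chain for epoch 376.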
Now let's look at another graph, showing the max block delay for every epoch since genesis:
Let's ignore the Byron epochs 0-206: they ran a federated BFT consensus with different slot settings. Only with Shelley did the randomness of the Ouroboros PoS kick in.
There are two Y-axes showing the same time, just for convenience: at left in seconds and at right in minutes.
This graph shows two values for each epoch:
The blue line explicitly shows the epoch-boundary gap.
The orange bar shows the longest block gap in the whole epoch, excluding the boundary gap.
We can see that practically every epoch has at least one 3-4 minute gap, and 5-6 minutes is also possible.
But: we have two pools in this thread who would have had a block assigned during this pause. The drop in remote connections would explain why those blocks didn't get adopted, but not the other way round.
Yes, definitely interesting to look into.
May I ask about their topology?
A possible explanation would be that they connect only to IOG (default topology) and aren't subscribed to TU. If they are small in active stake and had only a few incoming connections from p2p relays, those peers may all have dropped their connections after a 2-min phase without blocks. ( <<< just guessing )