I just lost a block during a strange period this epoch (Dec 7th). It was at around 8am CET, so not during the epoch switch.
There were no blocks minted for almost 5 minutes.
Mempool was big.
Many relays lost many connections.
No big processor usage increase.
I had a block scheduled during this period which was minted but was probably ghosted.
Feels like an epoch change where many nodes are busy doing some calculations?
I looked at the same time 5 days ago and this situation was not found.
Any idea what’s happening? Maybe due to some changes in 1.35.4?
Fundamentally, there will be block production gaps: block production is a Poisson process, so the intervals between blocks are roughly exponentially distributed (a consequence of the random mechanism for establishing leadership). This consequence comes with having a leadership schedule that is hidden until used.
We would expect such gaps about once or twice a year (on average).
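A rough back-of-the-envelope check of that estimate, as a sketch only. It assumes each 1-second slot independently produces a block with probability 0.05 (about 21,600 blocks in 432,000 slots), 73 five-day epochs per year, and it ignores blocks lost to slot or height battles. The function names are made up for illustration and are not part of the node.

```python
SLOTS_PER_EPOCH = 432_000   # 5 days of 1-second slots
ACTIVE_SLOT_COEFF = 0.05    # assumed chance that a given slot has a block
EPOCHS_PER_YEAR = 73        # 365 days / 5 days per epoch


def prob_gap_longer_than(slots, f=ACTIVE_SLOT_COEFF):
    """Probability that the `slots` slots following a block are all empty."""
    return (1.0 - f) ** slots


def expected_gaps_per_year(slots, f=ACTIVE_SLOT_COEFF):
    """Expected number of gaps per year that last longer than `slots` slots."""
    blocks_per_year = SLOTS_PER_EPOCH * f * EPOCHS_PER_YEAR
    return blocks_per_year * prob_gap_longer_than(slots, f)


if __name__ == "__main__":
    for seconds in (120, 270, 300):
        print(f"gaps > {seconds}s per year: {expected_gaps_per_year(seconds):.2f}")
```

Under these assumptions, gaps of roughly four and a half to five minutes come out at the order of one per year, which is in line with the estimate above.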
I think you are right. I only have evidence for my own block.
So then the question is: why did my block not get adopted?
Also, what would explain the number of relay connections falling off during that period?
For CIP-Leios I’m proposing time deterministic periodic RB EB and IB production with a random unknown leader. It is not necessary to have block timing randomness and leader randomness IMHO. My idea would prevent long time gaps.
However this seems like a parallel programming problem in the node. Network connections should not be dropped because the node is busy with a single task that blocks everything. We should not have things fall apart at epoch boundaries. Leios doesn’t matter until the node as it is now is smooth.
What does the internal parallelism look like? Anybody have a document to share?
Do you (normally) have good connectivity to other peers? Although the network connections do time out (they can’t tell the difference between such a ‘long’ gap and the remote peer failing), those timeouts are staggered so that overall connectivity is re-established.
That is not what happens. There is a range of timeouts (set at different times to track different probabilities of false positives). The timeout is about lack of communication: you can’t measure remotely how busy a node is, only that it didn’t communicate.
The problem with determinism (in this context) is that it exposes the identity of the peer performing the action; that, in turn, exposes it to a targeted denial of service attack. That would not be an acceptable trade.
This is why the random schedule is used: there is no a priori information as to which node is going to perform the action.
You know this problem arises because the roughly 21,600 leaders are distributed randomly within 432,000 slots, with potential for large gaps. You aren’t just picking random secret leaders; you are also picking a random block cadence.
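To make the cadence point concrete, here is a small simulation sketch. It assumes every slot independently gets a leader with probability 0.05 and ignores stake distribution, slot battles and propagation; `simulate_epoch` is a made-up helper for illustration, not anything from the node.

```python
import random


def simulate_epoch(slots=432_000, f=0.05, seed=0):
    """Draw a random leader schedule for one epoch.

    Returns (number_of_blocks, longest_gap_in_slots), where a gap is the
    slot distance between two consecutive blocks.
    """
    rng = random.Random(seed)
    # Each slot independently has a block with probability f (assumption).
    block_slots = [s for s in range(slots) if rng.random() < f]
    gaps = [b - a for a, b in zip(block_slots, block_slots[1:])]
    return len(block_slots), max(gaps)


if __name__ == "__main__":
    for seed in range(3):
        blocks, longest = simulate_epoch(seed=seed)
        print(f"seed={seed}: {blocks} blocks, longest gap {longest} slots")
```

Running it for a few seeds shows the same average density each time but a different longest gap, which is the random cadence described above.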
Fine, for Leios I’m recommending a random secret leader with 21,600 slots/epoch for RBs, a random secret leader with 42,300? slots/epoch for EBs, and a random secret leader with 432,000 slots/epoch for IBs. It seems I will have to continue to improve my communication to get this idea across.
Yes, I normally have good connectivity with other peers on all 7 of my relays. I have an alert for whenever the number of connections falls below a threshold, and this does not normally happen. Let me know if you need any other specific info to help with this.
Just became aware of this topic.
In my opinion, the drop in remote peer connections was not the cause of the block pause, but the other way around: (p2p) nodes will drop remote connections if a remote peer does not announce a new block height for a certain time.
In a situation where no new blocks are produced, there is no new header to pipeline, and so the connections drop, as visible in the posted graphs with some 1-2 of delay.
Counting the slot gaps between all blocks of an epoch gives a distribution like this:
The light orange bars are slot battles (5% of random collisions for the same slot).
The blue bars are the gaps between blocks that were actually counted.
The dark orange bars were not counted but interpolated to fill up the expected distribution curve. This amount is what is lost in density through height battles.
As we can see, block interruptions of more than 100 seconds are quite rare. However, a total of 65 interruptions of more than 120 seconds were counted in epoch 376.
Now let’s look at another graph, showing the max block delay for every epoch since genesis:
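For anyone who wants to reproduce this kind of count, a minimal sketch follows. It assumes you already have the slot numbers of an epoch's blocks as a plain list (how you export them is up to you) and that one slot is one second; `gap_histogram` and `count_gaps_over` are made-up names for illustration.

```python
from collections import Counter


def slot_gaps(block_slots):
    """Slot distance between each pair of consecutive blocks."""
    slots = sorted(block_slots)
    return [b - a for a, b in zip(slots, slots[1:])]


def gap_histogram(block_slots, bucket=10):
    """Count gaps per bucket; the key is the lower edge of each bucket."""
    return Counter((g // bucket) * bucket for g in slot_gaps(block_slots))


def count_gaps_over(block_slots, threshold):
    """Number of gaps strictly longer than `threshold` slots."""
    return sum(1 for g in slot_gaps(block_slots) if g > threshold)


if __name__ == "__main__":
    # Tiny hand-made example, not real chain data.
    example = [0, 5, 20, 150, 160, 400]
    print(gap_histogram(example))
    print(count_gaps_over(example, 120))
```

The example list is made up only to show the call shape; with real data you would pass in the slot numbers of every block in the epoch.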
Yes, definitely interesting to look into.
May I ask about their topology?
A possible explanation would be that they connect only to IOG (default topology) and aren’t subscribed to TU. If they are small in active stake and had only a few incoming connections from p2p relays, those relays may all have dropped their connections after a 2min phase without blocks. ( <<< just guessing )