Missed slot - trying to understand what happend


So today we missed a block and we are now investigating logs trying to understand what happened. Our block producer node is running with 12GB of RAM and 4vCPUs. According to graphs CPU utulization was about 16% at the block creation time and node had 3.2GB free memory (cached) plus we have 8GB of SWAP. For this reason we don’t think this was resource issue but probably propagation delays.

Our block was scheduled at slot 45755179 (11:31:10) but 2 slots before (45755177) there was another block created by different stake pool with the same height 6521190. Please see below logs





Based on the logs above we suspect the other pool block didn’t get to our BP on time and our pool tried to create block (fork the chain) with the same height but it was refused by the network due to higher slot number? Also, next block (after ours) was created 2 slots later so the pool that was creating blocks would receive both ouers and previous block almost at the same time and selected block with the lower slot number (created earlier). Does that make sense?

Here’s the link to the other pool block and its propagation delays. What we see is that there are only few relays getting that block. Not sure if that’s issue with pooltool or something else?

Was this just a slot height battle?

This is a ghosted block.
You did not receive the previous block in time which has the same block number.
Therefore your block did not include this other block (2 slots earlier) in his chain history.
As both blocks have the same number other nodes needed to decide which block to keep.
The other block was earlier and is selected.

This is some explanation which I gave to my delegators when the same thing happened recently to me:

Ghosted means that the block was minted but it was reverted by other pools.
A ghosted block is not generating rewards. Anyways it’s a normal happening in Cardano currently.

So what happened in detail:

  • Exactly one slot before our assigned slot a block was minted by another pool
  • This slot was not propagated (sent to other nodes) within one second (actually we received it 2 second after it was minted)
  • As this block was not visible to our pool, our pool minted a block which was based on prior block (and not included the one which was minted 1 second before)
  • As our pool does not know about the other block at this time it generates the new block with the same block number as the other block
  • When we are distributing our block then each pool will decide which of the 2 resulting chains to choose.
  • In this case the other block wins because it has the same block number but was minted earlier and reaching other pools earlier as well.

Who is responsible for this?

  • Generally one could say that the other pool is responsible for propagating it’s block in time. Some pools have a poorly synchronised time
  • In this case this was not the reason. The reason is an increased propagation time which occurs in a time-frame of 48h - 72h after Epoch switch.
  • During this time frame Rewards are calculated which impacts Node performance of all pools and by that increases the avg propagation time to ~1.5s
  • This means in any 2 slot directly after each other assignments the second block is expected to be ghosted.

How often does this happen?

  • The chance to have a block right after out block is 5%
  • The affected time frame (48h-72h) is one fifth of the Epoch time
  • By that, the risk for a ghosted scenario is 1%

Thanks for the clear explanation. Will running more than 2 relay nodes in diverse location help?

Additional relays may help reduce propagation delays. But I think in this case this would not have helped.
Take a look at the block which was actually minted by ECP:

There you can take a look at the propagation delays. The default Chart shows only delays up to 2s.
To get the full picture I downloaded the JSON and visualized it.

Propagation Delays

Almost all nodes hat a delay of >2s for this block. So you had no chance to receive it in time.

Yes, exploring blocks just before and after 6521190 we can see that propagation delays was high for most of the blocks produced at that time. The time frame is

6521189  BNP41 at 12:30:51  - minted
6521190  ECP   at 12:31:08  - minted
6521190  VLHLA at 12:31:10  - minted but ghosted
6521191  BCSH  at 12:31:12  - propagation delays not collected by Pooltool 

BCSH pool gets both blocks from ECP/VLHLA but since ECP was created 2 slots (seconds) earlier it adds ECP block to the chain

So the conclusion is high propagation time for blocks produced at that time plus 3 blocks created in succession shortly (2 seconds) after each other (ECP/VLHLA/BCSH) caused our block to be ghosted. Bit unfortunate timing but we accept this as it’s general network issue.

It’s bad luck. But it will happen to ~1% of all blocks currently. Only thing that can be done is tuning propagation times and make sure time is synced through chrony.

There is a ‘bug’ in the node that causes a delay in block propagation from about 48hrs into the epoch until about 72 hrs into the epoch. This is since node version 1.29.0 and isn’t resolved yet. Every epoch since then, there’re a few pools missing blocks during that timeframe. If you slot was also, that’ll be the reason and there’s little you can do about it.

Yes, and no :slight_smile:
So this bug is causing slots to propagate in 1.5s on average during the 48h+24h window.
But in this case, the block was not propagated after 2s.

So I think trying to tune the propagation behaviour of each individual pool would help avoid (but not eliminate) that issue.

I just published an article about how we (xSPO Alliance) think the risk of ghosted blocks could be reduced. It requires participation of any pool. So I’d be glad if many of you guys are taking a look and giving feedback on it :slight_smile: