Missed slot - trying to understand what happend

t0m3kf · 19 November 2021 19:32

Hi,

So today we missed a block and we are now investigating logs trying to understand what happened. Our block producer node is running with 12GB of RAM and 4vCPUs. According to graphs CPU utulization was about 16% at the block creation time and node had 3.2GB free memory (cached) plus we have 8GB of SWAP. For this reason we don’t think this was resource issue but probably propagation delays.

Our block was scheduled at slot 45755179 (11:31:10) but 2 slots before (45755177) there was another block created by different stake pool with the same height 6521190. Please see below logs

{“thread”:“255”,“sev”:“Info”,“data”:{“val”:{“kind”:“TraceForgedBlock”,“blockPrev”:“ef0d800e1a05b1538f09635e74081c0fdf02578628f73e90d3934e50865c8d93”,“slot”:45755179,“block”:“36eb76af11cf93625c95806f9fef79de63b66d1946b0ba75411d3f6675221fbe”,“blockNo”:6521190},“credentials”:“Cardano”},“loc”:null,“env”:“1.31.0:2cbe3”,“msg”:"",“app”:,“host”:“core”,“at”:“2021-11-19T11:31:10.86Z”,“ns”:[“cardano.node.Forge”],“pid”:“1974952”}

{“thread”:“375”,“sev”:“Info”,“data”:{“kind”:“ChainSyncServerEvent.TraceChainSyncServerReadBlocked.AddBlock”,“slot”:45755177,“block”:“4ecb2ed84cf1ee3d72730c48c5977c2482738cd24ce79feca82fe0baf5314b5d”,“blockNo”:6521190},“loc”:null,“env”:“1.31.0:2cbe3”,“msg”:"",“app”:,“host”:“core”,“at”:“2021-11-19T11:31:12.33Z”,“ns”:[“cardano.node.ChainSyncHeaderServer”],“pid”:“1974952”}

{“thread”:“245”,“sev”:“Info”,“data”:{“kind”:“TraceAddBlockEvent.TrySwitchToAFork”,“block”:{“kind”:“Point”,“hash”:“36eb76af11cf93625c95806f9fef79de63b66d1946b0ba75411d3f6675221fbe”,“slot”:45755179}},“loc”:null,“env”:“1.31.0:2cbe3”,“msg”:"",“app”:,“host”:“core”,“at”:“2021-11-19T11:31:12.34Z”,“ns”:[“cardano.node.ChainDB”],“pid”:“1974952”}

{“thread”:“255”,“sev”:“Error”,“data”:{“val”:{“kind”:“TraceDidntAdoptBlock”,“slot”:45755179},“credentials”:“Cardano”},“loc”:null,“env”:“1.31.0:2cbe3”,“msg”:"",“app”:,“host”:“core”,“at”:“2021-11-19T11:31:12.34Z”,“ns”:[“cardano.node.Forge”],“pid”:“1974952”}

Based on the logs above we suspect the other pool block didn’t get to our BP on time and our pool tried to create block (fork the chain) with the same height but it was refused by the network due to higher slot number? Also, next block (after ours) was created 2 slots later so the pool that was creating blocks would receive both ouers and previous block almost at the same time and selected block with the lower slot number (created earlier). Does that make sense?

Here’s the link to the other pool block and its propagation delays. What we see is that there are only few relays getting that block. Not sure if that’s issue with pooltool or something else?

Was this just a slot height battle?

Markus-VITAL · 19 November 2021 20:47

This is a ghosted block.
You did not receive the previous block in time which has the same block number.
Therefore your block did not include this other block (2 slots earlier) in his chain history.
As both blocks have the same number other nodes needed to decide which block to keep.
The other block was earlier and is selected.

Markus-VITAL · 19 November 2021 20:49

This is some explanation which I gave to my delegators when the same thing happened recently to me:

Ghosted means that the block was minted but it was reverted by other pools.
A ghosted block is not generating rewards. Anyways it’s a normal happening in Cardano currently.

So what happened in detail:

Exactly one slot before our assigned slot a block was minted by another pool
This slot was not propagated (sent to other nodes) within one second (actually we received it 2 second after it was minted)
As this block was not visible to our pool, our pool minted a block which was based on prior block (and not included the one which was minted 1 second before)
As our pool does not know about the other block at this time it generates the new block with the same block number as the other block
When we are distributing our block then each pool will decide which of the 2 resulting chains to choose.
In this case the other block wins because it has the same block number but was minted earlier and reaching other pools earlier as well.

Who is responsible for this?

Generally one could say that the other pool is responsible for propagating it’s block in time. Some pools have a poorly synchronised time
In this case this was not the reason. The reason is an increased propagation time which occurs in a time-frame of 48h - 72h after Epoch switch.
During this time frame Rewards are calculated which impacts Node performance of all pools and by that increases the avg propagation time to ~1.5s
This means in any 2 slot directly after each other assignments the second block is expected to be ghosted.

How often does this happen?

The chance to have a block right after out block is 5%
The affected time frame (48h-72h) is one fifth of the Epoch time
By that, the risk for a ghosted scenario is 1%

Shibu_Gope · 19 November 2021 23:36

Thanks for the clear explanation. Will running more than 2 relay nodes in diverse location help?

Markus-VITAL · 19 November 2021 23:57

Additional relays may help reduce propagation delays. But I think in this case this would not have helped.
Take a look at the block which was actually minted by ECP:

There you can take a look at the propagation delays. The default Chart shows only delays up to 2s.
To get the full picture I downloaded the JSON and visualized it.

Propagation Delays

Almost all nodes hat a delay of >2s for this block. So you had no chance to receive it in time.

t0m3kf · 20 November 2021 08:45

Yes, exploring blocks just before and after 6521190 we can see that propagation delays was high for most of the blocks produced at that time. The time frame is

6521189  BNP41 at 12:30:51  - minted
6521190  ECP   at 12:31:08  - minted
6521190  VLHLA at 12:31:10  - minted but ghosted
6521191  BCSH  at 12:31:12  - propagation delays not collected by Pooltool

BCSH pool gets both blocks from ECP/VLHLA but since ECP was created 2 slots (seconds) earlier it adds ECP block to the chain

So the conclusion is high propagation time for blocks produced at that time plus 3 blocks created in succession shortly (2 seconds) after each other (ECP/VLHLA/BCSH) caused our block to be ghosted. Bit unfortunate timing but we accept this as it’s general network issue.

Markus-VITAL · 20 November 2021 10:47

It’s bad luck. But it will happen to ~1% of all blocks currently. Only thing that can be done is tuning propagation times and make sure time is synced through chrony.

brouwerQ · 21 November 2021 14:41

There is a ‘bug’ in the node that causes a delay in block propagation from about 48hrs into the epoch until about 72 hrs into the epoch. This is since node version 1.29.0 and isn’t resolved yet. Every epoch since then, there’re a few pools missing blocks during that timeframe. If you slot was also, that’ll be the reason and there’s little you can do about it.

github.com/input-output-hk/cardano-node

[BUG] - Propagation delay increase with 1.29.0+ with start of rewards calculation (at the 48 hour mark) for about 24 hours

opened 01:09PM - 05 Sep 21 UTC

Straightpool

bug

**Internal/External** *External* otherwise. **Area** *Other* Block propa…gation **Summary** After the 48 hour mark (validated in both epoch 288 and 289) with the start of the rewards calculation block propagation delays increase significantly for little less than 24 hours (until end of rewards calculation). **Steps to reproduce** Steps to reproduce the behavior: 1. Start block producer setup on 1.29.0 on mainnet 2. Wait on epoch switch + 48 hours 3. Wait on minted block within about 24 hours 4. Track propagation delays 5. See error (especially when you compare with propagation delays on 1.29.0 before Thu 04 Sep 2021 (21:45:00 UTC) 6. After about 24 hours the propagation delays return back to normal **Expected behavior** Propagation delays should not increase with 1.29.0 with or without a parallel rewards calculation on the node. **System info (please complete the following information):** - OS Name: Ubunto - OS Version: 20.04.1 LTS - Node version: `cardano-node 1.29.0 - linux-x86_64 - ghc-8.10 git rev 4c59442958072657812c6c0bb8e0b4ab85ce1ba2` - CLI version: `cardano-cli 1.29.0 - linux-x86_64 - ghc-8.10 git rev 4c59442958072657812c6c0bb8e0b4ab85ce1ba2` **Screenshots and attachments** _Block propagation before the 48 hour mark on 1.29.0_ ![image](https://user-images.githubusercontent.com/42584250/132128032-6c3dba67-14ac-4ac4-8b0d-65c4cb8c7f01.png) _Block producer log:_ {"thread":"111","loc":null,"data":{"val":{"kind":"TraceForgedBlock","blockPrev":"6e3aeabd954f9399a926768a1344fa49914bff15fc86220d7bb1a699200cf98e","slot":39201349,"block":"96e80c3c7b2b2a2c95136cffda062212dc1d96b91be578fb60e1128ab74ff2a4","blockNo":6201140},"credentials":"Cardano"},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20cn","pid":"156659","ns":["cardano.node.Forge"],"at":"2021-09-04T15:00:40.02Z"} {"thread":"101","loc":null,"data":{"newtip":"96e80c3c7b2b2a2c95136cffda062212dc1d96b91be578fb60e1128ab74ff2a4@39201349","chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain"},"sev":"Notice","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20cn","pid":"156659","ns":["cardano.node.ChainDB"],"at":"2021-09-04T15:00:40.04Z"} is a gap of 0.02S _Relay log:_ {"thread":"48667","loc":null,"data":{"kind":"ChainSyncClientEvent.TraceDownloadedHeader","peer":[....]{"remote":},"slot":39201349,"block":"96e80c3c7b2b2a2c95136cffda062212dc1d96b91be578fb60e1128ab74ff2a4","blockNo":6201140},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20r1","pid":"135872","ns":["cardano.node.ChainSyncClient"],"at":"2021-09-04T15:00:40.04Z"} to first {"thread":"609","loc":null,"data":{"kind":"TraceBlockFetchServerSendBlock","block":"96e80c3c7b2b2a2c95136cffda062212dc1d96b91be578fb60e1128ab74ff2a4"},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20r1","pid":"135872","ns":["cardano.node.BlockFetchServer"],"at":"2021-09-04T15:00:40.10Z"} is a gap of 0.06 seconds. _Block propagation time after the 48 hour mark on 1.29.0_ ![image](https://user-images.githubusercontent.com/42584250/132128013-2cac9555-19c7-40ea-af3c-07d575874216.png) _Block producer log:_ {"thread":"111","loc":null,"data":{"val":{"kind":"TraceForgedBlock","blockPrev":"0b0c9df3c3313197e760bafc7a9370e7be676effc7aeecde7231b4e07898c2b2","slot":39271659,"block":"26541add2818066ae41b07f8788f646bb4954202057f83841bc4a3e1a7ab8024","blockNo":6204568},"credentials":"Cardano"},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20cn","pid":"156659","ns":["cardano.node.Forge"],"at":"2021-09-05T10:32:30.15Z"} {"thread":"101","loc":null,"data":{"newtip":"26541add2818066ae41b07f8788f646bb4954202057f83841bc4a3e1a7ab8024@39271659","chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain"},"sev":"Notice","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20cn","pid":"156659","ns":["cardano.node.ChainDB"],"at":"2021-09-05T10:32:30.29Z"} is a gap of 0.14S. **This is an increase of factor 7 on the block producer.** _Relay log:_ {"thread":"48667","loc":null,"data":{"kind":"ChainSyncClientEvent.TraceDownloadedHeader","peer":{"remote":[...],"slot":39271659,"block":"26541add2818066ae41b07f8788f646bb4954202057f83841bc4a3e1a7ab8024","blockNo":6204568},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20r1","pid":"135872","ns":["cardano.node.ChainSyncClient"],"at":"2021-09-05T10:32:30.37Z"} to first {"thread":"609","loc":null,"data":{"kind":"TraceBlockFetchServerSendBlock","block":"26541add2818066ae41b07f8788f646bb4954202057f83841bc4a3e1a7ab8024"},"sev":"Info","env":"1.29.0:4c594","msg":"","app":[],"host":"ub20r1","pid":"135872","ns":["cardano.node.BlockFetchServer"],"at":"2021-09-05T10:32:30.68Z"} is a gap of 0.31seconds. **This is an increase of factor 5.1 on the relay.** _After less than 24 hours the propagation delay is back to normal:_ ![image](https://user-images.githubusercontent.com/42584250/132140088-cfcf3cc7-b061-4049-bd38-9f43f928b985.png) **Additional context** - Node restart does not fix the issue - This is not a reporting issue, other setups tracking propagation are showing the same symptoms - Throwing additional CPU cores at the nodes might help for an individual hop, this is clearly connected to rewards calculation which kicks in 48 hours after epoch start, most likely a CPU scheduling issue - This issue is related to https://github.com/input-output-hk/ouroboros-network/issues/2913 as it can lead to block collision

Markus-VITAL · 21 November 2021 22:24

Yes, and no
So this bug is causing slots to propagate in 1.5s on average during the 48h+24h window.
But in this case, the block was not propagated after 2s.

So I think trying to tune the propagation behaviour of each individual pool would help avoid (but not eliminate) that issue.

Markus-VITAL · 21 November 2021 22:52

I just published an article about how we (xSPO Alliance) think the risk of ghosted blocks could be reduced. It requires participation of any pool. So I’d be glad if many of you guys are taking a look and giving feedback on it

https://forum.cardano.org/t/avoiding-ghosted-blocks-action-required/85802

Topic		Replies	Views
Ghosted Blocks - How to optimize Propagation Times - Action Required Operate a Stake Pool	7	3355	9 May 2022
Orphaned block Operate a Stake Pool	22	1810	7 April 2021
Invalid block - no known cause Operate a Stake Pool	13	1434	16 December 2021
I missed a block today. please help me understand the logs Operate a Stake Pool	33	2670	27 October 2022
No blocks during the epoch for almost 5 minutes?! Operate a Stake Pool	41	1860	9 May 2023

Missed slot - trying to understand what happend

Related topics