Ghosted Blocks - How to optimize Propagation Times - Action Required

Introduction
A couple of SPOs in the xSPO alliance have been analyzing several ghosted-block scenarios. This post intends to raise awareness about the connection between ghosted blocks and block propagation times and to put focus on the need for optimizing propagation times.

Ghosted blocks can be caused by suboptimal pool setups and penalize all operators, making them lose blocks. All operators are urged to take a look at their pool’s propagation times and take optimization steps.

A short explanation of a typical ghosted block scenario:
Pool A mints a block.
Then Pool B mints a block in the next slot.
Because the blocks are in directly subsequent slots, Pool A’s block is not yet known to Pool B at this time.
Pool B’s block gets the same block number as Pool A’s and will be ghosted.

This means Pool B can be impacted by the propagation delays of Pool A.
More details on a scenario and mitigation steps are found in the next post.

Network stability and IOG view
While IOG considers a 5s propagation time an acceptable limit, it’s still valuable to optimize propagation times so that we run into this limit as late as possible. I understand this limit as the maximum that can still be considered reliable network operation. My assumption is that a 5s average propagation time would cause ~25% ghosted blocks.
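As a rough sanity check on that assumption, here is a hedged back-of-envelope estimate. On mainnet, roughly 5% of the 1-second slots produce a block (active slot coefficient f = 0.05). If a block needs d seconds to propagate, any block scheduled during those d slots is minted without knowledge of it. This simplified model ignores multi-way battles and assumes every pool sees the block after exactly d seconds:

```python
# Back-of-envelope estimate: probability that at least one other block is
# scheduled while ours is still propagating. Assumes mainnet's active slot
# coefficient f = 0.05 and 1-second slots; a simplification, not a measurement.
f = 0.05  # chance that any given slot produces a block

def ghost_risk(prop_delay_slots: int) -> float:
    """P(at least one block is minted in the next `prop_delay_slots` slots)."""
    return 1 - (1 - f) ** prop_delay_slots

print(f"{ghost_risk(5):.1%}")  # ~23% for a 5s propagation time
print(f"{ghost_risk(1):.1%}")  # ~5% for a 1s propagation time
```

A 5-second delay lands near 23% in this model, which is in the same ballpark as the ~25% assumption above.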

Current IOG Blog Article: Slow and steady wins the race: network evolution for network growth - IOHK Blog

The article describes that the block size will be increased in small steps, which will further increase propagation times. This makes the topic even more important.


The previous post was just the introduction. Let’s get into some more details:

Delimitation
We are not talking about stolen blocks here. Stolen blocks are part of the Cardano/Ouroboros design and happen constantly. They are nothing to be concerned about, as they are not increasing over time.

What is a ghosted block?
A ghosted block is a block that was minted but rejected by another node. There are multiple potential reasons for this. Here we are purely focusing on one specific scenario.

Scenario

Example - Expected Sequence:

Each pool that forges a block appends it to the chain, i.e. it builds on top of the previous block. In this example, everything works as expected.

| Slot | Forged Block | Previous block |
| --- | --- | --- |
| 45881000 | Block A (by Pool x) | some other |
| 45881004 | Block B (by Pool y) | Block A |
| 45881005 | | |
| 45881006 | Block C (by Pool z) | Block B |

Example - Ghosted block

Now let’s assume that in the above scenario the propagation of Block B (the time until other pools know about it) takes >2s. Pool z would not know about Block B when minting Block C. Both of those blocks get the same block number, and only one of them (in this case Block B) can survive. Block C is reverted because a block with that same number, minted in an earlier slot, already exists.
Note: This is the most commonly observed behavior. There are other scenarios that we are not considering in this article.

| Slot | Forged Block | Previous block | Result | Explanation |
| --- | --- | --- | --- | --- |
| 45881000 | Block A (by Pool x) | some other | adopted | |
| 45881004 | Block B (by Pool y) | Block A | adopted | This block propagates slowly (>2s) |
| 45881005 | | | | |
| 45881006 | Block C (by Pool z) | Block A | ghosted | Pool z does not know about Block B at this point in time and therefore used Block A as the previous block. |

As this example shows, the bad propagation time of Block B (by Pool y) causes Block C (by Pool z) to be ghosted.
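The scenario above can be replayed with a small sketch. Each pool builds on the newest block it has actually seen; the slot numbers, heights, and the 3s delay are illustrative values, not real chain data:

```python
# Sketch of the ghosting scenario: a pool builds on the newest block it has
# SEEN, not the newest block that exists. Values are illustrative only.
SLOT_B, SLOT_C = 45881004, 45881006
B_PROP_DELAY = 3  # seconds until other pools receive Block B (>2s)

block_a = {"name": "Block A", "slot": 45881000, "height": 100}

# Pool y builds Block B on top of Block A.
block_b = {"name": "Block B", "slot": SLOT_B, "height": block_a["height"] + 1}

# At slot 45881006 Pool z has not yet received B (45881004 + 3 > 45881006),
# so from its point of view the chain tip is still Block A.
b_is_visible = SLOT_B + B_PROP_DELAY <= SLOT_C
tip_for_z = block_b if b_is_visible else block_a
block_c = {"name": "Block C", "slot": SLOT_C,
           "height": tip_for_z["height"] + 1, "prev": tip_for_z["name"]}

# B and C now share block number 101; the block from the earlier slot wins,
# so Block C is ghosted.
ghosted = block_c["height"] == block_b["height"] and block_c["slot"] > block_b["slot"]
print(block_c["prev"], ghosted)  # Block A True
```

Setting `B_PROP_DELAY` to 1 instead makes Block B visible in time, and Block C is built on Block B as in the first table.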

Impact
One block fewer is forged. No rewards are generated for the delegators of Pool z for Block C, while the operator of Pool z cannot do anything about the propagation delays of Pool y.

For this reason, this issue requires education and a common effort to optimize propagation delays.

Ghosted blocks decrease network density. Currently, we expect ~1.3% of blocks to be ghosted. This number would increase drastically if density were increased for scalability reasons.

Reasons and Mitigations

| Reason | Description | Mitigations |
| --- | --- | --- |
| Bad propagation time | Too few IN connections on own relays cause latency through the additional hops needed to propagate across the world | Validate propagation delays; multiple relays; geo-distributed relays; topology updater; IN connections |
| Delayed forge | Low/blocked CPU. If the BP’s CPU is blocked at minting time, the forge is delayed, introducing latency | Validate forge timing; analyze missed slots; analyze slot timing |
| Bad timing | The system clock of the BP is not in sync and the block is minted at a wrong time (too early or too late) | Use chrony |

SPO Checklist to avoid high Propagation Delays
Time Synchronization

To avoid your blocks being minted at a wrong time

Improve Topology

To achieve the global distribution of your block as fast as possible, it’s important that the number of hops required to reach all pools is as low as possible. The following aspects can help:

  • Run multiple geo-distributed Relays
  • Have 20+ IN connections on each relay. Ask pools you know to add you as a peer manually in their topology. Having some custom connections is good practice anyway: if the topology updater were ever out of service, you would otherwise lose IN connections and with them the chance to propagate your blocks.
    NOTE: Block propagation is based on a pull mechanism, so IN connections mean that other pools are fetching blocks from your pool.
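For the manual peer entries mentioned above, a custom producer can be added to the (pre-P2P) topology.json alongside the entries the topology updater maintains. Hostname and port below are placeholders, not a real pool:

```json
{
  "Producers": [
    { "addr": "relay1.friendly-pool.example", "port": 3001, "valency": 1 }
  ]
}
```

The other pool then adds your relay in the same way, which gives you the reciprocal IN connection.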

Node Configuration

To avoid delays in forging your block

Validation

  • No missed slots during epoch (excluding epoch transition)
  • Propagation delays during Epoch < 1s
  • Propagation delays during rewards calculation (48h + 24h) < 2s
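One way to check your own numbers against these thresholds: on mainnet, a block’s scheduled wall-clock time can be derived from its slot number (1-second slots since Shelley). The offset constant below is an assumption you should verify against your own node; a negative result means a clock is running ahead somewhere:

```python
# Propagation delay = receipt time minus the slot's scheduled wall-clock time.
# ASSUMPTION: mainnet Shelley-era slots are 1s and map to Unix time via this
# offset; verify the constant against your own node before relying on it.
MAINNET_SLOT_TO_UNIX = 1_591_566_291

def propagation_delay(slot: int, received_unix: float) -> float:
    """Seconds between a block's scheduled slot time and when we received it.
    Negative values indicate a clock running ahead somewhere."""
    return received_unix - (slot + MAINNET_SLOT_TO_UNIX)

scheduled = 45881004 + MAINNET_SLOT_TO_UNIX
print(round(propagation_delay(45881004, scheduled + 0.8), 3))   # 0.8, fine (<1s)
print(round(propagation_delay(45881004, scheduled - 1.5), 3))   # -1.5, clock ahead
```

The same arithmetic is what shows up as “negative propagation times” in the PoolTool analysis further down.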

If you cannot reach those validation targets, you may consider improving your hardware and/or infrastructure. Highly overleveraged VPS providers will cause missed slots (when the CPU is blocked by someone else). For bad propagation times, improved CPU performance helps to reduce the latency until the block is available to be fetched.


To get this topic a little more tangible I took a look at the Epoch 303 Stats from PoolTool.
Please note that not all of the data is 100% representative, as pools with a low number of blocks could suffer from small sample sizes.

Propagation Time Analysis
PoolTool collects data about propagation latencies. The data comes from nodes that report to PoolTool, and it shows both a receiver and a producer perspective.

Producer data is available for every pool that minted at least one block. It also covers pools that do not report data themselves, because the corresponding numbers are collected from the pools receiving the block.

Receiver data is only available for pools that report data themselves.

For our analysis, we looked at the producer data because it is simply available for more pools. Please consider that data for small pools is less accurate because of the low sample number. On the other hand, the results of pools that generate a low number of blocks also matter less for overall network health.

So let’s dig into the different observations:

#1 Propagation times outside a ± 5 s window

In total, an average of 9.5% of reported propagation times are >5s. This means that by the time some receiving pool (which reports to PoolTool) gets the block, it is either >5s after the scheduled slot or the scheduled slot is still >5s in the future.

It’s not plausible that propagation really takes longer than 5s, as the average currently is ~750ms. The most probable assumption is that some of the receiving pools are either far off time or reporting invalid data to PoolTool.

#1a Pools with an above-average percentage of times outside a ±5 s window

The average of delays outside ±5s is 9.5%. For some pools, this metric is extraordinarily high. I set a tolerance of 10.5% on this report with the following results:

111 pools exceed the 10.5% tolerance in total:
61 between 10.5% and 11%
24 between 11% and 12%
10 between 12% and 15%
4 between 15% and 20%
7 between 20% and 50%
5 above 50%

Why those pools show higher numbers is unclear. I don’t see a specific reason why the number should be much higher for some blocks than for others. Therefore I assume that those pools are the ones partly causing the 9.5% of out-of-window times reported for other pools.

#2 Blocks which are distributed before they even are scheduled

About 3% of the analyzed pools distribute blocks ahead of time. This shows up as negative propagation times on PoolTool. The average across all pools is 0.2%. Please note that this early reception can also be caused by a wrong time on the receiver; that’s why we set a 1% threshold as acceptable for this measure. Concretely, the 3% includes all pools where the data shows at least 1% negative times, excluding times below -5s.

The only logical explanation for this phenomenon is a misconfigured clock on those 3% of pools, which are running ahead of time.

#3 Pools with high average propagation times

Some pools show high average propagation times, while the global average is ~750 ms. The analyzed data shows the following distribution of average times:

1689 pools (which were not filtered out above by anomalies):

491 (29.1%) < 500 ms average
916 (54.2%) 500–1000 ms average
184 (10.9%) 1000–1500 ms average
98 (5.8%) ≥ 1500 ms average

Bad average times can be caused by a low number of IN connections, which forces additional hops until the block has distributed across all nodes. They can also be caused by poor internet connections, e.g. for home-operated nodes.

On the other hand, they can also be caused by a hosting location that is not optimally connected to the rest of the world. This, by the way, is not really a bad thing, as decentralization also implies distributing nodes worldwide. Still, options like accelerated network connections could help in such scenarios.

Quick-Fix Chrony
Based on the analysis above, the most important fix still seems to be correct time synchronization.
Unfortunately, most setup guides do not treat chrony as a mandatory step or prerequisite.

If you have not done so already, please set up chrony. The CoinCashew guide gives instructions on how to do this:

I really think this should be added as a mandatory step to all guides!
