Ghosted Blocks - How to optimize Propagation Times - Action Required

Introduction
A couple of SPOs in the xSPO alliance have been analyzing several ghosted-block scenarios. This post intends to raise awareness about the connection between ghosted blocks and block propagation times and to put focus on the need for optimizing propagation times.

Ghosted blocks can be caused by suboptimal pool setups and penalize all operators, making them lose blocks. All operators are urged to take a look at their pool’s propagation times and take optimization steps.

A short explanation of a typical ghosted block scenario:
Pool A mints a block.
Then Pool B mints a block in the next slot.
Because the blocks are in directly subsequent slots, Pool A’s block is not yet known to Pool B at this time.
Pool B’s block gets the same block number as Pool A’s and will be ghosted.

This means Pool B can be impacted by the propagation delays of Pool A.
More details on a scenario and mitigation steps are found in the next post.

Network stability and IOG view
While IOG considers a 5s propagation time an acceptable limit, it’s still valuable to optimize propagation times so that we run into this limit as late as possible. I understand this limit as the maximum that can still be considered reliable network operation. My assumption is that a 5s average propagation time would cause ~25% ghosted blocks.
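As a rough sanity check on that assumption, here is a hedged back-of-envelope estimate. On mainnet, roughly 5% of the 1-second slots produce a block (active slot coefficient f = 0.05). If a block needs d seconds to propagate, any block scheduled during those d slots is minted without knowledge of it. This simplified model ignores multi-way battles and assumes every pool sees the block after exactly d seconds:

```python
# Back-of-envelope estimate: probability that at least one other block is
# scheduled while ours is still propagating. Assumes mainnet's active slot
# coefficient f = 0.05 and 1-second slots; a simplification, not a measurement.
f = 0.05  # chance that any given slot produces a block

def ghost_risk(prop_delay_slots: int) -> float:
    """P(at least one block is minted in the next `prop_delay_slots` slots)."""
    return 1 - (1 - f) ** prop_delay_slots

print(f"{ghost_risk(5):.1%}")  # ~23% for a 5s propagation time
print(f"{ghost_risk(1):.1%}")  # ~5% for a 1s propagation time
```

A 5-second delay lands near 23% in this model, which is in the same ballpark as the ~25% assumption above.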

Current IOG Blog Article: Slow and steady wins the race: network evolution for network growth - IOHK Blog

The article describes that the block size will be increased in small steps, which will further increase propagation times. This makes the topic even more important.


The previous post was just the introduction. Let’s get into some more details:

Delimitation
We are not talking about stolen blocks here. Stolen blocks are part of the Cardano/Ouroboros design and happen constantly. They are nothing to be concerned about, as they are not increasing over time.

What is a ghosted block?
A ghosted block is a block that was minted but rejected by another node. There are multiple potential reasons for this. Here we are purely focusing on one specific scenario.

Scenario

Example - Expected Sequence:

Each pool that forges a block appends it to the chain, i.e. it builds on top of the previous block. In this example, everything works as expected.

| Slot | Forged Block | Previous block |
| --- | --- | --- |
| 45881000 | Block A (by Pool x) | some other |
| 45881004 | Block B (by Pool y) | Block A |
| 45881005 | | |
| 45881006 | Block C (by Pool z) | Block B |

Example - Ghosted block

Now let’s assume that in the above scenario the propagation of Block B (the time until other pools know about it) takes >2s. Pool z would not know about Block B when minting Block C. Both of those blocks get the same block number, and only one of them (in this case Block B) can survive. Block C is reverted because a block with that same number, minted in an earlier slot, already exists.
Note: This is the most commonly observed behavior. There are other scenarios that we are not considering in this article.

| Slot | Forged Block | Previous block | Result | Explanation |
| --- | --- | --- | --- | --- |
| 45881000 | Block A (by Pool x) | some other | adopted | |
| 45881004 | Block B (by Pool y) | Block A | adopted | This block propagates slowly (>2s) |
| 45881005 | | | | |
| 45881006 | Block C (by Pool z) | Block A | ghosted | Pool z does not know about Block B at this point in time and therefore used Block A as the previous block. |

As this example shows, the bad propagation time of Block B (by Pool y) causes Block C (by Pool z) to be ghosted.
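The scenario above can be replayed with a small sketch. Each pool builds on the newest block it has actually seen; the slot numbers, heights, and the 3s delay are illustrative values, not real chain data:

```python
# Sketch of the ghosting scenario: a pool builds on the newest block it has
# SEEN, not the newest block that exists. Values are illustrative only.
SLOT_B, SLOT_C = 45881004, 45881006
B_PROP_DELAY = 3  # seconds until other pools receive Block B (>2s)

block_a = {"name": "Block A", "slot": 45881000, "height": 100}

# Pool y builds Block B on top of Block A.
block_b = {"name": "Block B", "slot": SLOT_B, "height": block_a["height"] + 1}

# At slot 45881006 Pool z has not yet received B (45881004 + 3 > 45881006),
# so from its point of view the chain tip is still Block A.
b_is_visible = SLOT_B + B_PROP_DELAY <= SLOT_C
tip_for_z = block_b if b_is_visible else block_a
block_c = {"name": "Block C", "slot": SLOT_C,
           "height": tip_for_z["height"] + 1, "prev": tip_for_z["name"]}

# B and C now share block number 101; the block from the earlier slot wins,
# so Block C is ghosted.
ghosted = block_c["height"] == block_b["height"] and block_c["slot"] > block_b["slot"]
print(block_c["prev"], ghosted)  # Block A True
```

Setting `B_PROP_DELAY` to 1 instead makes Block B visible in time, and Block C is built on Block B as in the first table.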

Impact
One block fewer is forged. No rewards are generated for the delegators of Pool z for Block C, while the operator of Pool z cannot do anything about the propagation delays of Pool y.

For this reason, this issue requires education and a common effort to optimize propagation delays.

Ghosted blocks decrease network density. Currently, we expect ~1.3% of blocks to be ghosted. This number would increase drastically if density were increased for scalability reasons.

Reasons and Mitigations

| Reason | Description | Mitigations |
| --- | --- | --- |
| Bad propagation time | Too few IN connections on own relays cause latency through the additional hops needed to propagate across the world | Validate propagation delays; multiple relays; geo-distributed relays; topology updater; IN connections |
| Delayed forge | Low/blocked CPU. If the BP’s CPU is blocked at minting time, the forge is delayed, introducing latency | Validate forge timing; analyze missed slots; analyze slot timing |
| Bad timing | The system clock of the BP is not in sync and the block is minted at a wrong time (too early or too late) | Use chrony |

SPO Checklist to avoid high Propagation Delays
Time Synchronization

To avoid your blocks being minted at a wrong time

Improve Topology

To achieve the global distribution of your block as fast as possible, it’s important that the number of hops required to reach all pools is as low as possible. The following aspects can help:

  • Run multiple geo-distributed Relays
  • Have 20+ IN connections on each relay. Ask pools you know to add you as a peer manually in their topology. Having some custom connections is good practice anyway: if the topology updater were ever out of service, you would otherwise lose IN connections and with them the chance to propagate your blocks.
    NOTE: Block propagation is based on a pull mechanism, so IN connections mean that other pools are fetching blocks from your pool.
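For the manual peer entries mentioned above, a custom producer can be added to the (pre-P2P) topology.json alongside the entries the topology updater maintains. Hostname and port below are placeholders, not a real pool:

```json
{
  "Producers": [
    { "addr": "relay1.friendly-pool.example", "port": 3001, "valency": 1 }
  ]
}
```

The other pool then adds your relay in the same way, which gives you the reciprocal IN connection.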

Node Configuration

To avoid delays in forging your block

Validation

  • No missed slots during epoch (excluding epoch transition)
  • Propagation delays during Epoch < 1s
  • Propagation delays during rewards calculation (48h + 24h) < 2s
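One way to check your own numbers against these thresholds: on mainnet, a block’s scheduled wall-clock time can be derived from its slot number (1-second slots since Shelley). The offset constant below is an assumption you should verify against your own node; a negative result means a clock is running ahead somewhere:

```python
# Propagation delay = receipt time minus the slot's scheduled wall-clock time.
# ASSUMPTION: mainnet Shelley-era slots are 1s and map to Unix time via this
# offset; verify the constant against your own node before relying on it.
MAINNET_SLOT_TO_UNIX = 1_591_566_291

def propagation_delay(slot: int, received_unix: float) -> float:
    """Seconds between a block's scheduled slot time and when we received it.
    Negative values indicate a clock running ahead somewhere."""
    return received_unix - (slot + MAINNET_SLOT_TO_UNIX)

scheduled = 45881004 + MAINNET_SLOT_TO_UNIX
print(round(propagation_delay(45881004, scheduled + 0.8), 3))   # 0.8, fine (<1s)
print(round(propagation_delay(45881004, scheduled - 1.5), 3))   # -1.5, clock ahead
```

The same arithmetic is what shows up as “negative propagation times” in the PoolTool analysis further down.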

If you cannot reach those validation targets, you may consider improving your hardware and/or infrastructure. Highly overleveraged VPS providers will cause missed slots (when the CPU is blocked by someone else). For bad propagation times, improved CPU performance helps to reduce the latency until the block is available to be fetched.


To get this topic a little more tangible I took a look at the Epoch 303 Stats from PoolTool.
Please note that not all of the data is 100% representative, as pools with a low number of blocks could suffer from small sample sizes.

Propagation Time Analysis
PoolTool collects data about propagation latencies. The data comes from nodes that report to PoolTool, and it shows both a receiver and a producer perspective.

Producer data is available for every pool that minted at least one block. It also covers pools that do not report data themselves, because the corresponding numbers are collected from the pools receiving the block.

Receiver data is only available for pools that report data themselves.

For our analysis, we looked at the producer data because it is simply available for more pools. Please consider that data for small pools is less accurate because of the low sample number. On the other hand, the results of pools that generate a low number of blocks also matter less for overall network health.

So let’s dig into the different observations:

#1 Propagation times outside a ± 5 s window

In total, an average of 9.5% of reported propagation times are >5s. This means that by the time some receiving pool (which reports to PoolTool) gets the block, it is either >5s after the scheduled slot or the scheduled slot is still >5s in the future.

It’s not plausible that propagation really takes longer than 5s, as the average currently is ~750ms. The most probable assumption is that some of the receiving pools are either far off time or reporting invalid data to PoolTool.

#1a Pools with an above-average percentage of times outside a ±5 s window

The average of delays outside ±5s is 9.5%. For some pools, this metric is extraordinarily high. I set a tolerance of 10.5% on this report with the following results:

111 pools exceed the 10.5% tolerance in total:
61 between 10.5% and 11%
24 between 11% and 12%
10 between 12% and 15%
4 between 15% and 20%
7 between 20% and 50%
5 above 50%

Why those pools show higher numbers is unclear. I don’t see a specific reason why the number should be much higher for some blocks than for others. Therefore I assume that those pools are the ones partly causing the 9.5% of out-of-window times reported for other pools.

#2 Blocks which are distributed before they even are scheduled

About 3% of the analyzed pools distribute blocks ahead of time. This shows up as negative propagation times on PoolTool. The average across all pools is 0.2%. Please note that this early reception can also be caused by a wrong time on the receiver; that’s why we set a 1% threshold as acceptable for this measure. Concretely, the 3% includes all pools where the data shows at least 1% negative times, excluding times below -5s.

The only logical explanation for this phenomenon is a misconfigured clock on those 3% of pools, which are running ahead of time.

#3 Pools with high average propagation times

Some pools show high average propagation times, while the global average is ~750 ms. The analyzed data shows the following distribution of average times:

1689 pools (which were not filtered out above by anomalies):

491 (29.1%) < 500 ms average
916 (54.2%) 500–1000 ms average
184 (10.9%) 1000–1500 ms average
98 (5.8%) ≥ 1500 ms average

Bad average times can be caused by a low number of IN connections, which forces additional hops until the block has distributed across all nodes. They can also be caused by poor internet connections, e.g. for home-operated nodes.

On the other hand, they can also be caused by a hosting location that is not optimally connected to the rest of the world. This, by the way, is not really a bad thing, as decentralization also implies distributing nodes worldwide. Still, options like accelerated network connections could help in such scenarios.

Quick-Fix Chrony
Based on the analysis above, the most important fix still seems to be correct time synchronization.
Unfortunately, most setup guides do not treat chrony as a mandatory step or prerequisite.

If you have not done so already, please set up chrony. The CoinCashew guide gives instructions on how to do this:

I really think this should be added as a mandatory step to all guides!
