Problems with using VRF to decide fork battles

Terminada · 29 September 2023 05:40

I was prompted to write this post after reading the following twitter comments by @zygomeb (who, it seems, is a fan of Radix).

Twitter @zygomeb:

The problem with Cardano’s design is that it sacrifices consensus security to bring the fully liquid staking system.

Twitter @pizzadren:

Can you elaborate on how it sacrifices consensus security? Is it because they don’t have locking or slashing mechanisms?

Twitter @zygomeb:

Yes. We’ve already seen plenty of times what lack of slashing does to the consensus, like for example today SPOs complaining about etoro making them lose blocks due to height battles as their setup is bad and should be slashed (wrong time)

Twitter @PhilippeVieLong:

Right now the consensus layer cannot know if a pool has too high of a latency, otherwise there wouldn’t be these kind of slot battles. This might be fixable with Chronos, but I’m not sure.

Twitter @PhilippeVieLong:

Does radix have slashing? Under which circumstances?

I think this problem is absolutely fixable with some changes at the protocol level and these changes won’t require any slashing mechanism or locking.

Current cardano-node implementation:

We all know a chain is a sequence of blocks. A fork occurs when there are two possible candidate chains where the terminal blocks differ following a common ancestor block. When the current implementation of cardano-node sees two alternative chains it determines which chain it prefers as follows:

1. Prefer the longest chain.
2. If chains are equal in length then prefer the chain whose terminal block has the lowest “block VRF” value.

One important thing to note about rule 2 above is that the current implementation of cardano-node does not care whether the terminal blocks have the same slot number. It applies the lowest block VRF wins rule irrespective of the slot number of each terminal block. The chain selection rule takes into account neither the slot number nor any delay in receiving the chain (block).
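To make the rule concrete, here is a minimal Python sketch of the comparison as I have described it. It is not the cardano-node code: the `Tip` type, its field names and the `prefer_current` function are hypothetical names invented for illustration, and the VRF value is treated as a plain integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tip:
    """Terminal block of a candidate chain (hypothetical type for illustration)."""
    chain_length: int  # number of blocks in the candidate chain
    slot: int          # slot number the terminal block was produced for
    vrf: int           # "block VRF" value of the terminal block


def prefer_current(a: Tip, b: Tip) -> Tip:
    """Pick the preferred tip under the two rules described above."""
    # Rule 1: prefer the longest chain.
    if a.chain_length != b.chain_length:
        return a if a.chain_length > b.chain_length else b
    # Rule 2: equal length, so the lowest block VRF wins.
    # The slot numbers are never consulted.
    return a if a.vrf <= b.vrf else b


# A block for slot 51 loses to a block for the earlier slot 50
# purely because the earlier block has the lower VRF value.
newer = Tip(chain_length=100, slot=51, vrf=900)
older = Tip(chain_length=100, slot=50, vrf=100)
print(prefer_current(newer, older))
```

The point of the example is the last comparison: nothing about slot number or arrival time enters the decision.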

Centralisation forces

During the time of the incentivised testnet this chain selection rule was different. Instead of rule 2. above, the testnet node implementation preferred the chain that it already had. This in effect meant that the node preferred the chain it received first. This resulted in a race by stake pool operators to get their blocks out quickly in order to minimise the risk their block would be “orphaned”. Stake pool operators realised they could earn more rewards by moving their block producers to one particular data centre in Germany since this resulted in very fast block propagation because most of the rest of the cardano network of block producers were also in this same data centre. Obviously this is not what most people would consider as good for decentralisation.

Meaning of decentralisation

To properly achieve decentralisation we need not just decentralisation of cardano-node ownership but also physical decentralisation of where nodes are operated. For true decentralisation we need block producers operating across the world in lots of different countries that have lots of different legal and governmental rules / regulations. Decentralisation gives us the ability to arbitrage between different legal and governmental rules and also different physical locations. If a stupid Govt tries to limit freedom in one area then this won’t disrupt the cardano network meaningfully if we are properly decentralised. Maybe we even want some cardano block producers operating up in space via satellite links.

However, one problem with being physically decentralised is that different physical locations will have different block propagation delays.

Current problem of “fork battles” caused by network delays

Some pool operators are rightly complaining at present about stake pool operators which are causing unnecessary “fork battles” by having very large propagation delays. For example there is this twitter exchange.

Since the current implementation of cardano-node decides all equal height “fork battles” based upon the lowest block VRF wins rule, it seems that there is no penalty for poor block propagation. And, it is easy to think that we should punish the poor block producers by “slashing” them somehow if they have too much block propagation delay. However, we need to be careful to not recreate a centralisation problem.

Centralising forces with current implementation

Actually poor block propagation is currently penalised but many stake pool operators don’t appreciate this fact. Many stake pool operators also don’t realise that this penalty for poor block propagation causes a centralisation force.

My own stake pool block producer is housed in Australia, on the opposite side of the world from the majority of the cardano network. This means that my delays in sending and receiving blocks are often around 1 second and there is little I can do to shrink this further, particularly when the internet is loaded. You might think that 1 second is not much, and it isn’t. However this means that my pool will receive three times the number of “fork battles” of another pool housed within the majority. This is because slots in Cardano are only 1 second in length. This means that my pool in Australia might not receive the block for the previous slot in time, or my block may not be received by the next block producer in time. In other words, with just 1 second delay, my block could result in a “fork battle” with a block produced for the previous slot, the same slot, or the next slot. Whereas a block producer in the majority should only see “fork battles” with another block produced for the exact same slot. (This of course assumes that the vast majority of the cardano network is in USA and Europe where block propagation times between operators are less than 1 second.)

On Cardano, true “slot battles”, where the fork results from two pools producing a block for the same slot, occur 5% of the time. But the chance of another pool producing a block within 1 slot of my pool in Australia is 3 times this, or 15%. The net effect is that with 1 second delays to the majority, my block producer will suffer roughly 3 times the number of “fork battles” as a block producer in the majority. Since all fork battles are decided by the lowest block VRF wins rule, my pool will lose half of these battles, resulting in these blocks being orphaned. My pool will get roughly 7.5% of its blocks orphaned compared to only 2.5% of blocks orphaned for pools housed with the majority in USA / Europe. My pool will not receive rewards for these orphaned blocks but, more importantly, my pool will not receive a “fair” amount of control over the Cardano canonical chain as would be consistent with its level of controlled stake.
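The arithmetic above can be checked with a few lines of Python. This is only a sketch under stated assumptions: it treats the 5% figure as an independent per-slot chance that a competing block exists, and it assumes a coin-flip (50%) chance of losing each battle. The function name `orphan_rates` is made up for this example.

```python
def orphan_rates(p_block: float, exposed_slots: int, p_lose: float = 0.5) -> tuple[float, float]:
    """Return (simple, exact) estimates of the fraction of blocks orphaned.

    p_block       -- assumed chance that a competing block exists for one slot
    exposed_slots -- number of slots whose blocks can collide with ours
    p_lose        -- assumed chance of losing any one battle on the VRF comparison
    """
    # Simple estimate used in the post: just multiply by the number of slots.
    simple_battle = p_block * exposed_slots
    # Exact chance of at least one competing block across independent slots.
    exact_battle = 1.0 - (1.0 - p_block) ** exposed_slots
    return simple_battle * p_lose, exact_battle * p_lose


majority = orphan_rates(0.05, exposed_slots=1)   # same-slot battles only
australia = orphan_rates(0.05, exposed_slots=3)  # previous, same and next slot

print(f"majority:  simple {majority[0]:.1%}, exact {majority[1]:.1%}")
print(f"australia: simple {australia[0]:.1%}, exact {australia[1]:.1%}")
```

The simple estimate reproduces the 2.5% and 7.5% figures. Treating the three slots as independent gives a slightly lower figure of about 7.1% for the Australian pool, which does not change the conclusion.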

So what should my pool in Australia do? Well, I can move my block producer to a data centre in Germany so that my block propagation time is reduced to below 1 second. I could mitigate the problem by CENTRALISING my pool.

A simple attack

Another problem with the current implementation is that a malicious operator can easily exploit the deterministic “block VRF” selection rule to knock out blocks produced by other pools. Currently it is very easy for a node to look at the previous block’s VRF score and then determine if its own VRF will be lower. If it has a lower VRF then this block producer can choose to produce its block upon the second last block, thereby deliberately producing a fork that it knows it will win. This will result in the previous block producer’s block becoming orphaned or “knocked out” of the canonical chain. The previous block producer will then lose its rewards for this block. This will enable a malicious group of block producers to gain more control over the canonical chain relative to their controlled stake proportion. To quantify such an attack, I wrote a simple Haskell program to model it and determined that a malicious group with only 32% of the stake, combined with the ability to delay transmission to other nodes outside their group by 2 seconds (think Amazon), could achieve over 50% of the canonical blocks.
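The decision the malicious producer makes is tiny, which is what makes it easy. The following Python sketch shows only that decision, not the Haskell model mentioned above. The function name and parameters are hypothetical, and VRF values are treated as plain integers.

```python
def parent_height_for_new_block(tip_height: int, tip_vrf: int, my_vrf: int, malicious: bool) -> int:
    """Return the height of the block that a producer chooses to build upon.

    An honest producer always extends the current tip. A malicious producer
    builds on the second last block whenever its own VRF is lower, because
    the resulting fork has equal length and the lowest block VRF wins.
    """
    if malicious and my_vrf < tip_vrf:
        return tip_height - 1  # fork: our block competes with the current tip
    return tip_height          # extend the tip as normal


# The attacker's VRF (100) beats the tip's VRF (900), so it forks at height 99.
print(parent_height_for_new_block(tip_height=100, tip_vrf=900, my_vrf=100, malicious=True))
```

The new block sits at the same height as the block it is trying to replace, so rule 2 of the current chain selection decides the outcome, and the attacker already knows that outcome before producing the block.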

A solution

Two changes to the protocol would provide a better solution to these problems:

1. Increase the slot duration to equal the acceptable maximum network delay.

If the slot duration is say 4 seconds then this will provide 4 seconds for each block to be propagated before another block can possibly be produced. In other words, if every block producer can keep its propagation delays to less than 4 seconds, then it will only receive “fork battles” with other pools where the slot is identical. This will mean that there is no advantage in housing your pool in a European data centre vs somewhere in Africa, Australia, or even in space.

If the majority of people agree that proper decentralisation also means physical decentralisation, then we should determine what is an acceptable level of block propagation delay and set the slot duration to this value. I think something like 4 seconds sounds reasonable as this value is still less than IOG’s assumed propagation security guarantee limit of 5 seconds, which has been previously stated.

2. Use the block VRF to determine “fork battles” only when the slot number is identical.

In other words, change the chain selection rule to the following:

1. Prefer longest chain.
2. If both chains are equal length, and both terminal blocks are for the same slot number, then prefer the chain whose terminal block has the lowest VRF score. (A true “slot battle”.)
3. If both chains are equal length, and both terminal blocks are for different slot numbers, then prefer the chain where the terminal block is least delayed from the current slot number. (This will in effect prefer the block produced for the current slot number rather than one produced for an earlier slot which has been delayed significantly.)
4. Reject any blocks where the slot number is greater than the current slot number. (IE. Do not allow block producers to send blocks ahead of time.) (I believe cardano-node already rejects blocks that are sent early so this rule may already be part of the protocol.)
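As a rough illustration, the proposed rule could be sketched in Python like this. It is a literal reading of the rules listed above, evaluated at the moment two candidates are compared. It does not model when each block was first received, and the `Tip` type, its field names and `prefer_proposed` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tip:
    """Terminal block of a candidate chain (hypothetical type for illustration)."""
    chain_length: int  # number of blocks in the candidate chain
    slot: int          # slot number the terminal block was produced for
    vrf: int           # VRF score of the terminal block


def prefer_proposed(a: Tip, b: Tip, current_slot: int) -> Optional[Tip]:
    """Pick the preferred tip under the proposed rules, or None if both are rejected."""
    # Reject any block whose slot number is ahead of the current slot.
    valid = [tip for tip in (a, b) if tip.slot <= current_slot]
    if len(valid) < 2:
        return valid[0] if valid else None
    # Prefer the longest chain.
    if a.chain_length != b.chain_length:
        return a if a.chain_length > b.chain_length else b
    # Same slot: a true slot battle, so the lowest VRF score wins.
    if a.slot == b.slot:
        return a if a.vrf <= b.vrf else b
    # Different slots: prefer the tip least delayed from the current slot,
    # which is the one with the higher slot number.
    return a if a.slot > b.slot else b


on_time = Tip(chain_length=100, slot=51, vrf=900)
late = Tip(chain_length=100, slot=50, vrf=100)
print(prefer_proposed(on_time, late, current_slot=51))
```

Here the block for slot 51 is preferred even though the delayed block for slot 50 has the lower VRF score, which is the opposite of what the current rule would choose.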

The effect of both these changes would result in penalising a poorly run block producer if their block took longer than 4 seconds to propagate. However, it would not cause a centralising effect for well run block producers which maintain their block propagation delays to less than 4 seconds, wherever they are located in the world. These changes also negate the ability to undertake the malicious attack mentioned above.

I think it is very important to realise that when you house your block producer in a data centre run by a company like Amazon, you are in effect handing control of your block producer over to that centralised company. Amazon has the ability to selectively control the network delay to different machines in its data centre. Amazon also has the ability to monitor the network traffic and even access the data since it owns the hardware in its data centre. A major company like Amazon could easily choose to advantage a select group of block producers by delaying the propagation of competitor blocks so this select group can achieve a higher number of blocks in the canonical chain relative to their controlled stake.

HeptaSean · 29 September 2023 07:21

That is a bad example for a slashing use case. We cannot definitely prove on the protocol level if a pool is run poorly and decide to slash it.

In the case of eToro it is overwhelmingly likely since this goes on for months, but it’s nothing that I would bake or know how to bake into the protocol.

I was, however, wondering the same: How can a proof of stake protocol without slashing even make sense? The delegators do not have to care or learn if the pool they are delegating to is good, poorly run or outright an attacker. It’s totally risk-free to them. Is that really “good”?

But on the other hand, it’s surprisingly hard to find examples of cases that should unambiguously be slashed.

This prefers pools whose relays and block producers are close (in network topology) to the nodes of the other pools.

So, it might actually make running your nodes at Amazon more attractive.

Terminada · 29 September 2023 09:30

Not if the slot duration is increased to 4 seconds as I proposed.

If on the other hand, we change the chain selection algorithm to only use the block VRF to deterministically decide “fork battles” where the slot number is identical (ie. true “slot battles”), whilst we simultaneously leave the slot duration at 1 second, then you would be right. This combination would indeed produce a centralising force for everyone to house their block producer in a data centre in USA or Europe. Everyone would be needing to keep their block propagation times to less than 1 second otherwise they would suffer more orphaned blocks.

I contend that it is unrealistic to expect every block producer across the world, across all countries, and maybe even space, to be able to keep block propagation delays to less than 1 second. That is why the better solution is to do both my suggested changes:

Use the block VRF to deterministically settle fork battles only when the slot number is identical (true slot battle).
And
Increase the slot duration to the suggested 4 seconds.

Terminada · 29 September 2023 09:37

Well I suppose that depends on whether you are one of Amazon’s approved insiders. Do you happen to be friends with Jeff Bezos and share a dream of exerting control over Cardano to screw over everyone else?

Klobbinger · 29 September 2023 17:48

Is there any specific reasoning to have a 1 sec slot cadence? The problem with longer slot lengths might be that the maximum length between two blocks would increase. I don’t know if the VRF function could be tuned in a specific way to mitigate this.

The 32% attack vector is also mentioned in the paper. If the attacker can delay propagation of competitors, the threshold moves from 50% down to a minimum of 33%.

I like your suggestions, but the implications have to be analysed in detail. Best case in my opinion would be to somehow transition to fixed block times while maintaining private leader schedules.

@HeptaSean: Slashing is not needed in my opinion, but where possible malicious or sub optimal pool behavior should be punished by not rewarding the pool.

Terminada · 29 September 2023 21:28

The slot cadence is easily tuned. You can have slot duration equal to 4 seconds and still have blocks arrive on average every 20 seconds. And the blocks will still arrive randomly, causing the random periods of silence outlined in the Ouroboros Praos paper. In particular this quote:

In order to cope with the ∆-semisynchronous setting we introduce the concept of “empty slots” which occur with sufficient frequency to enable short periods of silence that facilitate synchronization. This feature of the protocol gives also its moniker, “Praos”, meaning “mellow”, or “gentle”.

Out of interest, note that some time in the past Cardano’s slot duration was calibrated at a regular cadence of 20 seconds.
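The tuning mentioned at the start of this reply (4 second slots with a block every 20 seconds on average) comes down to one division. The Python sketch below is a simplification that treats every slot as independently having the same chance of containing a block. The function names are made up for this example.

```python
def active_slot_fraction(slot_seconds: float, mean_block_seconds: float) -> float:
    """Fraction of slots that must contain a block to hit the target average."""
    return slot_seconds / mean_block_seconds


def probability_of_silence(fraction: float, empty_slots: int) -> float:
    """Chance that the next `empty_slots` slots are all empty (independent slots assumed)."""
    return (1.0 - fraction) ** empty_slots


today = active_slot_fraction(slot_seconds=1, mean_block_seconds=20)     # 1 second slots
proposed = active_slot_fraction(slot_seconds=4, mean_block_seconds=20)  # 4 second slots

print(f"1 s slots: {today:.0%} of slots have a block")
print(f"4 s slots: {proposed:.0%} of slots have a block")
print(f"chance the next 4 s slot is silent: {probability_of_silence(proposed, 1):.0%}")
```

With 1 second slots about 5% of slots contain a block, and with 4 second slots about 20% do, so the average block time stays at 20 seconds while most slots are still silent.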

Klobbinger · 30 September 2023 13:38

The regular 20s cadence was before the introduction of vrf though, wasn’t it?

Tbh, I never fully understood why these periods of silence need to be empty slots. Why can’t there just be one slot every 20s? Or does Ouroboros need the slots as a means to actually facilitate synchronization?

Terminada · 30 September 2023 21:34

Yes.

That is not clear to me either. I am not sure the periods of silence need to be random in length. @Michael.Liesenfelt argues for a regular 20 second block cadence, and this would regularly provide at least 15 seconds of silence after the point at which it is reasonable to expect that everyone has received the last block.

The concern I have about the random nature of the silence is that in cryptology randomness tends to make things harder for attackers in non-obvious ways. For example, if the block cadence was 20 seconds then what is the incentive to get your block out quickly? You wouldn’t care if it arrived after 15 or even 19 seconds. Maybe an attacker can exploit this fact somehow? But what if you have only 4 seconds to get your block to the majority, and will get penalised if another block arrives in the next slot (with slots == 4 seconds) because other nodes would then prefer the next block over yours? Furthermore, you don’t know if another block will arrive in this next slot as it could be a silent slot, or not. I think you need a cryptology expert to properly analyse these pros and cons.

Note: The suggested changes I listed above do still provide periods of random length silence between blocks.

Terminada · 2 October 2023 02:33

As soon as you use punishment which requires locking then this incentivises the market to come up with various liquid staking derivatives (LSDs). After all, we do still want to be able to put our capital to work.

There is lots of discussion around how LSDs cause centralisation because of the tendency to cause a winner takes most outcome since the market will prefer the most liquid derivative. We see this happening with Lido on Ethereum.

However I don’t see much discussion about the tax implications for these liquid staking derivatives. I think the way the tax office is likely to see things is that you traded your Ethereum for stETH, or LidoETH, or whatever, and this has capital gains tax implications. If every stake, re-stake, or unstake has capital gains tax implications, then that is a pretty big deal breaker.

This is another reason why designs that involve slashing result in unintended consequences.
