I was prompted to write this post after reading the following Twitter comments by @zygomeb (who, it seems, is a fan of Radix).
Twitter @zygomeb:
The problem with Cardano’s design is that it sacrifices consensus security to bring the fully liquid staking system.
Twitter @pizzadren:
Can you elaborate on how it sacrifices consensus security? Is it because they don’t have locking or slashing mechanisms?
Twitter @zygomeb:
Yes. We’ve already seen plenty of times what lack of slashing does to the consensus, like for example today SPOs complaining about etoro making them lose blocks due to height battles as their setup is bad and should be slashed (wrong time)
Twitter @PhilippeVieLong:
Right now the consensus layer cannot know if a pool has too high of a latency, otherwise there wouldn’t be these kind of slot battles. This might be fixable with Chronos, but I’m not sure.
Twitter @PhilippeVieLong:
Does radix have slashing? Under which circumstances?
I think this problem is absolutely fixable with some changes at the protocol level and these changes won’t require any slashing mechanism or locking.
Current cardano-node implementation:
We all know a chain is a sequence of blocks. A fork occurs when there are two possible candidate chains where the terminal blocks differ following a common ancestor block. When the current implementation of cardano-node sees two alternative chains it determines which chain it prefers as follows:
- Prefer the longest chain.
- If chains are equal in length then prefer the chain whose terminal block has the lowest “block VRF” value.
One important thing to note about the second rule above is that the current implementation of cardano-node does not care whether the terminal blocks have the same slot number. It applies the lowest block VRF wins rule irrespective of the slot number of each terminal block. The chain selection rule takes into account neither the slot number nor any delay in receiving the chain (block).
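The rule described above can be sketched in a few lines of Python. This is only an illustration of the two rules as stated in this post, not the actual cardano-node code: the `Tip` type, its field names, and the function `prefer_current` are hypothetical, and the block VRF is treated as a plain integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tip:
    """Terminal block of a candidate chain (hypothetical, simplified)."""
    chain_length: int  # number of blocks in the candidate chain
    slot: int          # slot number the terminal block was produced for
    vrf: int           # block VRF value, modelled here as a plain integer


def prefer_current(a: Tip, b: Tip) -> Tip:
    """Current rule as described above: longest chain, then lowest block VRF."""
    if a.chain_length != b.chain_length:
        return a if a.chain_length > b.chain_length else b
    # Equal length: the slot numbers are never consulted.
    return a if a.vrf <= b.vrf else b


# A block for slot 41 that arrived late still beats the block for slot 42,
# purely because its VRF value is lower.
late = Tip(chain_length=100, slot=41, vrf=5)
on_time = Tip(chain_length=100, slot=42, vrf=9)
print(prefer_current(late, on_time).slot)
```

Note that `slot` is carried in the data but plays no part in the decision, which is exactly the point made above.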
Centralisation forces
During the incentivised testnet this chain selection rule was different. Instead of the second rule above, the testnet node implementation preferred the chain that it already had. In effect, the node preferred the chain it received first. This resulted in a race by stake pool operators to get their blocks out quickly in order to minimise the risk that their block would be “orphaned”. Stake pool operators realised they could earn more rewards by moving their block producers to one particular data centre in Germany, since this gave very fast block propagation: most of the rest of the cardano network of block producers were also in this same data centre. Obviously this is not what most people would consider good for decentralisation.
Meaning of decentralisation
To properly achieve decentralisation we need not just decentralisation of cardano-node ownership but also physical decentralisation of where nodes are operated. For true decentralisation we need block producers operating across the world, in lots of different countries with lots of different legal and governmental rules and regulations. Decentralisation gives us the ability to arbitrage between different legal and governmental rules and also different physical locations. If a stupid government tries to limit freedom in one area, this won’t disrupt the cardano network meaningfully if we are properly decentralised. Maybe we even want some cardano block producers operating up in space via satellite links.
However, one problem with being physically decentralised is that different physical locations will have different block propagation delays.
Current problem of “fork battles” caused by network delays
Some pool operators are rightly complaining at present about stake pool operators which are causing unnecessary “fork battles” by having very large propagation delays. For example there is this twitter exchange.
Since the current implementation of cardano-node decides all equal-height “fork battles” using the lowest block VRF wins rule, it seems that there is no penalty for poor block propagation. It is then easy to think that we should punish poor block producers by “slashing” them somehow if they have too much block propagation delay. However, we need to be careful not to recreate a centralisation problem.
Centralising forces with current implementation
Actually poor block propagation is currently penalised but many stake pool operators don’t appreciate this fact. Many stake pool operators also don’t realise that this penalty for poor block propagation causes a centralisation force.
My own stake pool block producer is housed in Australia, on the opposite side of the world from the majority of the cardano network. This means that my delays in sending and receiving blocks are often around 1 second, and there is little I can do to shrink this further, particularly when the internet is loaded. You might think that 1 second is not much, and it isn’t. However, it means that my pool will suffer three times the number of “fork battles” of another pool housed within the majority. This is because slots in Cardano are only 1 second in length. My pool in Australia might not receive the block for the previous slot in time, or my block might not be received by the next block producer in time. In other words, with just 1 second delay, my block could end up in a “fork battle” with a block produced for the previous slot, the same slot, or the next slot, whereas a block producer in the majority should only see “fork battles” with another block produced for the exact same slot. (This of course assumes that the vast majority of the cardano network is in USA and Europe, where block propagation times between operators are less than 1 second.)
On Cardano, true “slot battles”, where the fork results from two pools producing a block for the same slot, occur about 5% of the time. But the chance of another pool producing a block within 1 slot of my pool in Australia is 3 times this, or roughly 15%. The net effect is that, with 1 second delays to the majority, my block producer will suffer roughly 3 times the number of “fork battles” as a block producer in the majority. Since all fork battles are decided by the lowest block VRF wins rule, my pool will lose about half of these battles, and those blocks will be orphaned. My pool will get roughly 7.5% of its blocks orphaned, compared to only 2.5% for pools housed with the majority in USA / Europe. My pool will not receive rewards for these orphaned blocks but, more importantly, it will not receive a “fair” amount of control over the Cardano canonical chain consistent with its level of controlled stake.
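The arithmetic above can be checked with a short Python sketch. The 5% figure and the "lose half of all battles" assumption come from this post; the function name `orphan_rate` and its parameters are made up for illustration. The `exact` option is an extra refinement, not something claimed above: it compounds the per-slot probability instead of simply multiplying it, which gives a slightly lower number than the round 15%.

```python
# Chance that some other pool produces a block in any given slot
# (the 5% figure used in the text).
ACTIVE_SLOT_PROB = 0.05


def orphan_rate(conflict_slots: int,
                p: float = ACTIVE_SLOT_PROB,
                exact: bool = False) -> float:
    """Expected fraction of a pool's blocks that are orphaned.

    conflict_slots: how many slots can hold a competing block
                    (1 for a well connected pool, 3 for a pool ~1 second away).
    exact:          compound the per-slot probability rather than multiply it.
    """
    if exact:
        battle = 1 - (1 - p) ** conflict_slots
    else:
        battle = conflict_slots * p
    # Lowest VRF wins, so on average half of all battles are lost.
    return battle / 2


print(f"majority pool:     {orphan_rate(1):.2%}")
print(f"1 second away:     {orphan_rate(3):.2%}")
print(f"1 second, exact:   {orphan_rate(3, exact=True):.2%}")
```

Either way, the remote pool loses close to three times as many blocks as a pool sitting with the majority.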
So what should my pool in Australia do? Well, I can move my block producer to a data centre in Germany so that my block propagation time is reduced to below 1 second. I could mitigate the problem by CENTRALISING my pool.
A simple attack
Another problem with the current implementation is that a malicious operator can easily exploit the deterministic “block VRF” selection rule to knock out blocks produced by other pools. Currently it is very easy for a node to look at the previous block’s VRF score and determine whether its own VRF will be lower. If it is lower, this block producer can choose to build its block upon the second-last block, thereby deliberately producing a fork that it knows it will win. The previous block producer’s block then becomes orphaned, or “knocked out” of the canonical chain, and that producer loses its rewards for the block. This enables a malicious group of block producers to gain more control over the canonical chain than their controlled stake proportion warrants. To quantify such an attack, I wrote a simple Haskell program to model it and determined that a malicious group with only 32% of the stake, combined with the ability to delay transmission to other nodes outside their group by 2 seconds (think Amazon), could achieve over 50% of the canonical blocks.
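The decision the attacker makes is tiny, which is part of the problem. The Python sketch below shows only that decision and is not the Haskell model mentioned above. It assumes a chain represented as a list of block VRF values (oldest first), and `choose_parent` is a hypothetical name.

```python
from typing import Sequence


def choose_parent(chain_vrfs: Sequence[int], my_vrf: int) -> int:
    """Return the index of the block a malicious producer would build on.

    chain_vrfs: block VRF values of the current chain, oldest first.
    my_vrf:     the attacker's own block VRF value for its slot.
    """
    tip = len(chain_vrfs) - 1
    # If our VRF beats the tip's VRF, building on the second-last block
    # creates an equal-length fork that the lowest-VRF rule resolves in
    # our favour, orphaning the current tip.
    if len(chain_vrfs) >= 2 and my_vrf < chain_vrfs[tip]:
        return tip - 1
    # Otherwise behave honestly and extend the tip.
    return tip


chain = [7, 3, 9]
print(choose_parent(chain, my_vrf=4))   # builds on index 1, orphaning the tip
print(choose_parent(chain, my_vrf=12))  # builds on index 2, the tip
```

An honest producer would always return `tip`; the attacker deviates only when it already knows it will win.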
A solution
Two changes to the protocol would provide a better solution to these problems:
1. Increase the slot duration to equal the acceptable maximum network delay.
If the slot duration is, say, 4 seconds then this will provide 4 seconds for each block to be propagated before another block can possibly be produced. In other words, if every block producer can keep its propagation delays to less than 4 seconds, then it will only see “fork battles” with other pools where the slot is identical. This will mean that there is no advantage in housing your pool in a European data centre vs somewhere in Africa, Australia, or even in space.
If the majority of people agree that proper decentralisation also means physical decentralisation, then we should determine what is an acceptable level of block propagation delay and set the slot duration to this value. I think something like 4 seconds sounds reasonable as this value is still less than IOG’s assumed propagation security guarantee limit of 5 seconds, which has been previously stated.
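To make the effect of slot duration concrete, here is a rough Python sketch of the simplified model used in this post: a delay of about one full slot exposes a block to the previous, same, and next slot, while a delay shorter than one slot exposes it to the same slot only. The function `conflict_slots` and its formula are my own shorthand for that model and are an assumption, not anything taken from the protocol.

```python
import math


def conflict_slots(delay_s: float, slot_length_s: float) -> int:
    """Number of slots whose blocks can collide with ours (simplified model).

    Every full slot of propagation delay exposes one extra slot on each
    side (earlier and later), on top of the block's own slot.
    """
    neighbours = math.floor(delay_s / slot_length_s)
    return 1 + 2 * neighbours


# Today: 1 second slots, pool roughly 1 second away from the majority.
print(conflict_slots(delay_s=1.0, slot_length_s=1.0))
# Proposed: 4 second slots, same pool.
print(conflict_slots(delay_s=1.0, slot_length_s=4.0))
```

Under this model, any pool that keeps its delay below the slot duration is on the same footing as a pool in the same data centre as the majority.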
2. Use the block VRF to determine “fork battles” only when the slot number is identical.
In other words, change the chain selection rule to the following:
- Prefer longest chain.
- If both chains are equal length, and both terminal blocks are for the same slot number, then prefer the chain whose terminal block has the lowest VRF score. (A true “slot battle”.)
- If both chains are equal length, and both terminal blocks are for different slot numbers, then prefer the chain where the terminal block is least delayed from the current slot number. (This will in effect prefer the block produced for the current slot number rather than one produced for an earlier slot which has been delayed significantly.)
- Reject any blocks where the slot number is greater than the current slot number, i.e. do not allow block producers to send blocks ahead of time. (I believe cardano-node already rejects blocks that are sent early, so this rule may already be part of the protocol.)
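The four rules above can be sketched in Python as follows. As before, this is a minimal illustration of the proposal and not cardano-node code: `Tip`, its fields, and `prefer_proposed` are hypothetical names, and the VRF is modelled as a plain integer. "Least delayed from the current slot" is taken to mean the terminal block with the higher slot number, since future slots have already been rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tip:
    """Terminal block of a candidate chain (hypothetical, simplified)."""
    chain_length: int  # number of blocks in the candidate chain
    slot: int          # slot number the terminal block was produced for
    vrf: int           # block VRF value, modelled here as a plain integer


def prefer_proposed(a: Tip, b: Tip, current_slot: int) -> Optional[Tip]:
    """Proposed rule: length, then VRF only for same-slot battles,
    otherwise the block closest to the current slot."""
    # Reject blocks claiming a slot that has not happened yet.
    valid = [t for t in (a, b) if t.slot <= current_slot]
    if not valid:
        return None
    if len(valid) == 1:
        return valid[0]
    a, b = valid
    # Prefer the longest chain.
    if a.chain_length != b.chain_length:
        return a if a.chain_length > b.chain_length else b
    # True slot battle: lowest VRF wins.
    if a.slot == b.slot:
        return a if a.vrf <= b.vrf else b
    # Different slots: prefer the block least delayed from the current slot.
    return a if a.slot > b.slot else b


late = Tip(chain_length=100, slot=41, vrf=5)
on_time = Tip(chain_length=100, slot=42, vrf=9)
print(prefer_proposed(late, on_time, current_slot=42).slot)
```

With the same two tips, the current rule would pick the late block for slot 41 because of its lower VRF; here the on-time block for slot 42 wins, and the VRF only matters when both blocks are for the same slot.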
Together, these changes would penalise a poorly run block producer if its block took longer than 4 seconds to propagate. However, they would not create a centralising force for well run block producers that keep their block propagation delays below 4 seconds, wherever they are located in the world. These changes also remove the ability to carry out the malicious attack described above.
I think it is very important to realise that when you house your block producer in a data centre run by a company like Amazon, you are in effect handing control of your block producer over to that centralised company. Amazon has the ability to selectively control the network delay to different machines in its data centre. Amazon also has the ability to monitor the network traffic and even access the data, since it owns the hardware in its data centre. A major company like Amazon could easily choose to advantage a select group of block producers by delaying the propagation of competitor blocks, so that this select group achieves a higher number of blocks in the canonical chain relative to their controlled stake.