Your P2P experience

Hello there o/

I’ve tried to configure my BP + 1 relay in P2P mode, and left another relay in standard topology mode.

→ I found out the connection between my BP (P2P) and my relay (non-P2P) was dropping sometimes, and not coming back until I restarted the cardano-node process on the relay.

Then I reverted my BP to non-P2P mode and left just one relay in P2P mode.

→ The connection between my BP (non-P2P) and my relay (P2P) drops sometimes as well.

So I’m wondering if this “hybrid” mode is really a good idea? Isn’t it better to go full P2P or full non-P2P? Any experience from other stake pool operators?

What version of the cardano node are you running?

Latest 1.35.7

I think there was a little “bug” in 1.35.6 where you lost the connection between the BP and a P2P relay, but this is fixed in 1.35.7.

I had the same issues but, as I said, it’s fixed now after I updated to 1.35.7…

Anyway, I think right now it’s recommended to have only one relay running with P2P.

Yes, I know, but maybe there are SPOs out there who are running full P2P :)

Well, for the moment I’m staying with just one relay in P2P, and my BP + the other relay in standard topology mode.

Maybe @ChrisSTR8 can help you out, as he’s running all his relays in P2P.

The block producer should not be run in P2P mode, as P2P mode has no support for failover whatsoever. I also see no urgent reason (beyond using the newer network stack) to run the BP in P2P mode; it should only talk to its own relays anyway. Personally, I only tested the block producer in non-P2P mode because of this.

Especially in mixed mode, when non-P2P and P2P nodes are used, you want node 1.35.7 on the non-P2P nodes. From my testing, even with 1.35.6 in mixed mode, if the BP loses the connection to the P2P relay and the second, non-P2P relay also drops out for whatever reason, the BP will re-connect to the P2P relay and hold on for dear life. So even with 1.35.6, I never saw any serious risk for block propagation. In all my early usage of P2P on mainnet, I have not lost a single block out of hundreds of blocks by now.

For a number of epochs now, I have successfully run my BP in non-P2P mode and my two main relays in P2P mode. Outside of epoch switches, the connection between the BP and the P2P relays is very stable; if the connection to one P2P relay drops, it recovers very quickly.

It is not officially recommended yet to run all relays in P2P mode, as IOG intends to do a gradual roll-out. The recommended setup is the BP and one relay in non-P2P mode; if you have an extra relay, run that one in P2P mode.
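For context, the P2P relay in that setup uses the new topology file format. A minimal sketch (the addresses and ports here are placeholders, not real hosts; check the exact schema for your node version):

```json
{
  "localRoots": [
    {
      "accessPoints": [
        { "address": "bp.example.pool", "port": 6000 }
      ],
      "advertise": false,
      "valency": 1
    }
  ],
  "publicRoots": [
    {
      "accessPoints": [
        { "address": "relays-new.cardano-mainnet.iohk.io", "port": 3001 }
      ],
      "advertise": false
    }
  ],
  "useLedgerAfterSlot": 0
}
```

The `localRoots` entry keeps the relay pinned to your own BP (never advertised to peers), while `publicRoots` plus the ledger-based peer discovery fill in the rest of the connections.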

In my testing, propagation has never been better than with P2P; I have not seen one single race condition since toggling on P2P on my relays. Race conditions are rather rare anyway, so I might still see some of course, but the metrics look enough better to me and are enough reason to keep P2P engaged on my primary relays for good. There is no going back for me for the main relays.

P2P mode does require a bit more memory though, so 16GB is a bit tight indeed, especially around epoch switches. I am looking forward to re-evaluating this on the one P2P relay with 16GB memory with the upcoming node 8.x, which is said to be fully compatible with GHC 9.2, whose compiler should have superior garbage collection.

For now, on a P2P relay with 16GB memory, I recommend a restart after each epoch switch (every 5 days) to be on the safe side.
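If you want to automate that, a hypothetical crontab entry could look like this. Plain cron cannot express “every 5 days aligned to epoch boundaries”, so this sketch simply restarts daily shortly after the mainnet epoch boundary time (~21:45 UTC); adjust the time and service name to your setup:

```
# Hypothetical: restart the relay daily at 22:00 UTC, which also covers
# the days of the mainnet epoch switch (boundary is around 21:45 UTC).
# Replace "cardano-node" with your actual systemd unit name.
0 22 * * * systemctl restart cardano-node
```

Restarting daily is more aggressive than the every-5-days advice above, so a systemd timer or a small script that checks the epoch would be the cleaner option.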


Hello @ChrisSTR8, and thank you for your message!

I’m staying with my BP and one relay in non-P2P, and my other relay in P2P mode. The connection drops have stopped so far (for whatever reason). All my nodes have 32GB memory, so I have not experienced any memory problems since I started my pool, but I can’t wait for a new release with a more recent version of GHC indeed!


What do you mean by this? Do you mean that P2P enables full-duplex connections and therefore removes the ability to selectively firewall the pulling of blocks from the main producer vs a backup?

What race conditions did you see before?

Yes, in P2P mode the firewall trick won’t work for failover anymore, so you’ll produce forks all over the place with a failover BP.

The race conditions (which I have yet to see since enabling P2P on my relays) are blocks scheduled on my block producer where another pool has a leader slot one slot/second before or after my pool’s slot. In these cases, I have seen lost blocks (height battles) a number of times due to the latency of the relay network as a whole. In the last 60 epochs, the longest stretch of epochs where I had none of these was 9 epochs. Now I am about to complete the 11th consecutive epoch without race conditions, and P2P was enabled for most of these 11 epochs. It is still early to say, but I see a positive trend.


OK, I understand what you mean now. Though, it is not really a race condition because the outcome is not determined by first arrival.

I call these all “fork battles”. So “slot battles” are “fork battles” where the slot is identical. Fork battles where the slots differ happen more often for me since my pool is in Australia, where I suffer increased network delays to everyone else in Europe and the USA. This means that if my block producer produces a block within 1 second of another pool, it will often result in a “fork battle”, and my pool will lose half of those battles. The more “fork battles” your pool gets involved in, whether the slot is identical or not, the more orphaned blocks you end up with.

To put some numbers on this problem: Let’s say your pool is the one small pool sitting out there on its own with 1-second network delays, and everyone else is running their pools in one major data centre where their network delays are well under 1 second. The large majority will only see “fork battles” with each other 5% of the time, and each will lose half. So the majority will lose 2.5% of blocks. But your tiny remote pool will end up facing “fork battles” 15% of the time, and so you will lose 7.5% of your blocks. This is because you will get “fork battles” not just with block producers awarded the same slot, but also with those awarded the slot before and the slot after.
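The arithmetic above can be sketched like this (assuming the ~5% figure comes from the active slot coefficient, i.e. roughly a 5% chance that some other pool also leads any given nearby slot):

```python
# Probability that some other pool is elected leader in any given slot
# (roughly the active slot coefficient, an assumption for this sketch).
p_leader = 0.05

# A well-connected pool only collides with blocks in the *same* slot.
battles_central = 1 * p_leader          # 5% of its blocks end up in battles
lost_central = battles_central / 2      # loses half of them -> 2.5%

# A remote pool with ~1 s delays also collides with the slot before
# and the slot after, i.e. 3 slots in total.
battles_remote = 3 * p_leader           # 15% of its blocks end up in battles
lost_remote = battles_remote / 2        # loses half of them -> 7.5%

print(f"central pool loses {lost_central:.1%} of its blocks")
print(f"remote pool loses {lost_remote:.1%} of its blocks")
```

This reproduces the 2.5% vs 7.5% asymmetry purely from the extra two slots of exposure.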

This is an unfairness in the protocol design which unfortunately incentivises centralisation, because the fix for it is for me to warehouse my stake pool in a data centre in Germany. If everyone does the same, then there will be fewer “fork battles”. But this is not good for decentralisation.

If we are serious about decentralisation, then stake pools should be run just as geographically decentralised as the owners of the tokens are.

What makes this problem even worse is that all “fork battles” are settled deterministically based on the block VRF. This opens up an attack vector, because a malicious stake pool now has the opportunity to look at the previous block’s VRF value and then decide to deliberately make a fork if his block will have a lower VRF score. He already knows he will win this fork battle with his lower block VRF, so creating such forks will give him a higher percentage of rewards through orphaning more blocks produced by other pools.
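That adversarial decision can be sketched as follows (illustrative only; the function name and the scalar VRF comparison are simplifications for this sketch, not the actual node code):

```python
def should_fork_deliberately(my_block_vrf: int, tip_block_vrf: int) -> bool:
    """A malicious pool's decision rule (illustrative).

    Because fork battles are settled deterministically in favour of the
    lower block VRF, an attacker can check in advance whether forking
    around the block at the tip would win, and only fork when it would.
    """
    return my_block_vrf < tip_block_vrf

# The attacker only creates forks it is guaranteed to win:
assert should_fork_deliberately(my_block_vrf=17, tip_block_vrf=42)      # fork: guaranteed win
assert not should_fork_deliberately(my_block_vrf=42, tip_block_vrf=17)  # extend the chain instead
```

The point is that the attacker takes no risk: every deliberate fork orphans someone else’s block, never his own.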

I argue that it is a bad idea to settle all “fork battles” deterministically by the block VRF. I argue that only forks where the slot is identical should be settled by the block VRF (i.e. true “slot battles”).

I am writing a CIP about both these issues:

  • Network delays causing unfairness through more physically remote pools suffering increased “fork battles”. A solution is to increase the slot duration to, say, 4 seconds. This is still below what the protocol targets as the upper bound (5 seconds).
  • Deterministic settling of “fork battles” based on the block VRF. This deterministic settlement should be limited to only when the slot is identical. Fork battles where the slots differ should be settled by each node preferring the block it already has (i.e. first block wins). This is the safer option because it removes the attack vector I mentioned above.
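The proposed tie-break could be sketched like this (illustrative pseudologic in Python with a made-up `Block` stand-in, not the actual chain-selection code):

```python
from collections import namedtuple

# Minimal stand-in for a block header (illustrative).
Block = namedtuple("Block", ["slot", "vrf"])

def prefer_candidate(current: Block, candidate: Block) -> bool:
    """Decide whether a node should switch its tip to `candidate` when
    both blocks sit at the same chain height (a fork battle).

    Proposed rule: only settle by block VRF when the slots are identical
    (a true slot battle); otherwise keep the block the node already has
    (i.e. first block wins).
    """
    if candidate.slot == current.slot:
        return candidate.vrf < current.vrf   # slot battle: lower VRF wins
    return False                             # different slots: keep current tip

assert prefer_candidate(Block(slot=100, vrf=40), Block(slot=100, vrf=10))      # same slot, lower VRF wins
assert not prefer_candidate(Block(slot=100, vrf=40), Block(slot=101, vrf=10))  # different slot: first seen wins
```

Under this rule, the attacker above can no longer pre-compute a guaranteed win for an adjacent-slot fork, because first arrival (which he cannot control network-wide) decides it.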

The combination of the two suggested changes would make the protocol fairer, better incentivise decentralisation, and make it more resistant to attack.

This is not just about fair distribution of rewards. It is also about the fact that the number of blocks each pool gets into the final chain needs to properly reflect the stake distribution, because that is a fundamental security assumption of the Ouroboros protocol.
