Topology between Relays and Blockproducers in a High Availability Scenario

CardanoCafe · 19 March 2021 23:07

I would like to test a high availability scenario. I’m not sure if that’s the correct plan, but I thought I should give it a try:

Let’s imagine a scenario with one active block producer node and one hot-standby backup (Blockproducer A and Backup-Blockproducer B).

Let’s imagine my relays automatically select which block producer IP address they connect to: either Blockproducer A or Backup-Blockproducer B.
If Blockproducer A goes down, all relays automatically connect to Backup-Blockproducer B. I haven’t tested this yet, but in theory the automatic switch-over should work using virtual IP addresses and some testing logic.

My question:
All relays would have outgoing connections to either Blockproducer A or Backup-Blockproducer B only. They would never have simultaneous outgoing connections to both block producers.

BUT: Is it okay if Backup-Blockproducer B has outgoing connections to all relays continuously, regardless of the status of Blockproducer A? If this is okay, it would reduce the complexity of the setup, since I would only need to control the IPs the relays connect to.

My question focuses on the difference between outgoing and incoming connections. I could rephrase it like this: Is it harmful to Cardano’s ecosystem if I have a duplicate of my block producer running live on the internet which has outgoing connections to relays, while no relay has an outgoing connection to this duplicated block producer node?

Thanks so much for your time!

Cheers,
Chris

laplasz · 19 March 2021 23:39

good question. I had the experience that if the block producer node - B - does not have any incoming connections then the created block was not validated. (I planned to ask this in a separate topic what could be the reason for that behavior)

maybe because the created block is not sent to the relay if the connection initiated by the producer…
but if it is not true and it is sent to the relays on the connection which was initiated by the producer that could cause issues - since the other block producer node - A - also created the block.

in general having 2 active/synced block producer nodes is not a good idea - I would try this setup first on the testnet…

CardanoCafe · 20 March 2021 10:33

Thanks for your answer and your feedback!

The benefit is that when Backup-Blockproducer B has permanent outgoing connections (but no incoming connections), it stays fully synced with the blockchain, so in a failover event it can react quickly and take over forging blocks. However, we need to be sure that a block producer which has only outgoing connections (and no incoming connections) cannot harm the Cardano ecosystem. That’s the big question here.

I know there are lots of other solutions out there, like VMware High Availability. But if HA can be achieved easily with tools like HAProxy, it would be good for the community to share this knowledge or build a guide, so that anybody who wants to can build an easy HA setup without having to invest lots of money!

Cheers,
Chris

christobear12 · 21 March 2021 19:26

I’m also interested in an HA solution. I’m trying to experiment with a backup block producer. I think it would work if the producer and backup producer are connected and the relays use a heartbeat to switch their topology files over to the backup producer. I do think the issue would be whether the backup producer is able to sync.

Can anyone confirm that the backup producer would sync if it has the relays in its topology and the relays DON’T have the producer in their topologies?

That could be one option. I saw another option on reddit where a relay was used to switch over to producer mode so it’s already synced. I do have issues with this method though. It probably has keys on it to produce blocks because it’s a back up producer. In relay mode, that would expose a producer to the network. I guess unless that relay was hiding behind another relay and isn’t used to communicate to other producers. Another problem that seems to pop up is KES rotation.
I found a few links that also discuss this in other places. I put the links below.

laplasz · 21 March 2021 20:23

If there are multiple block producer nodes, each of them will create the block. The question is whether the block from the backup node will be propagated to the relays if there is no incoming connection from a relay. Basically, a connection should not be treated differently depending on who initiated it.

CardanoCafe · 21 March 2021 20:50

That’s exactly the question! Looks like we need to learn this by testing, or is there anybody who knows a bit more than we do? And I absolutely agree with your last sentence: basically, it should make no difference who initiated the connection.

Inexpert · 22 March 2021 13:09

What if you set it up so that there wasn’t an outgoing connection for your fallback node? I think this could be accomplished with two floating IPs, where the backup one is reserved but not pointed at the relay node.

If I have this right (I might not), then both Block Producer A (BP_A) and BP_B should be able to download and keep a fully up-to-date blockchain. BP_A will communicate out normally through Floating IP A (FIP_A), while no matter what, BP_B cannot reach the relay since it’s a dead end.

In the event of a failure of BP_A, BP_B could call an API that unassigns FIP_A and reassigns FIP_B to R. This would let you keep all of the topology files static while letting the topology itself change.

ethos777 · 18 April 2023 14:55

I’ve added a cron job on each of my BPs that can add/remove firewall rules based on a flag provided by a separate VPS that is pinging all my BP nodes and decides which one to switch my relays to. My stake is small while I’m testing (no blocks so far) but I do see that the node with incoming relay connections is processing transactions while the other BP nodes aren’t. That is expected.

The switch-over takes about a minute after the cron job kicks in. I see my relays connecting to the new BP node. My script restarts the old BP node, as firewall rules don’t disconnect already established connections.

I’m seeing one problem though and I hope someone can chime in on that. The BP node I switch my relays to starts processing transactions after some delay. Sometimes it is a few minutes, sometimes it is a few hours. I don’t know if that is going to be a problem. The cncli ping returns “ok” for that node.
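The decision step of a monitor like the one described in this post could be sketched roughly as follows. This is only an illustration under assumptions: the names `choose_active_bp` and `firewall_plan`, the node labels, and the "stay on the current node while it is healthy" rule are hypothetical and not taken from the actual script.

```python
from typing import Dict, List, Optional


def choose_active_bp(current: Optional[str],
                     health: Dict[str, bool],
                     priority: List[str]) -> Optional[str]:
    """Pick the one BP that relays should be allowed to reach.

    `health` maps a (hypothetical) node label to its latest ping result.
    """
    # Stay on the current BP while it is healthy, so the relays do not
    # flap back and forth between nodes.
    if current is not None and health.get(current, False):
        return current
    # Otherwise fail over to the first healthy node in priority order.
    for name in priority:
        if health.get(name, False):
            return name
    # No healthy BP at all.
    return None


def firewall_plan(active: Optional[str], priority: List[str]) -> Dict[str, str]:
    """Flag each BP: allow relay connections on the active one only."""
    return {name: ("allow" if name == active else "block") for name in priority}


if __name__ == "__main__":
    nodes = ["bp_a", "bp_b"]
    active = choose_active_bp("bp_a", {"bp_a": False, "bp_b": True}, nodes)
    print(active, firewall_plan(active, nodes))
```

Each BP's cron job would then only need to read its own flag from the plan and add or remove the firewall rule accordingly.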

os11k · 18 April 2023 15:58

How do you see that the relays are connecting to the new BP?

ethos777 · 18 April 2023 18:59

There are incoming connections listed in the live view. The number starts increasing soon after I unblock the firewall. I’ve got 4 relays and the number goes from 0 to 4.

os11k · 18 April 2023 19:42

And the backup BP has outgoing connections to the relays all the time, right?

ethos777 · 18 April 2023 22:09

That is correct. My understanding is that as long as there are no incoming connections, there is no interference with the core node. That way I can keep the backup node synced and ready.

mo-moc · 19 April 2023 02:10

I think that any scripted solution involving IP and firewall changes could have side effects on the network. Most proof-of-stake blockchain networks are designed to work with one BP, because of the complexity of having multiple potentially duplicated blocks transmitted to the network at the same time.

I’m not 100% sure what happens in Cardano when a duplicate block is sent by a cloned BP, but I assume that one block will go through and the other might create a fork and be rejected at a later time. Anyhow, my point is that there is a reason there is no official config for this: the node is not designed to work that way.

I’m not trying to discourage innovation, just giving feedback. I know that, for example, Solana had some sort of backup producer config that acted like an active-passive system: both nodes were in sync with the network but had different status flags, and they maintained a heartbeat between them to determine which one should be promoted to active. They ended up discontinuing it because it was buggy.

MO

Terminada · 19 April 2023 09:06

If your backup BP has connections with any relay which is connected to other Cardano relays then you will produce dual blocks and cause forks.

People are now running nodes with P2P capability on mainnet and these nodes establish bi-directional links (duplex and “full duplex” connections). This means that if your block producer connects to a relay, even if the block producer initiated the connection, the relay can pull data from the block producer if the link is upgraded to a duplex link.

Consider the situation where your block producer is behind a firewall with no open incoming ports:

If your block producer, with P2P capability, connects out through such a firewall to a relay, then that relay may pull blocks from the block producer, even if the firewall does not allow any incoming connections at all. This is because the block producer established the connection, the firewall allows packets for established connections, and the relay can upgrade the connection to a duplex or full-duplex connection. So both the block producer and the relay can pull blocks from each other.
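As a toy model of the claim above (a simplification of what is described in this thread, not code from cardano-node): on a one-directional connection only the side that opened it pulls blocks, while on a duplex connection either side can. The function name and the peer labels are made up for illustration.

```python
def can_pull_blocks(puller: str, initiator: str, duplex: bool) -> bool:
    """Return True if `puller` can fetch blocks from its peer on this connection.

    Simplified model, assuming:
    - one-directional link: only the side that opened the connection pulls;
    - duplex link: both sides can pull, whoever opened it.
    """
    if duplex:
        return True
    return puller == initiator


if __name__ == "__main__":
    # Backup BP opens the connection to the relay through its firewall.
    print(can_pull_blocks("relay", initiator="bp", duplex=False))  # one-directional
    print(can_pull_blocks("relay", initiator="bp", duplex=True))   # duplex
```

Under this model, a firewall that blocks new incoming connections does not change the duplex case, because the relay never has to open a connection of its own.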

The consequence of producing dual blocks is that they can often have different transactions in them depending on what the mempool of each block producer contained at the instant the block was forged. This means that each block will get a different hash and your pool key will have signed each block. It will therefore be obvious to other pools in Cardano that your pool produced “dual blocks” which is very frowned upon because it causes every other node to have to choose one even though both are perfectly valid. The next two block producers could each choose a different block to build upon and therefore one of these block producers will get their block orphaned.

And, they will blame you for their loss. Social pressure from other pools complaining about your malicious and un-Cardano behaviour may cause your delegators to move their stake to other pools.

Just imagine how pissed off a small stake pool operator can be when they finally get their chance to mint a block and someone running dual block producers causes their only block for the month (or year) to be orphaned. Well, I wouldn’t want to be on the other side of that anger.

If you want to try running dual block producers, you had better hope that you only cause one of the Binance pools to get an orphaned block.
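The point about dual blocks getting different hashes can be illustrated with a small sketch. This is not Cardano's real block format or hashing procedure; the JSON layout, the field names, and the function `toy_block_hash` are assumptions made purely for illustration. The only idea shown is that two blocks for the same slot from the same pool hash differently as soon as their transaction sets differ.

```python
import hashlib
import json
from typing import List


def toy_block_hash(slot: int, issuer: str, txs: List[str]) -> str:
    """Hash a toy block; any difference in contents changes the digest."""
    # Hypothetical serialization: real blocks are not encoded like this.
    body = json.dumps({"slot": slot, "issuer": issuer, "txs": txs},
                      sort_keys=True).encode("utf-8")
    return hashlib.blake2b(body, digest_size=32).hexdigest()


if __name__ == "__main__":
    # Same slot, same pool, but each producer had a different mempool.
    block_a = toy_block_hash(1000, "pool1", ["tx1", "tx2"])
    block_b = toy_block_hash(1000, "pool1", ["tx1", "tx3"])
    print(block_a != block_b)  # two competing blocks for one slot
```

Because both blocks carry the same issuer and slot, anyone who sees both can tell that one pool produced them.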

os11k · 19 April 2023 09:35

Well, I’m not sure how it is now. But when I tried this about half a year ago, block pulling didn’t work over an outgoing connection from the BP to the relay using a duplex connection, assuming the relay does not connect to the BP and only the BP connects to the relay.

I might be wrong, but my assumption is that “duplex” is just a fancy name for reusing the same ports for both connections. In reality there is still a connection from the relay to the BP and one from the BP to the relay; they just use 2 ports now instead of the 4 used in non-duplex mode.

In my current production setup, I have non-P2P connections between the relay and the BPs.

My backup BP has a constant connection to the relay, but not the other way around. I do not produce dual blocks.

Again, I personally don’t enable P2P between producers and relays. I haven’t had a chance to retest this, but I would assume that if the relay does not connect to the BP, it should not be able to pull blocks, even with duplex on.

Terminada · 19 April 2023 10:19

If you watch this video at the 5 min mark:

Duncan Coutts outlines how a block producer can connect through a firewall, that blocks all incoming new connections, to a relay, and that relay will be able to pull blocks from the block producer over the duplex connection between them.

It is true that if you continue to run your block producer and relay in legacy mode, then you may be able to prevent these duplex connections.

ethos777 · 19 April 2023 14:00

Another solution is to run your backup node as a relay so that it syncs all the time. If the need arises, it is very easy to turn it into a BP; that just requires a restart. The switch-over takes just a bit longer. Considering possible duplex connections in P2P mode, that would be a bulletproof alternative. There are just 2 more lines of code to add to my script.

The only problem with that approach is that after a restart the node might enter DB validation mode, which takes about 40 minutes to complete. I see that happening sometimes, not sure why.

os11k · 19 April 2023 16:24

I personally use haproxy for this and it works flawlessly.

So I have several relays connecting to haproxy, and haproxy decides where to send the traffic. Failover is almost instant. I do not use P2P between relays and BPs.

I took that idea from here:

github.com/input-output-hk/cardano-node

[Validation Query] - In absence of native HA support for core nodes, any risks foreseen due to use of haproxy?

opened 07:43AM - 22 Sep 20 UTC

closed 08:01AM - 27 Oct 22 UTC

rdlrt

enhancement

**Internal/External**
*External*

**Background**
Given that `cardano-node…` does not provide a native way to perform HA, we're thinking about testing and implementing an external solution based on haproxy, especially given that #1132 has been delayed indefinitely. (PS: Previous feature request #1273 for native support didnt receive any ack, so assuming that's not on priority list for next few months)

Note that the intention is not to have more availability or reduce maintainance windows , but to have redundancy for core nodes. We're looking at adding some sample instructions to SPOs in [guild-operators docos](https://cardano-community.github.io/guild-operators) for allowing their relay nodes to point to a local haproxy IP:port bound service, which could map out to active/passive connection to multiple cores.

Sample relevant haproxy frontend/backend would be:

```
frontend app
  bind 127.0.0.1:6000
  default_backend cnode_core

backend cnode_core
  balance source
  server c1 IP1:6000 check
  server c2 IP2:6000 check backup
```

Each of the relay node would connect locally to a ha proxy service and get redirected to active core node to fetch blocks. The only edge-case/disadvantage we could think of was when if there is a block created right before an outage on active core node, it could potentially result in a temporary fork branch that should be resolved automatically. But the advantage from HA far outweighs a short inconvenience from unintentionally created fork during edge case scenario. The blocks from passive core wouldn't be pulled unless there is really an outage

**Question**
Is this a viable approach? Do the developers foresee any issues with the approach (eg: rollbacks on backup node)?
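To make the active/passive behaviour of the quoted backend easier to follow, here is a minimal sketch of how a `backup` server is selected: it only receives connections when no regular server passes its health check. The `Server` class and `pick_server` function are hypothetical names for illustration, and the model ignores `balance source` hashing because there is only one regular server in the sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    up: bool              # result of the latest health check
    backup: bool = False  # mirrors the `backup` keyword on a server line


def pick_server(servers: List[Server]) -> Optional[str]:
    """Return the server that would receive relay connections."""
    # Regular servers are always preferred while at least one is up.
    regular = [s for s in servers if s.up and not s.backup]
    if regular:
        return regular[0].name
    # Backup servers are used only when every regular server is down.
    backups = [s for s in servers if s.up and s.backup]
    return backups[0].name if backups else None


if __name__ == "__main__":
    cores = [Server("c1", up=False), Server("c2", up=True, backup=True)]
    print(pick_server(cores))  # c1 is down, so traffic goes to c2
```

One consequence of this rule is that traffic returns to `c1` as soon as it is healthy again, so after a short outage the relays go back to the original core without manual intervention.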

ethos777 · 19 April 2023 16:35

But your BP still connects to your relays in order to get synced, right? In legacy mode, to eliminate duplex connections? I’m wondering: if the relay is in P2P mode and the BP is in legacy mode, is duplex still possible?

os11k · 19 April 2023 16:55

For now, duplex has to be enabled explicitly in the config file using "TestEnableDevelopmentNetworkProtocols": true

So as long as it is not enabled, your node should not use it, even in P2P mode.
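A quick way to check a node for the flag mentioned above is to parse its JSON config and look the key up. This is a minimal sketch: the function name `dev_protocols_enabled` is made up, and whether this key is really what controls duplex connections is the claim made in this post, not something the code verifies.

```python
import json


def dev_protocols_enabled(config_text: str) -> bool:
    """Return True only if the flag is present and set to true."""
    config = json.loads(config_text)
    # A missing key is treated as "not enabled".
    return config.get("TestEnableDevelopmentNetworkProtocols", False) is True


if __name__ == "__main__":
    sample = '{"TestEnableDevelopmentNetworkProtocols": true}'
    print(dev_protocols_enabled(sample))
    print(dev_protocols_enabled("{}"))
```

In practice the text would come from reading the node's own config file.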
