Block Producer - Failover Approach with a BP Standby

Markus-VITAL · 11 August 2021 20:48

Since we initially set up our pool, we have been running 2 BPs with a failover approach.
The setup consists of

one BP which runs as a BP (active)
one BP which runs as a Relay but has the required files to be able to run as a BP (Standby)

The failover procedure was handled manually, based on an SMS alert triggered from Healthchecks if the TIP of the server was getting old. The corresponding script was shared here before (GitHub - ResponsibleStaking/Cardano-SPO-Scripts).

Now we have taken the next step and fully automated the failover.

The idea is to use an external service to receive TIP updates from the individual BPs. The service then takes failover decisions based on the TIP, not just on the availability of the port or of the server in general. Reporting the TIP is a very comprehensive validation of a healthy system status, because it fails if the server is disconnected, down, out of sync, or running out of disk space. Many different problems show up as a stale TIP.

If the TIP of the active BP gets older than 5 minutes and the TIP of the standby is fresher, a failover is initiated and the standby takes over.
The failover is executed on the client level by the script which sends the TIP. This script executes an activation or deactivation of the server, which can be handled by just copying over the ENV file and restarting the service.
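
A minimal sketch of the decision rule described above, written in Python for illustration only. It is not the code from the repository; the `NodeStatus` shape, the field names, and the idea of tracking when the tip last advanced are assumptions. Only the 5-minute threshold and the "standby is fresher" condition come from the post.

```python
from dataclasses import dataclass

MAX_TIP_AGE_SECONDS = 5 * 60  # the 5-minute threshold from the post


@dataclass
class NodeStatus:
    """Last state the service recorded for one BP (hypothetical shape)."""
    name: str
    tip_slot: int          # highest slot number the node has reported
    tip_changed_at: float  # Unix time at which tip_slot last increased


def should_fail_over(active: NodeStatus, standby: NodeStatus, now: float) -> bool:
    """Return True when the standby should take over from the active BP."""
    active_tip_age = now - active.tip_changed_at
    active_is_stale = active_tip_age > MAX_TIP_AGE_SECONDS
    # Require the standby to be ahead, so a pause that hits both nodes
    # equally does not cause a switch.
    standby_is_fresher = standby.tip_slot > active.tip_slot
    return active_is_stale and standby_is_fresher


if __name__ == "__main__":
    now = 1_000_000.0
    stuck = NodeStatus("bp-active", tip_slot=100, tip_changed_at=now - 600)
    healthy = NodeStatus("bp-standby", tip_slot=130, tip_changed_at=now - 10)
    stalled = NodeStatus("bp-standby", tip_slot=100, tip_changed_at=now - 600)
    print(should_fail_over(stuck, healthy, now))  # active stale, standby ahead
    print(should_fail_over(stuck, stalled, now))  # both stalled at the same tip
```

Checking both conditions means a stale tip alone is not enough: if the standby has not advanced either, the service has no better candidate and leaves things as they are.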

Currently this runs in dry-run mode. Some simulated scenarios were handled well.

Advantages of this approach:

The TIP validates system health very comprehensively
Servers may reside in any location (no need to be on the same internal network)
No inbound connectivity on the server is required; the heartbeat just sends info out, and the response contains the new desired server status
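
The outbound-only heartbeat from the last point could look roughly like the following Python sketch. The payload fields (`node`, `tip`, `role`), the `desired_role` reply field, and the `send` / `switch_role` callables are hypothetical names chosen for illustration; the actual service and script in the repository may differ.

```python
from typing import Callable, Dict, List


def heartbeat_once(
    node_name: str,
    tip_slot: int,
    current_role: str,
    send: Callable[[Dict], Dict],
    switch_role: Callable[[str], None],
) -> str:
    """Report the tip once and apply whatever role the service answers with.

    `send` stands in for an outbound HTTPS request to the service, and
    `switch_role` for the local activation/deactivation step (for example
    copying the ENV file and restarting the node service).
    """
    reply = send({"node": node_name, "tip": tip_slot, "role": current_role})
    # Keep the current role if the reply carries no instruction.
    desired_role = reply.get("desired_role", current_role)
    if desired_role != current_role:
        switch_role(desired_role)
    return desired_role


if __name__ == "__main__":
    switched: List[str] = []

    def fake_service(payload: Dict) -> Dict:
        # Pretend the service decided that this standby must take over.
        return {"desired_role": "active"}

    new_role = heartbeat_once("bp-standby", 130, "standby", fake_service, switched.append)
    print(new_role, switched)
```

Because the node only ever makes outgoing requests and reads the reply, no inbound port has to be opened for the failover mechanism itself.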

Limitations of the approach:

There may still be other indicators of a bad server status, like a bad (slow) network connection or time-syncing issues, which are currently not reported to the service
The installation is currently very manual and involves many steps.

If you are interested check it out on GitHub: GitHub - ResponsibleStaking/Cardano-Heartbeat-Failover: AWS based Cloud Service to take Failover Decis

I’m happy about any Feedback. I know the approach is different to what other SPOs are doing. So if I missed anything, please let me know!

Please note: I do not provide any kind of warranty for the provided code. Use it at your own risk.

CardanoIstheWay · 14 August 2021 16:22

IMHO, all relays should be configured as bps as a best practice, since it doesn’t add any overhead to do that.

Markus-VITAL · 14 August 2021 17:35

I don’t agree. I would not provide files on the servers which are not absolutely necessary.

CardanoIstheWay · 14 August 2021 21:19

What extra files? There would not be any keys or additional active code on them. At least, I haven’t observed any guides that specify additional code for a core node.

If my thinking is wrong, I would appreciate the feedback.

Markus-VITAL · 14 August 2021 21:47

My understanding is the following:

The Relays do not need any pool specific files.
The BP(s) require the Operational Certificate, VRF Signing Key and KES Signing Key
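
To illustrate the difference, here is a small Python sketch that builds the extra `cardano-node run` arguments for each role. The three `--shelley-*` flags are the node's options for passing these files; the file names (`kes.skey`, `vrf.skey`, `node.cert`), the role strings, and the helper itself are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import List

# cardano-node flag -> file name (the file names are assumed conventions)
BP_ONLY_FILES = {
    "--shelley-kes-key": "kes.skey",
    "--shelley-vrf-key": "vrf.skey",
    "--shelley-operational-certificate": "node.cert",
}


def extra_node_args(role: str, pool_dir: str) -> List[str]:
    """Return the pool-specific arguments a node needs for the given role."""
    if role == "relay":
        return []  # a relay runs without any pool-specific files
    if role != "bp":
        raise ValueError(f"unknown role: {role!r}")
    args: List[str] = []
    for flag, file_name in BP_ONLY_FILES.items():
        args += [flag, str(PurePosixPath(pool_dir) / file_name)]
    return args


if __name__ == "__main__":
    print(extra_node_args("relay", "/opt/pool"))
    print(extra_node_args("bp", "/opt/pool"))
```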

CardanoIstheWay · 14 August 2021 22:03

Right. Since one would not put any keys on the relay and there is no additional active code, setting all relays up to be core nodes is simple and safe, no?

By putting keys on the relay, it’s a dormant (?) core node. That’s not what I was suggesting. Hence, I think you might have misunderstood me.

Markus-VITAL · 14 August 2021 22:22

I’m not sure if I get what you are suggesting. Please explain what you mean with “setting all relays up to be core nodes” and how this is relevant for the failover approach.

CardanoIstheWay · 14 August 2021 23:18

Basically, just having everything ready to go except the keys (and cert). So, for example, if one is using CNTools, one would configure everything necessary to make it a core node should the need arise, but not actually go through the steps (i.e. don’t generate keys).

Clarified?

Markus-VITAL · 15 August 2021 06:04

Ok. Yes. The software installation is exactly the same. I installed one Relay first and then built the others from a Snapshot of this machine. Just env, topo, firewall rules, keys and active services are different. E.g. Topo Updater only on Relays. SendTip only on BP.

atdtpool · 15 August 2021 18:47

I’ve been trying to find a rational explanation for why running two or more active BPs would be bad. So far, it just seems that it would just be inelegant because it would result in slot battles with yourself. But… isn’t that better than having a single point of failure? (What’s the practical drawback?)

Alexd1985 · 15 August 2021 19:05

If both nodes create the block… the network will penalize the servers… that’s why you can’t use more producers with the same VRF files at the same time

atdtpool · 15 August 2021 19:22

Thanks for this! Is this documented anywhere?

Alexd1985 · 15 August 2021 19:40

I think I read in the past but I can’t find it now

Markus-VITAL · 15 August 2021 20:22

It’s not penalized to my knowledge. Maybe it will be. But it generates trash blocks which need to be reverted. It’s not good practice to do it that way.

Alexd1985 · 15 August 2021 20:23

Aaa ok, anyway… the blocks will not be valid right?

Markus-VITAL · 15 August 2021 20:39

One of the 2 blocks will survive

Alexd1985 · 15 August 2021 20:41

You are saying that if I run 2 producers at the same time now, I will still create the blocks? Hmm, then why does no one run 2 producers?

atdtpool · 15 August 2021 20:42

OP: this looks like an interesting approach, thanks for sharing the code. Please let us know if you end up making this live and how it works out in practice.

Markus-VITAL · 15 August 2021 20:57

Please do not consider running 2 producers as a valid approach! It should work but is not an intended scenario. That’s why I looked into a good failover approach. I came across some which switch on the network level, like the failoverd based approach. But this only works with nodes on the same network. So I started to work on my own approach.

So far the dry run did nothing unexpected. I simulated a network issue and a stopped node; both resulted in a switchover. A long time without new slots at the epoch border did not trigger a switchover. So everything looks good and I will take it live soon.

atdtpool · 15 August 2021 21:33

I still haven’t heard a practical/specific reason for not running concurrent core nodes, aside from style concerns. This may generate “trash blocks” but those are quickly discarded. Abandoned forks (invalid blocks) are common, right? If there was a practical concern regarding overhead for the consensus protocol, then a significant DDoS attack on Cardano would be trivial, wouldn’t it?

I would love if somebody could point out the game-theoretical downside for an SPO doing this.
