Block Producer - Failover Approach with a BP Standby

Since we first set up our pool, we have been running 2 BPs with a failover approach.
The setup consists of:

  • one BP which runs as a BP (active)
  • one BP which runs as a Relay but has the required files to be able to run as a BP (Standby)

The failover procedure was handled manually, based on an SMS alert triggered from Healthchecks if the TIP of the server was getting old. The corresponding script was shared here before (GitHub - ResponsibleStaking/Cardano-SPO-Scripts)

Now we have taken the next step and fully automated the failover.

The idea is to use an external service that receives TIP updates from the individual BPs. The service then makes failover decisions based on the TIP, not just on the availability of the port or of the server in general. Reporting the TIP is a comprehensive check of system health, because it fails if the server is disconnected, down, out of sync, or running out of disk space. Many different causes of failure are covered this way.

If the TIP of the active BP becomes older than 5 minutes and the TIP of the standby is fresher, a failover is initiated and the standby takes over.
The failover is executed on the client side by the script that sends the TIP. This script activates or deactivates the server, which can be done simply by copying over the ENV file and restarting the service.
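The decision rule above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the `TipReport` type, its field names, and the reading of "TIP age" as seconds since the reported TIP last advanced are all assumptions.

```python
from dataclasses import dataclass

STALE_AFTER_SECONDS = 5 * 60  # threshold from the post: 5 minutes


@dataclass
class TipReport:
    node: str               # hypothetical node label, e.g. "active" or "standby"
    tip_age_seconds: float  # assumed: seconds since this node's TIP last advanced


def should_fail_over(active: TipReport, standby: TipReport,
                     stale_after: float = STALE_AFTER_SECONDS) -> bool:
    """Return True when the active TIP is stale AND the standby TIP is fresher."""
    active_is_stale = active.tip_age_seconds > stale_after
    standby_is_fresher = standby.tip_age_seconds < active.tip_age_seconds
    return active_is_stale and standby_is_fresher


# Active is 7 minutes behind, standby is current: fail over.
print(should_fail_over(TipReport("active", 420), TipReport("standby", 15)))
# Both are equally old (e.g. no new blocks for a while): stay put.
print(should_fail_over(TipReport("active", 420), TipReport("standby", 420)))
```

Requiring the standby to be fresher is what keeps a general pause in block production from causing a pointless switch, since in that case both TIPs age together.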

Currently this runs in dry-run mode. The scenarios simulated so far were handled well.

Advantages of this approach:

  • The TIP is a comprehensive indicator of system health
  • Servers may reside in any location (no need to be on the same internal network)
  • No inbound connectivity to the server is required; the heartbeat only sends information out, and the response contains the new desired server status
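A minimal sketch of the outbound-only heartbeat from the last point, seen from the client side. The JSON field names (`node`, `tip`, `active`, `desired_active`) are made up for illustration and are not the actual format used by the service; the `activate` and `deactivate` callbacks stand in for whatever swaps the ENV file and restarts the node.

```python
import json
from typing import Callable


def build_heartbeat(node_id: str, tip: int, is_active: bool) -> str:
    """Serialize the outbound heartbeat (hypothetical field names)."""
    return json.dumps({"node": node_id, "tip": tip, "active": is_active})


def apply_response(response_body: str, is_active: bool,
                   activate: Callable[[], None],
                   deactivate: Callable[[], None]) -> bool:
    """Act on the desired status in the response and return the new local state."""
    desired_active = bool(json.loads(response_body)["desired_active"])
    if desired_active and not is_active:
        activate()    # e.g. copy the BP ENV file into place and restart the service
    elif not desired_active and is_active:
        deactivate()  # e.g. copy the relay ENV file into place and restart the service
    return desired_active


# Example: a standby is told to take over.
new_state = apply_response('{"desired_active": true}', False,
                           activate=lambda: print("activating BP"),
                           deactivate=lambda: print("deactivating BP"))
```

Because the desired status travels back in the response to the node's own request, the service never has to open a connection to the node.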

Limitations of the approach:

  • There may still be other indicators for a bad server status like bad (slow) network connection or time-syncing issues which are not reported to the service currently
  • The installation is currently very manual and involves many steps.

If you are interested check it out on GitHub: GitHub - ResponsibleStaking/Cardano-Heartbeat-Failover: AWS based Cloud Service to take Failover Decis

I welcome any feedback. I know this approach differs from what other SPOs are doing, so if I missed anything, please let me know!

Please note: I do not provide any kind of warranty for the provided code. Use it at your own risk.

IMHO, all relays should be configured as bps as a best practice, since it doesn’t add any overhead to do that.

I don’t agree. I would not provide files on the servers which are not absolutely necessary.


What extra files? There would not be any keys or additional active code on them. At least, I haven’t observed any guides that specify additional code for a core node.

If my thinking is wrong, I would appreciate the feedback. :slight_smile:

My understanding is the following:

  • The Relays do not need any pool specific files.
  • The BP(s) require the Operational Certificate, VRF Signing Key and KES Signing key
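One way to verify this split on a given machine is a small check for the three BP credentials. This is just a sketch: the file names below are placeholders, since the actual names depend on how the node was set up.

```python
from pathlib import Path

# Hypothetical file names for the operational certificate, VRF signing key
# and KES signing key; adjust them to match your own setup.
BP_CREDENTIAL_FILES = ("node.cert", "vrf.skey", "kes.skey")


def bp_credentials_present(directory: Path) -> list[str]:
    """Return the names of the BP credential files that exist in the directory."""
    return [name for name in BP_CREDENTIAL_FILES if (directory / name).is_file()]


# On a pure relay this should print an empty list.
print(bp_credentials_present(Path(".")))
```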

Right. Since one would not put any keys on the relay and there is no additional active code, setting all relays up to be core nodes is simple and safe, no?

By putting keys on the relay, it’s a dormant (?) core node. That’s not what I was suggesting. Hence, I think you might have misunderstood me. :slight_smile:

I’m not sure I get what you are suggesting. Please explain what you mean by “setting all relays up to be core nodes” and how this is relevant to the failover approach.

Basically, just having everything ready to go except the keys (and cert). So, for example, if one is using CNTools, one would configure everything necessary to make it a core node should the need arise, but not actually go through the steps (i.e. don’t generate keys).

Clarified? :slightly_smiling_face:

Ok. Yes. The software installation is exactly the same. I installed one Relay first and then built the others from a Snapshot of this machine. Just env, topo, firewall rules, keys and active services are different. E.g. Topo Updater only on Relays. SendTip only on BP.

I’ve been trying to find a rational explanation for why running two or more active BPs would be bad. So far, it just seems that it would just be inelegant because it would result in slot battles with yourself. But… isn’t that better than having a single point of failure? (What’s the practical drawback?)

If both nodes create the block… the network will penalize the servers… that’s why you can’t run multiple producers with the same VRF files at the same time

Thanks for this! Is this documented anywhere?

I think I read in the past but I can’t find it now

It’s not penalized to my knowledge. Maybe it will be. But it generates trash blocks which need to be reverted. It’s not good practice to do it that way.

Aaa ok, anyway… the blocks will not be valid right?

One of the 2 blocks will survive

You are saying that if I run 2 producers at the same time now, I will still create the blocks? Hmm, then why does no one run 2 producers?

OP: this looks like an interesting approach, thanks for sharing the code. Please let us know if you end up making this live and how it works out in practice.

Please do not consider running two producers at the same time a valid approach! It should work, but it is not an intended scenario. That’s why I looked into a good failover approach. I found some that switch at the network level, like the failoverd based approach, but this only works with nodes on the same network. So I started to work on my own approach.

So far the dry run has done nothing unexpected. I simulated a network issue and a stopped node, and both resulted in a switchover. A long period without new slots at the epoch border did not trigger a switchover. So everything looks good and I will make it active soon.

I still haven’t heard a practical/specific reason for not running concurrent core nodes, aside from style concerns. This may generate “trash blocks” but those are quickly discarded. Abandoned forks (invalid blocks) are common, right? If there was a practical concern regarding overhead for the consensus protocol, then a significant DDoS attack on Cardano would be trivial, wouldn’t it?

I would love if somebody could point out the game-theoretical downside for an SPO doing this.