Since we initially set up our pool, we have been running 2 BPs with a failover approach.
The setup consists of:
- one BP which runs as a BP (active)
- one BP which runs as a Relay but has the required files to be able to run as a BP (Standby)
Until now the failover procedure was handled manually, based on an SMS alert triggered from Healthchecks if the TIP of the server was getting old. The corresponding script was shared here before (GitHub - ResponsibleStaking/Cardano-SPO-Scripts).
Now we have taken the next step and fully automated the failover.
The idea is to use an external service that receives TIP updates from the individual BPs. The service then takes failover decisions based on the TIP, not just on the availability of the port or the server in general. Reporting the TIP is a very comprehensive health check, because it fails if the server is disconnected, down, out of sync, or running out of disk space. Many different failures show up in this one signal.
If the TIP of the active BP is older than 5 minutes and the TIP of the standby is fresher, a failover is initiated.
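To make the rule concrete, here is a minimal sketch of the decision in Python. This is not the code from the repository. The names `TipReport`, `should_fail_over` and `STALE_AFTER_SECONDS` are made up for illustration, and I assume the service tracks how many seconds ago each node last reported a fresh TIP.

```python
from dataclasses import dataclass

# Threshold from the post: a TIP older than 5 minutes counts as stale.
STALE_AFTER_SECONDS = 5 * 60


@dataclass
class TipReport:
    """Hypothetical record the service keeps per node."""
    node_id: str
    tip_age_seconds: float  # seconds since the node last reported a fresh TIP


def should_fail_over(active: TipReport, standby: TipReport,
                     stale_after: float = STALE_AFTER_SECONDS) -> bool:
    """Return True only if the active TIP is stale AND the standby TIP is fresher."""
    active_is_stale = active.tip_age_seconds > stale_after
    standby_is_fresher = standby.tip_age_seconds < active.tip_age_seconds
    return active_is_stale and standby_is_fresher
```

The second condition matters: if both nodes are stale (for example during a wider network problem), switching would not help, so nothing happens.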
The standby then takes over. The failover is executed on the client side by the same script which sends the TIP. This script executes an activation or deactivation of the server. This can be handled by simply copying over the ENV file and restarting the service.
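The activation step could look roughly like the sketch below. It is an illustration under assumptions, not the actual script: the file naming (`bp.env`, `relay.env`), the function names and the restart command are all hypothetical and depend on how your node service is set up.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def restart_service(command: Sequence[str]) -> None:
    """Run the restart command, e.g. a systemctl call (the command is your choice)."""
    subprocess.run(list(command), check=True)


def apply_role(desired_role: str, env_dir: Path, active_env: Path,
               restart_command: Sequence[str],
               run: Callable[[Sequence[str]], None] = restart_service) -> bool:
    """Copy '<desired_role>.env' over the active ENV file and restart the service.

    Returns True if a switch was performed, False if the node already runs
    with the desired ENV file (so no needless restart happens).
    """
    source = Path(env_dir) / f"{desired_role}.env"  # hypothetical naming scheme
    target = Path(active_env)
    if target.exists() and target.read_bytes() == source.read_bytes():
        return False  # already in the desired role
    shutil.copyfile(source, target)
    run(restart_command)
    return True
```

Comparing the files first keeps the step idempotent, so it is safe to call on every heartbeat response without restarting the node each time.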
Currently this runs in dry-run mode. Some simulated scenarios were handled well.
Advantages of this approach:
- The TIP is a very comprehensive indicator of system health
- Servers may reside in any location (no need to be on the same internal network)
- No inbound connectivity on the server is required; the heartbeat only sends info out, and the response contains the new desired server status
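The last point can be pictured as a simple request and response pair. The sketch below only shows the message handling. The JSON field names (`node_id`, `tip_slot`, `desired_state`) and the state values are assumptions for illustration and not the actual API of the service.

```python
import json

# Hypothetical set of states the service may ask a node to take.
VALID_STATES = {"active", "standby"}


def build_heartbeat(node_id: str, tip_slot: int) -> bytes:
    """Build the outbound heartbeat body: who is reporting and its current TIP."""
    return json.dumps({"node_id": node_id, "tip_slot": tip_slot}).encode("utf-8")


def parse_desired_state(response_body: bytes) -> str:
    """Extract the desired server status from the service's response."""
    state = json.loads(response_body).get("desired_state")
    if state not in VALID_STATES:
        raise ValueError(f"unexpected desired_state: {state!r}")
    return state
```

The body would be sent with an ordinary outbound HTTPS POST (for example via `urllib.request`), so the BP never has to accept a connection from the service.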
Limitations of the approach:
- There may still be other indicators for a bad server status like bad (slow) network connection or time-syncing issues which are not reported to the service currently
- The installation is currently very manual and involves many steps.
If you are interested, check it out on GitHub: GitHub - ResponsibleStaking/Cardano-Heartbeat-Failover: AWS based Cloud Service to take Failover Decis
I’m happy about any Feedback. I know the approach is different to what other SPOs are doing. So if I missed anything, please let me know!
Please note: I do not provide any kind of warranty for the provided code. Use it at your own risk.