Hi,
I’m exploring Cardano Staking Pools universe and I plan to build my own pool, seems I got one in tesnet already(probably you are aware of this, due high amount of messages from me in other threads ). One requirement for successful pool is redundancy, IMHO. As all we know eventually any equipment will fails, so we should be ready for such case. And I have in mind following redundancy/fail-over solution.
So I plan to have 2 relay nodes, what are a very straightforward. Those relay nodes will point to dns name of producer node, lets call this instance as “producer”. Meanwhile I will have one more producer, lets call it “backup-producer” running with same keys and totally same as “producer”. As far as I understand, this is not an issue, until relays are connected just to 1 producer in normal operation it is “producer”.
For redundancy I plan to configure DNS hostnames in AWS and AWS has option to test remote systems with TCP, so in case of “producer” will not reply to TCP connections from AWS, AWS Route53 will switch over DNS names, so now DNS record points towards “backup-producer” and now relays should be able to connect again and no restart or config change is needed on relays, “producer” or “backup-producer”. When TCP is back again working towards “producer”, we switch back DNS name.
What I don’t like is that AWS will do just TCP connection check, probably I would like to have cncli ping
remote servers and in case of no response in 3 or 5 in a row, then we will do switchover. This probably will need one more server like zabbix what can do this and in case of such problem it can fire up script to change DNS on AWS side. But this probably is too much for beginning.
Additionally it is possible to get source ip addresses of AWS servers what will do health check, but it will be quite a lot firewall entries.
We need make sure that no flapping is happening too, so probably we should switch to “backup-producer” only after 5 minutes of time when “producer” is not availlable and back only when “producer” is available for at least 30 minutes or in case when “backup-producer” is down.
I did quite similar setup in my other project, but it was while ago, so I might mixing up something, but generally this should work in my opinion.
I personally like this approach because there is not much configuration needed, just sample config on AWS side and no restart needed of any node - relay, “producer” or “backup-producer”
What you guys think? Any comments or advice are highly welcomed.