The redundancy architecture I have in mind -- please criticize it

Hi,

I’m exploring the Cardano stake pool universe and plan to build my own pool; in fact, I already have one running on testnet (you are probably aware of this, given the number of messages I’ve posted in other threads :slight_smile: ). IMHO, one requirement for a successful pool is redundancy. As we all know, any equipment eventually fails, so we should be prepared for that case. I have the following redundancy/failover solution in mind.

So I plan to have 2 relay nodes, which are very straightforward. Those relays will point to the DNS name of the block producer node; let’s call this instance “producer”. Meanwhile, I will run one more producer, let’s call it “backup-producer”, with the same keys and an identical configuration to “producer”. As far as I understand, this is not a problem as long as the relays are connected to only one producer at a time; in normal operation that is “producer”.
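To make this concrete, here is a minimal sketch of what the relays’ topology file could look like, assuming the legacy (non-P2P) topology format and a hypothetical hostname producer.example.com; the relays only ever reference the DNS name, never an IP:

```json
{
  "Producers": [
    {
      "addr": "producer.example.com",
      "port": 3001,
      "valency": 1
    }
  ]
}
```

Because the relays reference only the DNS name, the failover never has to touch their configuration.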

For redundancy, I plan to configure the DNS hostnames in AWS. AWS health checks can probe remote systems over TCP, so if “producer” stops answering TCP connections, AWS Route 53 will fail the DNS record over so that it points to “backup-producer”. The relays can then connect again, and no restart or config change is needed on the relays, “producer”, or “backup-producer”. When TCP connectivity to “producer” is restored, the DNS name switches back.
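As a sketch of the Route 53 side, assuming boto3 and hypothetical zone IDs, hostnames, and IPs, the TCP health check plus a PRIMARY/SECONDARY failover record pair could be set up roughly like this:

```python
import boto3

route53 = boto3.client("route53")

# TCP health check against the primary producer (IP and port are examples)
hc = route53.create_health_check(
    CallerReference="producer-tcp-check-1",
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",   # hypothetical "producer" IP
        "Port": 3001,
        "Type": "TCP",
        "RequestInterval": 30,         # seconds between checks
        "FailureThreshold": 3,         # failed checks before marked unhealthy
    },
)

common = {"Name": "producer.example.com.", "Type": "A", "TTL": 60}

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",        # hypothetical hosted-zone id
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            **common,
            "SetIdentifier": "primary",
            "Failover": "PRIMARY",
            "HealthCheckId": hc["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "203.0.113.10"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            **common,
            "SetIdentifier": "secondary",
            "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "198.51.100.20"}],  # backup IP
        }},
    ]},
)
```

With failover routing, Route 53 serves the PRIMARY record while its health check passes and the SECONDARY otherwise, so the switch back is also automatic.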

What I don’t like is that AWS only does a TCP connection check. I would rather have cncli ping the remote servers and, if there is no response 3 or 5 times in a row, do the switchover. That would probably require one more server, e.g. running Zabbix, that performs the checks and, on failure, fires a script to change the DNS record on the AWS side (a sketch of such a script is below). But this is probably too much for the beginning.
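A minimal sketch of such a watchdog, assuming cncli is on the PATH, that its JSON output contains a "status" field, and hypothetical hostnames and zone values:

```python
import json
import subprocess
import time

import boto3

PRODUCER_HOST = "producer.example.com"  # hypothetical hostname
PRODUCER_PORT = 3001
MAX_FAILURES = 5                        # consecutive failed pings before switchover
BACKUP_IP = "198.51.100.20"             # hypothetical "backup-producer" IP


def cncli_ping_ok(host: str, port: int) -> bool:
    """Run `cncli ping` and report whether the node completed the handshake."""
    try:
        out = subprocess.run(
            ["cncli", "ping", "--host", host, "--port", str(port)],
            capture_output=True, text=True, timeout=10,
        )
        return json.loads(out.stdout).get("status") == "ok"
    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError):
        return False


def switch_dns_to_backup() -> None:
    """Point the producer record at backup-producer (hypothetical zone id)."""
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "producer.example.com.", "Type": "A", "TTL": 60,
            "ResourceRecords": [{"Value": BACKUP_IP}],
        }}]},
    )


failures = 0
while True:
    if cncli_ping_ok(PRODUCER_HOST, PRODUCER_PORT):
        failures = 0
    else:
        failures += 1
        if failures >= MAX_FAILURES:
            switch_dns_to_backup()
            failures = 0
    time.sleep(30)
```

Zabbix could run the same check as a trigger and call the script as an action, but a plain cron or systemd timer would do for a start.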

Additionally, it is possible to get the source IP addresses of the AWS servers that perform the health checks, but that means quite a lot of firewall entries.

We also need to make sure no flapping happens, so we should probably switch to “backup-producer” only after “producer” has been unavailable for 5 minutes, and switch back only once “producer” has been available for at least 30 minutes, or immediately if “backup-producer” itself goes down.
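The flap damping could be a small state machine on the monitoring host; here is a sketch with the thresholds above, where the four callables (health checks and DNS switch actions) are whatever you plug in:

```python
import time

FAIL_OVER_AFTER = 5 * 60    # "producer" must be down this long before switching
FAIL_BACK_AFTER = 30 * 60   # "producer" must be up this long before switching back


def failover_loop(primary_up, backup_up, to_backup, to_primary, interval=30):
    """Flap-damped failover loop; health checks and actions supplied by caller."""
    on_backup = False
    state = None                          # last observed health of "producer"
    since = time.monotonic()              # when that state was first observed
    while True:
        up = primary_up()
        if up != state:                   # health changed: restart the clock
            state, since = up, time.monotonic()
        held = time.monotonic() - since   # how long the current state has lasted
        if not on_backup and not up and held >= FAIL_OVER_AFTER:
            to_backup()
            on_backup = True
        elif on_backup and up and (held >= FAIL_BACK_AFTER or not backup_up()):
            to_primary()
            on_backup = False
        time.sleep(interval)
```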

I built quite a similar setup in another project, but it was a while ago, so I might be mixing something up; generally, though, this should work in my opinion.

I personally like this approach because not much configuration is needed, just a simple config on the AWS side, and no restart of any node is required - relay, “producer”, or “backup-producer”.

What do you guys think? Any comments or advice are highly welcome.

I have now tested this setup a bit, and everything seems to work well in terms of failing over to the backup producer. The only problem I have is that the relays do not want to switch back to the main “producer” after it comes back online.

It seems the relays cache the resolved IP and never make another DNS request to update it. So if the main producer goes down and traffic is switched over to “backup-producer”, everything works fine. But when the main producer comes back online, the relays stay connected to “backup-producer” and don’t even try to reconnect to the main producer, probably because a working connection is in place and they have no reason to resolve the DNS name again.

I even noticed that after leaving it like that for more than a day, I had one relay connected to the main producer and the second relay connected to “backup-producer”, which seems quite bad, since both nodes hold the same keys and could both try to mint blocks for the same slots.

So the solution would be to restart “backup-producer” once the main producer is online again. Alternatively, we can temporarily close the firewall on “backup-producer”, forcing the relays to reconnect to the main producer, and then remove those firewall rules. The best way would probably be some kind of automation script that runs when the main producer goes offline and again when it comes back online; a sketch of one is below.
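A sketch of the recovery half, which could run on “backup-producer” itself: it waits until the main producer has looked healthy for the fail-back window, then restarts the local node service so the relays drop their connections and re-resolve the DNS name. The unit name cardano-node and the endpoint are assumptions:

```python
import socket
import subprocess
import time

PRIMARY = ("producer.example.com", 3001)  # hypothetical main producer endpoint
STABLE_FOR = 30 * 60                      # primary must look healthy this long
SERVICE = "cardano-node"                  # assumed systemd unit name on the backup


def primary_up() -> bool:
    """TCP-level reachability check against the main producer."""
    try:
        with socket.create_connection(PRIMARY, timeout=5):
            return True
    except OSError:
        return False


healthy_since = None
while True:
    if primary_up():
        if healthy_since is None:
            healthy_since = time.monotonic()
        if time.monotonic() - healthy_since >= STABLE_FOR:
            # Dropping the backup's connections forces the relays to
            # re-resolve the DNS name, which now points at "producer" again.
            subprocess.run(["systemctl", "restart", SERVICE], check=True)
            break
    else:
        healthy_since = None
    time.sleep(60)
```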