Redundancy architecture I have in mind -- please critique it

Hi,

I’m exploring the Cardano staking pool universe and plan to build my own pool; it seems I already have one on testnet (you are probably aware of this, given the high number of messages from me in other threads :slight_smile: ). One requirement for a successful pool is redundancy, IMHO. As we all know, any equipment will eventually fail, so we should be ready for that case. I have the following redundancy/failover solution in mind.

So I plan to have 2 relay nodes, which are very straightforward. Those relays will point to the DNS name of the producer node; let’s call this instance “producer”. Meanwhile I will have one more producer, let’s call it “backup-producer”, running with the same keys and otherwise identical to “producer”. As far as I understand, this is not an issue as long as the relays are connected to only 1 producer at a time, which in normal operation is “producer”.
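For reference, the relay side of this is just a topology entry that uses the DNS name instead of an IP. A minimal sketch in the legacy (non-P2P) topology format, with “producer.example.com” and port 3001 as placeholder values:

```json
{
  "Producers": [
    { "addr": "producer.example.com", "port": 3001, "valency": 1 }
  ]
}
```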

For redundancy I plan to configure the DNS hostnames in AWS. AWS can health-check remote systems over TCP, so if “producer” stops answering TCP connections from AWS, Route 53 will fail the DNS name over: the record then points to “backup-producer”, the relays can connect again, and no restart or config change is needed on the relays, “producer” or “backup-producer”. When TCP towards “producer” works again, the DNS name switches back.
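A rough sketch of how that could look with boto3, assuming a hypothetical hosted zone, example IPs (203.0.113.10/.20) and node port 3001; treat it as an illustration of the TCP health check plus failover record pair, not a finished script:

```python
# Rough sketch (boto3): a TCP health check on the primary producer plus a
# Route 53 failover record pair. The zone ID, hostnames, IPs and port are
# placeholders for illustration only.
import boto3

route53 = boto3.client("route53")

# Health check: probe the primary producer's node port over plain TCP.
hc = route53.create_health_check(
    CallerReference="producer-tcp-check-1",  # must be unique per health check
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",   # primary "producer" (example IP)
        "Port": 3001,
        "Type": "TCP",
        "RequestInterval": 30,         # seconds between probes
        "FailureThreshold": 3,         # consecutive failures before "unhealthy"
    },
)

# Failover pair: relays resolve producer.example.com; Route 53 answers with the
# primary while the health check passes, otherwise with the backup.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "producer.example.com",
            "Type": "A",
            "SetIdentifier": "primary",
            "Failover": "PRIMARY",
            "TTL": 60,  # keep low so relays can re-resolve quickly
            "ResourceRecords": [{"Value": "203.0.113.10"}],
            "HealthCheckId": hc["HealthCheck"]["Id"],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "producer.example.com",
            "Type": "A",
            "SetIdentifier": "secondary",
            "Failover": "SECONDARY",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.20"}],  # "backup-producer"
        }},
    ]},
)
```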

What I don’t like is that AWS only does a TCP connection check. I would rather have cncli ping the remote servers and, if 3 or 5 pings in a row get no response, do the switchover. This would probably need one more server, e.g. Zabbix, which can run the checks and, on failure, fire a script to change the DNS on the AWS side. But that is probably too much for the beginning.
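Something like the following could serve as that probe. It assumes a cncli binary on the PATH whose ping subcommand accepts --host/--port and prints JSON with a “status” field; check your cncli version’s actual flags and output:

```python
# Sketch of a cncli-based probe: the producer counts as down only after several
# consecutive failed pings. Assumes `cncli ping --host ... --port ...` prints
# JSON with a "status" field; verify against your cncli version.
import json
import subprocess
import time

def cncli_ping_ok(host: str, port: int, timeout: int = 10) -> bool:
    try:
        out = subprocess.run(
            ["cncli", "ping", "--host", host, "--port", str(port)],
            capture_output=True, text=True, timeout=timeout,
        )
        return json.loads(out.stdout).get("status") == "ok"
    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError):
        return False

def producer_down(host: str, port: int, attempts: int = 3, wait: int = 30) -> bool:
    """True only if `attempts` pings in a row fail, `wait` seconds apart."""
    for _ in range(attempts):
        if cncli_ping_ok(host, port):
            return False
        time.sleep(wait)
    return True
```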

Additionally, it is possible to get the source IP addresses of the AWS servers that perform the health checks, but that means quite a lot of firewall entries.
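Those health-checker ranges can at least be generated automatically instead of maintained by hand. A small sketch, assuming AWS’s published ip-ranges.json and its ROUTE53_HEALTHCHECKS service tag (worth verifying against the current file):

```python
# Sketch: pull the current Route 53 health-checker source ranges from AWS's
# published ip-ranges.json so firewall entries can be generated rather than
# typed in by hand.
import json
import urllib.request

with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
    data = json.load(resp)

checker_ranges = sorted(
    p["ip_prefix"] for p in data["prefixes"] if p["service"] == "ROUTE53_HEALTHCHECKS"
)
for cidr in checker_ranges:
    # feed these into ufw/iptables/nftables rules for the node port, e.g.:
    print(f"allow tcp from {cidr} to any port 3001")
```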

We also need to make sure no flapping happens, so we should probably switch to “backup-producer” only after “producer” has been unavailable for 5 minutes, and switch back only after “producer” has been available for at least 30 minutes, or if “backup-producer” itself goes down.
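The flapping protection is basically a small state machine around whatever probe is used. A sketch with the 5-minute/30-minute thresholds, where check_primary, check_backup and point_dns_at are placeholders for the ping probe above and a Route 53 record update:

```python
# Sketch of the anti-flapping logic: fail over only after the primary has been
# down for 5 minutes; fail back only after it has been up for 30 minutes, or
# immediately if the backup itself dies.
import time

DOWN_GRACE = 5 * 60    # seconds the primary must be down before failing over
UP_GRACE = 30 * 60     # seconds the primary must be up before failing back
POLL = 30              # seconds between checks

def failover_loop(check_primary, check_backup, point_dns_at):
    active = "primary"
    primary_down_since = primary_up_since = None
    while True:
        now = time.time()
        if check_primary():
            primary_down_since = None
            primary_up_since = primary_up_since or now
        else:
            primary_up_since = None
            primary_down_since = primary_down_since or now

        if active == "primary" and primary_down_since and now - primary_down_since >= DOWN_GRACE:
            point_dns_at("backup")
            active = "backup"
        elif active == "backup" and (
            (primary_up_since and now - primary_up_since >= UP_GRACE) or not check_backup()
        ):
            point_dns_at("primary")
            active = "primary"
        time.sleep(POLL)
```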

I did a quite similar setup in another project, but it was a while ago, so I might be mixing something up; generally, though, this should work in my opinion.

I personally like this approach because not much configuration is needed, just a simple config on the AWS side, and no restart of any node is required - relay, “producer” or “backup-producer”.

What do you guys think? Any comments or advice are highly welcome.

I tested this setup a bit, and everything seems to work well in terms of failover to the backup producer. The only problem I have now is that the relays do not want to switch back to the main “producer” after it comes back online.

It seems the relays save the resolved IP and never make another DNS request to update it. So if the main producer is down and traffic is switched over to backup-producer, everything works OK. But when the main producer comes back online, the relays are still connected to backup-producer and don’t even bother to reconnect to the main producer, probably because the connection is already in place and they have no reason to make another DNS request.

I even noticed that after leaving it like this for more than a day, I had one relay connected to the main producer and the other relay connected to backup-producer, which seems quite bad.

So the solution would be to restart the backup producer once the main producer is online again. Alternatively, we can manually close the firewall on backup-producer to force the relays to connect back to the main producer, and then remove those FW rules afterwards. The best way would probably be some kind of automation script triggered when the main producer goes offline and when it comes back online.
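The restart part of that script could be as small as this sketch, assuming a systemd unit named cardano-node on the backup host and SSH access to it (both names are placeholders):

```python
# Sketch of the failback action: once the main producer answers again, bounce
# the backup producer's node service so the relays drop their cached connection
# and go back through DNS to the main producer.
import subprocess

def force_relays_back_to_main(backup_host: str = "backup-producer.example.com") -> None:
    # Restarting the service closes the long-lived TCP sessions the relays hold.
    subprocess.run(
        ["ssh", backup_host, "sudo", "systemctl", "restart", "cardano-node"],
        check=True,
    )
```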

For now I have decided to go with HAProxy.

So I have 2 HAProxies and behind them there are 2 BPs. The pool of relays connects to those HAProxies, and HAProxy then forwards the connections to the BPs. All traffic should go to main-BP, and if main-BP is down, HAProxy should fail over to backup-BP.
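For reference, the backend definition looks roughly like this (IPs, ports and timings are placeholders; check the option names against your HAProxy version). The `backup` keyword keeps backup-BP idle while main-BP passes its check, and `on-marked-up shutdown-backup-sessions`, if your version supports it, should kick sessions off the backup once main-BP is marked up again, which is relevant to the reconnection behaviour described below:

```
global
    # stats socket enables the runtime queries used by the split-brain check below
    stats socket /var/run/haproxy.sock mode 600 level admin

defaults
    mode tcp
    timeout connect 5s
    timeout client  1h
    timeout server  1h

frontend bp_in
    bind *:3001
    default_backend block_producers

backend block_producers
    option tcp-check
    # all traffic goes to main-bp while its check passes; backup-bp only gets
    # connections while main-bp is marked down
    server main-bp   10.0.0.10:3001 check inter 2s fall 3 rise 2 on-marked-up shutdown-backup-sessions
    server backup-bp 10.0.0.20:3001 check inter 2s fall 3 rise 2 backup
```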

That setup seems to work fine, but sometimes a split brain happens, where the 1st HAProxy connects to main-BP but the second to backup-BP. That issue must be addressed. I personally have monitoring in Grafana (HAProxy exports metrics to Prometheus, which is super useful), and if such an issue happens I receive an alert. In an ideal world it would be nice to automatically restart the second HAProxy in such a case, so the connection is re-established to main-BP without human intervention; that is probably my next task.
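A sketch of that automation, assuming the stats socket is enabled on each HAProxy (the “stats socket” line in the config above) and the backend/server names from that config; the socket path is a placeholder:

```python
# Sketch of a split-brain check: ask an HAProxy instance, via its stats socket,
# whether backup-bp still holds sessions while main-bp is UP. If so, that
# instance can be restarted so it re-establishes the connection to main-bp.
import csv
import io
import socket

def show_stat(socket_path: str) -> list:
    """Return HAProxy 'show stat' output (CSV) as a list of dict rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(b"show stat\n")
        raw = b""
        while chunk := s.recv(4096):
            raw += chunk
    # The CSV header line starts with "# ".
    return list(csv.DictReader(io.StringIO(raw.decode().lstrip("# "))))

def stuck_on_backup(socket_path: str = "/var/run/haproxy.sock") -> bool:
    rows = {r["svname"]: r for r in show_stat(socket_path) if r["pxname"] == "block_producers"}
    main_up = rows["main-bp"]["status"].startswith("UP")
    backup_busy = int(rows["backup-bp"]["scur"] or 0) > 0
    return main_up and backup_busy
```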

One more point to remember: the connections to the BPs are long-lived. If, for example, HAProxy sees that main-BP is down, it opens a connection to backup-BP, and that connection stays open indefinitely, so no switchover back to main-BP happens until we restart HAProxy, for example.

I got inspired by this git issue:

Everything is a trade-off:

  • How much stake is your pool likely to control?
  • How many blocks is your pool likely to produce per epoch?

If your pool will likely only produce a few blocks per epoch or less then maybe some restart time can be factored in. Cardano-node software is very stable and runs for many days without any noticeable memory leaks or crashes. I only restart my nodes to load new topology files and I can easily pick a window between blocks to do this. Furthermore, my internet connection has something like >99.9% uptime. You will likely lose more blocks to slot battles than internet downtime or server failures.

Do you think the extra redundancy you are designing and its real world effectiveness will be worth it?

It is definitely worth it, IMHO. Just recently my hosting company (OVH) had downtime for a couple of hours. Without a proper HA setup I probably would have been super stressed and maybe even lost a block.

I have no problem with the Cardano software at all, but hardware will fail at some point, and even 99.9% is not good enough, at least for me. Keep in mind that the smaller your stake, the fewer blocks you get, and each of those blocks becomes more and more valuable. Imagine getting one block in a couple of months and the internet disappearing exactly at that time. If a big pool loses 1 block, it just means their return for the epoch is slightly lower than expected. Now imagine your stake is 100k and you lose the block you waited 10 epochs for.

I have some stake in my pool, big thanks to the Cardano Foundation for that.


Everything is relative. The risk of a lost block due to a slot battle or a “propagation delay battle” is around 2-5%. Sometimes I think the fear of missing a block is higher than the actual statistics of how much downtime you get with your internet / hardware. I truly understand how valuable each block is for a small pool, but what percentage downtime do you actually get? What is your expected hardware failure rate? What is your expected internet failure rate?

Regarding internet failure:
You can use a backup network connection continuously. For example, you can even get a usb device like this: USB 4G LTE-Advanced Modem for GNU/Linux (TPE-USB4GLTE) | ThinkPenguin.com

This can circumvent most of your internet failure risk with zero switching time loss. I have a relay connecting to my block producer over two different network links in case one goes down.

Depending on how you do things, if you use firewalling techniques to switch between backups you will still need some sort of recognition system to trigger the change. This “recognition” will likely have a delay of a minute or two. It won’t be as fast as a continuous backup network connection. Even the simple usb mobile broadband device from thinkpenguin might be better.

Regarding hardware failure:
At a pinch, you can always re-purpose a relay to be a block producer. This will only cause a couple of minutes re-start delay before it will be ready. This will require manual intervention from you, but how often will you need to do it?

I don’t think it is a good idea to bring up slot battles here, since they can’t really be mitigated; I’m more concerned with things I can fix rather than things I can’t.

In any case, I personally have experienced issues with my internet provider and my hosting before, at the same time as I was scheduled for a block, so in my experience you should be ready.

HAProxy has almost instant failover: it tracks the TCP connection all the time, and when TCP is down it switches over. It is definitely faster than re-starting a relay as a BP.

For me personally, HAProxy provides a much simpler, easier and cheaper solution than an additional 4G modem or a relay that can be repurposed as a BP; if those work for you, I have no problem with that.

Additionally, I would like to add that for me it is important to have full infrastructure redundancy in several different locations. Imagine the house with your BP and relays catches fire, or all your equipment is stolen; your 4G modem will not help with that, so you might want to put that in your plan too.