Cardano-node healthchecks

Is there a way to perform a health-check on a cardano-node? I’m trying to set DNS failover to point to another instance whenever the block producer goes down for maintenance or unknown reasons.

Also, if one block producer goes down and another one picks up block production, will that cause any block validation issues?

You could use Grafana with the node exporter and the scripts explained here:

Hi!

If you query the tip and store it, then query again after 1 minute and compare the result with the previous value…
If the numbers are the same, something has happened…

With a 1 minute window this would trigger quite often. On average a block is generated every 20 seconds, and sometimes a block is missed, so gaps of more than a minute are not unusual. But in general, checking the TIP is a very comprehensive approach: almost any issue with the node will show up as a TIP that stops advancing.
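
To put a rough number on "quite often", here is a small back-of-the-envelope sketch. It assumes block arrivals behave like a Poisson process with the 20 second average quoted above, which is a simplification and not something measured on a real node. The function names are made up for illustration.

```python
import math

MEAN_BLOCK_INTERVAL_S = 20.0  # average quoted in the post above


def prob_no_block(window_s, mean_interval_s=MEAN_BLOCK_INTERVAL_S):
    """Chance of seeing no new block within window_s seconds.

    Assumption: block arrivals follow a Poisson process.
    """
    return math.exp(-window_s / mean_interval_s)


def expected_false_alarms_per_day(window_s):
    """Expected number of 'tip unchanged' results per day on a healthy node,
    if the check runs once per window."""
    checks_per_day = 86400 / window_s
    return checks_per_day * prob_no_block(window_s)


if __name__ == "__main__":
    for window in (60, 300):
        print(window, prob_no_block(window), expected_false_alarms_per_day(window))
```

Under that assumption a 1 minute window comes up empty in roughly 5% of checks, which is dozens of false alarms per day, while a 5 minute window almost never does.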

In this post I described how I’m running my TIP check. If it does not report updates for a defined period of time, I consider the node offline.
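
For illustration only (this is not the script from the linked post), a TIP check along those lines could look like the sketch below. It assumes that `cardano-cli query tip --mainnet` prints JSON with a `slot` field and that `CARDANO_NODE_SOCKET_PATH` is set; the `TipWatch` class, the 300 second threshold and the 60 second polling interval are arbitrary choices of mine.

```python
import json
import subprocess
import time


def query_tip_slot():
    """Return the tip slot reported by the local node.

    Assumption: `cardano-cli query tip --mainnet` prints JSON that contains
    a "slot" field, and CARDANO_NODE_SOCKET_PATH is set in the environment.
    """
    out = subprocess.run(
        ["cardano-cli", "query", "tip", "--mainnet"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(json.loads(out)["slot"])


class TipWatch:
    """Remembers when the tip last changed and flags a stale node."""

    def __init__(self, max_age_seconds=300):
        self.max_age_seconds = max_age_seconds
        self.last_tip = None
        self.last_change = None

    def observe(self, tip, now):
        """Record one observation; return True while the node looks healthy."""
        if self.last_tip is None or tip != self.last_tip:
            self.last_tip = tip
            self.last_change = now
        return (now - self.last_change) <= self.max_age_seconds


if __name__ == "__main__":
    watch = TipWatch(max_age_seconds=300)
    while True:
        try:
            healthy = watch.observe(query_tip_slot(), time.monotonic())
        except (subprocess.CalledProcessError, OSError, ValueError, KeyError):
            healthy = False  # node not reachable or output not understood
        print("OK" if healthy else "NOK")
        time.sleep(60)
```

Keeping the comparison logic in `observe` separate from the CLI call makes it easy to try out with made-up tip values before pointing it at a real node.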

If you want to automate the failover, it is important that after the switchover the old server also knows that it should not come up as a block producer again. E.g. if it was a networking issue that is resolved later, the node should find out that it is no longer the active node and must not start as a validator.

@zwirny are you automating the failover, or is it a manual step?

I’m currently running my pool in AWS and I don’t think the ELB allows you to customize the failover. Currently I’m looking at other solutions. I’m curious to know how you were able to create a reliable failover solution.

I think the Amazon ELB is able to call a healthcheck URL, which needs to indicate the status of the server through a response code. But just switching the origin of the ELB is not the solution. If the other server is still running, or comes up later and runs in parallel with the failover node, you have 2 block producers at the same time, which will lead to slot battles between them.
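
As a rough sketch of the "status through a response code" part only, a tiny endpoint built on Python's standard `http.server` could look like this. The port, the `health_response` helper and the placeholder callback are assumptions for illustration; it does nothing about the problem of two block producers running in parallel.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(tip_is_current):
    """Map the node's health to an HTTP status code and a short body."""
    if tip_is_current:
        return 200, b"OK\n"
    return 503, b"NOK\n"


def make_handler(is_healthy):
    """Build a request handler around a zero-argument health callback."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = health_response(is_healthy())
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress the default per-request logging

    return HealthHandler


if __name__ == "__main__":
    # Placeholder callback: replace with a real tip check.
    HTTPServer(("0.0.0.0", 8080), make_handler(lambda: True)).serve_forever()
```

A load balancer health check would then treat 200 as healthy and 503 as unhealthy.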

In my opinion you do not need a LB. Both machines can be running at the same time, but the standby needs to be configured to start as a relay. In case of a switchover it is restarted as a validator.

I have documented the steps which need to be done in this case and tried them out as part of my update process. I have not fully automated it so far, but I am planning to do so.

Concept for the automated failover:

  • Active BP and Standby BP both report at a 1 min interval whether their TIP is current (not older than 1 min) → e.g. to Healthchecks.io
  • If one of the nodes does not report a current TIP for 5 minutes, its check turns to NOK
  • A small cloud function reads the status of both checks and decides which of the servers is currently the active node (if the current active node is NOK and the other one is online, it defines the other as active; otherwise it stays as it is)
  • Every node asks the cloud function every minute which status it should have. If the status does not match its current run mode (relay or validator), it restarts in the corresponding mode.
  • Start scripts need to be modified to ask the cloud function as well before starting, and to switch configuration if necessary.
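
The decision rule in the list above can be sketched in a few lines. This is only an illustration of the logic, not a finished cloud function: the node names, the `checks_ok` dictionary and the helper names are hypothetical, and fetching the check status from a service like Healthchecks.io is left out.

```python
def decide_active(current_active, checks_ok):
    """Return the name of the node that should be the block producer.

    current_active: name of the node currently designated as active.
    checks_ok: dict mapping node name -> True if its TIP check is OK.
    """
    if checks_ok.get(current_active, False):
        return current_active  # active node is fine: nothing changes
    for name, ok in checks_ok.items():
        if name != current_active and ok:
            return name  # fail over to a healthy standby
    return current_active  # nobody is healthy: leave it as it is


def desired_mode(node_name, active_name):
    """Run mode a node should have, given who is active."""
    return "validator" if node_name == active_name else "relay"


def needs_restart(node_name, active_name, current_mode):
    """True if the node is running in the wrong mode and must restart."""
    return desired_mode(node_name, active_name) != current_mode
```

For example, `decide_active("bp1", {"bp1": False, "bp2": True})` returns `"bp2"`, and once `bp1` recovers it stays a relay because `bp2` remains active for as long as its own check is OK. One design choice worth considering, as an assumption on my side: a node that cannot reach the cloud function at all should probably treat that as "run as relay", so an isolated old BP does not keep producing blocks.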

That sounds exactly like what should happen in the case of a failover. I was thinking about creating a custom implementation like the one described above, but I was also looking at using HAProxy and keepalived. I believe that already solves the problem out of the box, but I’m not sure about the last step: switching the old BP to a relay after it comes back up.

Can you share the scripts you have developed for the failover?

For sure there is something out there which can do this OOTB.
I’m struggling with the keepalived approach because I do not want to switch any IP.
I want to have both nodes running all the time. I just want to restart the failover node as the master (BP) if the old master is not available any more (for whatever reason, including no network connectivity).
It is still on my todo list. If someone comes across an already working approach, I’d also be happy :slight_smile:

I am using noip to point at my producer. I am thinking this could be a way of avoiding changing IPs, as noip could be started on the failover node.

In the meantime I worked on a custom failover approach, which I shared here: Block Producer - Failover Approach with a BP Standby
