Cardano-node healthchecks

Luis_Betancourt · 28 May 2021 17:17

Is there a way to perform a health check on a cardano-node? I’m trying to set up DNS failover to point to another instance whenever the block producer goes down for maintenance or for unknown reasons.

Also, if one block producer goes down and another one picks up block production, will that cause any block validation issues?

ADA4Good · 28 May 2021 19:51

You could use Grafana with the node exporter and scripts explained here:

laplasz · 28 May 2021 20:13

Hi!

If you query the tip and store it, then query it again after 1 minute and compare it with the previous value:
if the numbers are the same, something has happened.
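
A minimal sketch of that comparison in Python. It assumes `cardano-cli query tip --mainnet` prints JSON with a `block` field; the field name has varied between CLI versions, so treat `TIP_KEY` as an assumption and adjust it to your output. The function names are made up for illustration.

```python
import json
import subprocess
import time

TIP_KEY = "block"  # assumption: key holding the block number in the tip JSON


def parse_tip(raw_json: str, key: str = TIP_KEY) -> int:
    """Extract the block number from the JSON printed by the tip query."""
    return int(json.loads(raw_json)[key])


def tip_advanced(previous: int, current: int) -> bool:
    """True if the chain tip moved forward between two samples."""
    return current > previous


def query_tip() -> int:
    """Ask the local node for its tip (CARDANO_NODE_SOCKET_PATH must be set)."""
    result = subprocess.run(
        ["cardano-cli", "query", "tip", "--mainnet"],
        capture_output=True, text=True, check=True,
    )
    return parse_tip(result.stdout)


def check_once(interval_s: int = 60) -> bool:
    """Sample the tip twice, interval_s seconds apart, and compare the values."""
    before = query_tip()
    time.sleep(interval_s)
    return tip_advanced(before, query_tip())
```

`check_once()` returns `False` when the block number did not change during the interval, which is the "something has happened" case.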

Markus-VITAL · 29 May 2021 22:09

This would trigger quite often. On average a block is generated every 20 seconds, but sometimes a block is missed and the gap is longer. In general, though, checking the TIP is a very comprehensive approach: almost any issue would be identified by it.
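
To put a rough number on "quite often", here is a back-of-the-envelope calculation. It assumes an idealised model in which every one-second slot independently contains a block with probability 0.05 (which matches the 20-second average above) and ignores propagation delays and forks.

```python
ACTIVE_SLOT_COEFF = 0.05  # assumption: chance that a given slot has a block
SLOT_LENGTH_S = 1         # assumption: one-second slots


def p_no_block(window_s: int, f: float = ACTIVE_SLOT_COEFF) -> float:
    """Probability that no slot in the window produces a block."""
    slots = window_s // SLOT_LENGTH_S
    return (1 - f) ** slots


def false_alarms_per_day(window_s: int) -> float:
    """Expected number of windows per day without a new block on a healthy chain."""
    windows_per_day = 24 * 60 * 60 / window_s
    return windows_per_day * p_no_block(window_s)


if __name__ == "__main__":
    for window in (60, 300):
        print(window, p_no_block(window), false_alarms_per_day(window))
```

Under these assumptions roughly 4.6% of one-minute windows contain no block, which is about 66 false alarms per day for a check that runs every minute. With a five-minute window the probability drops to around 2 in 10 million, so a longer timeout removes nearly all of the noise.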

In this post I described how I’m running my TIP Check. If it’s not reporting updates for a defined time period I consider the node offline.

If you want to automate the failover, it is important that after the switchover the old server also knows that it should not come up again as a block producer. E.g. if the outage was a networking issue that is resolved later, the node should find out that it is no longer the active node and must not start as a validator.

Luis_Betancourt · 1 June 2021 19:46

@Markus-VITAL are you automating the failover, or is it a manual step?

Luis_Betancourt · 1 June 2021 20:05

I’m currently running my pool in AWS and I don’t think the ELB allows you to customize the failover. Currently I’m looking at other solutions. I’m curious to know how you were able to create a reliable failover solution.

Markus-VITAL · 1 June 2021 21:47

I think the Amazon ELB is able to call a healthcheck URL, which needs to indicate the status of the server through its response code. But just switching the origin of the ELB is not the solution. If the other server is still running, or comes up later and runs in parallel with the failover, you have 2 block producers at the same time, which will lead to slot battles between them.
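
A healthcheck URL of that kind could look like the sketch below. It is only an illustration: the 5-minute limit, port 8080 and the `last_tip_change` variable are assumptions, and a separate tip-polling loop (not shown) would have to update that timestamp whenever the block number increases. As noted above, such an endpoint only reports status; it does not prevent two producers from running in parallel.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_TIP_AGE_S = 300  # assumption: unhealthy after 5 minutes without a new block

# Hypothetical shared state, updated elsewhere each time the tip advances.
last_tip_change = time.time()


def status_code(last_change: float, now: float, max_age_s: int = MAX_TIP_AGE_S) -> int:
    """200 while the tip is fresh, 503 once it has been stale for too long."""
    return 200 if now - last_change <= max_age_s else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The response code carries the health status for the load balancer.
        code = status_code(last_tip_change, time.time())
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n" if code == 200 else b"stale tip\n")


def serve(port: int = 8080) -> None:
    """Run the health endpoint until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the load balancer would then be pointed at that port.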

In my opinion you do not need a LB. Both machines can be running at the same time, but the standby needs to be configured to start as a relay. In case of a switchover it restarts as a validator.

I have documented the steps which need to be done in this case and tried them as part of my update process. I have not fully automated it so far, but I am planning to do so.

Markus-VITAL · 1 June 2021 22:12

Concept for the automated failover:

Active BP and Standby BP both report at a 1 min interval whether their TIP is current (not older than 1 min) → e.g. to Healthchecks.io
If one of the nodes has not reported a current TIP for 5 minutes, its check turns to NOK
A small cloud function reads the status of both checks and defines which of the servers is currently the active node (if the current active node is NOK and the other is online, it defines the other as active; otherwise it stays as it is)
Every node asks the cloud function every minute which status it should have. If the status does not match its current run mode (relay or validator), it restarts in the corresponding mode.
Start scripts need to be modified to ask the cloud function as well before starting, and to switch configuration if necessary.
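
The decision logic in this concept can be sketched as below. This is only an illustration, not the actual implementation: the names `ClusterState`, `decide_active`, `desired_mode` and `needs_restart` are hypothetical, and the health flags are assumed to come from the two checks described above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterState:
    active: str   # node currently designated as block producer
    standby: str  # the other node, running as a relay


def decide_active(state: ClusterState, healthy: Dict[str, bool]) -> ClusterState:
    """Swap roles only if the active node is NOK and the standby is OK."""
    active_ok = healthy.get(state.active, False)
    standby_ok = healthy.get(state.standby, False)
    if not active_ok and standby_ok:
        return ClusterState(active=state.standby, standby=state.active)
    return state  # otherwise it stays as it is


def desired_mode(node: str, state: ClusterState) -> str:
    """Run mode a node should be in according to the cloud function."""
    return "validator" if node == state.active else "relay"


def needs_restart(node: str, current_mode: str, state: ClusterState) -> bool:
    """True if the node's run mode does not match what the cloud function says."""
    return current_mode != desired_mode(node, state)
```

Because roles only swap when the active node is NOK and the standby is OK, an old block producer that recovers later is told to run as a relay and is not promoted back automatically, which avoids two validators running in parallel. The state would have to be stored somewhere both nodes can reach, and a node that cannot reach the cloud function at all would need its own rule for what to do, which this sketch leaves open.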

Luis_Betancourt · 1 June 2021 22:28

That sounds exactly like what should happen in the case of a failover. I was thinking about creating a custom implementation like the one mentioned above, but I was also looking at using HAProxy and keepalived. I believe that already solves the problem out of the box, but I’m not sure about the last step: switching the old BP to a relay after it comes back up.

Lionel · 3 June 2021 13:05

Can you share the scripts you have developed for the failover?

Markus-VITAL · 4 June 2021 22:56

For sure there is something out there which can do this OOTB.
I’m struggling with the keepalived approach because I do not want to switch any IP.
I want to have both nodes running all the time. I just want to restart the failover node as the master (BP) if the old master is not available any more (for whatever reason, including no network connectivity).
It is still on my todo list. If someone comes across an already working approach, I’d also be happy.

Lionel · 7 June 2021 09:50

I am using noip to point at my producer, and I am thinking this could be a way of avoiding changing IPs, as noip could be started on the failover node.

Markus-VITAL · 11 August 2021 20:50

In the meantime I worked on a custom failover approach, which I shared here: Block Producer - Failover Approach with a BP Standby
