Good monitoring metrics for a "healthy" coreNode/relayNode?

Hi,

To prepare some scripting for MainNet / HTN operation: which parameters do you think are worth monitoring automatically to decide whether the node is healthy? The goal is to produce warning messages and, for example, restart the node automatically.

  • Slot/block height growing within a given period of time
  • Block production of the coreNode itself
  • CPU and RAM usage
  • KES/Opcert periods valid

what else?
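
For the KES/opcert item in the list above, here is a minimal sketch of how such a check could work. It assumes you already know the current slot and the KES start period stored in your operational certificate. The function names and the warning threshold are made up for illustration, and `slots_per_kes_period` and `max_kes_evolutions` should be taken from the genesis file of the network you run on.

```python
def remaining_kes_periods(current_slot: int,
                          slots_per_kes_period: int,
                          opcert_start_kes_period: int,
                          max_kes_evolutions: int) -> int:
    """Return how many KES periods are left before the opcert expires.

    A result <= 0 means the certificate has already expired.
    """
    current_kes_period = current_slot // slots_per_kes_period
    expiry_kes_period = opcert_start_kes_period + max_kes_evolutions
    return expiry_kes_period - current_kes_period


def kes_status(remaining: int,
               max_kes_evolutions: int,
               warn_threshold: int = 5) -> str:
    """Map the remaining periods to a simple status string.

    warn_threshold is an arbitrary illustrative default, not a recommendation.
    """
    if remaining <= 0:
        return "expired"
    if remaining > max_kes_evolutions:
        # The current KES period is before the opcert start period.
        return "not_yet_valid"
    if remaining <= warn_threshold:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative values only; read the real ones from your genesis file.
    left = remaining_kes_periods(
        current_slot=1_296_000,
        slots_per_kes_period=129_600,
        opcert_start_kes_period=8,
        max_kes_evolutions=62,
    )
    print(left, kes_status(left, max_kes_evolutions=62))
```

A script could run this periodically and send a warning once the status is no longer "ok", well before the key actually expires.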

I would monitor rejected blocks and maybe connected relay node count

I would say that if the slot number/block number is not increasing within a given interval, that check would already cover this automatically?
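
A rough sketch of that "not increasing within a given interval" check could look like the following. `TipWatchdog` is a hypothetical helper, not part of any existing tool, and how you obtain the block height (for example from the node's tip query or its metrics endpoint) is left to your setup.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TipWatchdog:
    """Flags the node as stalled if the block height stops growing."""
    max_stall_seconds: float
    last_height: Optional[int] = None
    last_change: Optional[float] = None

    def observe(self, height: int, now: Optional[float] = None) -> bool:
        """Record one sample; return True if healthy, False if stalled."""
        now = time.monotonic() if now is None else now
        if self.last_height is None or height > self.last_height:
            # First sample or real progress: reset the stall timer.
            self.last_height = height
            self.last_change = now
            return True
        # No progress (equal or lower height, e.g. after a rollback):
        # healthy only while we are still inside the allowed window.
        return (now - self.last_change) <= self.max_stall_seconds


if __name__ == "__main__":
    watchdog = TipWatchdog(max_stall_seconds=300)
    # Timestamps are passed explicitly here to keep the example deterministic.
    print(watchdog.observe(100, now=0.0))    # first sample
    print(watchdog.observe(100, now=200.0))  # no growth, still inside window
    print(watchdog.observe(100, now=301.0))  # no growth for too long
```

The allowed window should be generous enough to tolerate normal gaps between blocks. If `observe` returns False, the script can raise a warning or trigger a restart.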

Rejected blocks, yes, but isn't that more a performance parameter than a health parameter?

Here is what I track on the core node:


I like your live view, but I am looking for parameters that monitor the node directly on the machine itself and trigger alarm actions. :slight_smile:

What did you end up using as a health metric, @ATADA?

Hello, do you have a guide for setting up this monitoring?

I personally made my own monitoring tools… but they are not publicly available. But you can also look at our StakePoolOperator Tools Alliance site here:

You can use the simpleLiveView from Adam or gLiveView from the Guild to check your running node.
