What specs would you guys recommend for the monitoring server (memory, vcpu & disk space)?
The server will be responsible for ETL of all data related to the nodes (2 relays and 1 BP), their environments and the blockchain.
It will be running things like Prometheus, Grafana, DBSync (might require a node).
I recommend you check out the CoinCashew guide, "Guide: How to build a Cardano Stake Pool". It’s an excellent guide. Just today I finished configuring the pool on the testnet, with Grafana and Prometheus. You will see a section in the guide dedicated to this last point. NOTE: no special node is required (Grafana is installed on one of the relay nodes).
Including the effect that if you lose this node for whatever reason, your monitoring (which should make you aware of this) is dead as well… doh
I too suggest going with a separate, cheap node hosting Grafana and Prometheus. Reasons: mostly what @hanswurst mentioned, plus no need to expose additional ports and run additional software on the Cardano node, which might affect security (more attack surface).
Thanks for the link.
I have nodes running in the TestNet with Prometheus and Grafana all up and working so all is good.
I am busy designing my MainNet architecture and would like to keep monitoring software off my nodes, but I read somewhere that DBSync should run on the same server as the node because the amount of data written is too much for a LAN.
Do you think a cheap (low spec) node will suffice? Are these things not quite resource intensive?
The comment above applies to Prometheus and Grafana, especially the Grafana frontend. The node would be just a metrics aggregator. My monitoring node consumes about 300MB of RAM. The load is very low too. The disk space may be a factor due to how much data is being stored. I suggest you give it a try and see how it goes. You still need to run the Prometheus metrics exporters on all nodes that are monitored.
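To make the "metrics aggregator" idea a bit more concrete, here is a minimal Python sketch of what reading one exporter page involves. It only assumes the standard Prometheus text format (`name value` lines, `#` comments). The metric names in the sample and the function names are made up for illustration, and the URL you would pass to `fetch_metrics_text` depends entirely on how your exporters are configured.

```python
from urllib.request import urlopen


def fetch_metrics_text(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics page from an exporter (URL is setup-specific)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text-format lines into {metric name (with labels): value}."""
    metrics = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            head, _, rest = line.rpartition("}")
            name = head + "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            # The first field is the value; an optional timestamp may follow.
            metrics[name] = float(fields[0])
        except ValueError:
            continue
    return metrics


if __name__ == "__main__":
    # Hypothetical sample page; real metric names will differ.
    sample = "# TYPE example_tx_processed_total counter\nexample_tx_processed_total 1234\n"
    print(parse_metrics(sample))
```

Prometheus does this scraping for you on a schedule. The sketch is only meant to show that the exporters on the monitored nodes serve plain text and that the monitoring node does little more than collect and store it.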
Thanks, I will start low and see.
I am doing this in AWS. Do you think a server with 2 vCPUs and 4 GB of memory will be enough?
How much disk space does it use up?
That should be more than enough to run Prometheus and Grafana. The disk space depends on the retention time and number of metrics. I don’t have disk space shortages but let’s say for a 15 day retention time I think you should be good with 15+GB (including the OS).
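If you want a rough number for the Prometheus data itself, the Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes per sample on average). A small Python sketch of that arithmetic follows. The series count and scrape interval are assumptions for illustration, not measurements from a real pool.

```python
SECONDS_PER_DAY = 86_400


def estimate_prometheus_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rule-of-thumb estimate: retention * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 3 nodes, 500 time series each, scraped every 15 seconds.
    nodes, series_per_node, scrape_interval_s = 3, 500, 15
    rate = nodes * series_per_node / scrape_interval_s
    size = estimate_prometheus_disk_bytes(retention_days=15, samples_per_second=rate)
    print(f"{size / 1e9:.2f} GB")
```

Under these assumed numbers the metrics data stays well below 1 GB, so most of the 15+GB would go to the OS and headroom. Plug in your own series count and retention to check.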
Actually, I decided to use some simple Python scripts for monitoring, together with sending alarms by email. That way, I don’t need to watch Grafana graphs and can still step in if something fails.
Currently, I’m monitoring tx processed on the relay and the node, and return codes from topology-update. I used to monitor missed slots as well, but that’s no longer a major concern.
Recently, I’ve been getting false alarms because the relay does not restart within 5 minutes after topology updates take effect. It seems that the Alonzo-enabled versions of cardano-node take longer to restart.
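For anyone who wants to try the script-plus-email approach, here is a minimal sketch of the restart check with a configurable grace period, so the 5-minute window can be widened if newer node versions restart more slowly. This is not the script mentioned in this thread. The function names, timestamps, email addresses and SMTP host are all hypothetical, and how you detect that the relay is back up is left to your setup.

```python
import smtplib
from email.message import EmailMessage
from typing import Optional


def restart_overdue(
    update_ts: float,
    up_since_ts: Optional[float],
    now_ts: float,
    grace_seconds: float = 300.0,
) -> bool:
    """True if the relay has not come back up within the grace period.

    update_ts:   when the topology update took effect (epoch seconds)
    up_since_ts: when the relay was last seen starting up, or None if unknown
    """
    restarted = up_since_ts is not None and up_since_ts >= update_ts
    return (not restarted) and (now_ts - update_ts) > grace_seconds


def build_alert(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text alarm email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost", smtp_port: int = 25) -> None:
    """Send the alarm through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

A cron job could call `restart_overdue` with a larger `grace_seconds` (for example 600) and only call `send_alert` when it returns `True`, which should cut down on false alarms from slow restarts.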
If someone is interested, there’s an example Python script for topology-updater on GitHub: