Here are some things I needed to do to feel comfortable taking my eyes off the stake pool node log files every once in a while, after the initial days of staring at them like that creepy guy in The Matrix:
- Make browser bookmarks to the general performance metrics for each node's cloud resources (like the DigitalOcean droplet graphs, a bit simple-minded but effective; or the AWS CloudWatch pages, the free ones rather than the detailed paid upgrade) to be sure RAM or CPU use isn’t clipping at the maximum for any of the instances.
- Install a systemd script that restarts the node in a way that is absolutely bulletproof. Keep trying to kill that node by sending it every signal in the book & see if you can get it to fail to restart. Keep rebooting the server to see if you can make cardano-node fail to start.
- Make sure nothing delays restarting the server itself. Ubuntu, for instance, will wait out a 90-second timeout if any systemd service (usual suspect = cardano-node) doesn’t “exit gracefully” when sent a kill signal. You can change that timeout from 90 seconds to 2 seconds safely on an SSD-based system, and then your AWS node or DO droplet can bounce up & down like a basketball: with an SSD root, 20 to 25 seconds for a full reboot including starting cardano-node.
- Keep reading log files in SimpleView until you can recognise signs of trouble, like troublesome peers in topology.json. I started keeping a file called shitlist in the config directory and adopted a 3-strikes-and-you’re-out approach for repeat appearances of “connection refused” for any relay in the log files. (This manual part of the process will go away in the near future when the promised topology management by the node itself comes about.)
- grep (filter) out the log messages which are superfluous (by our criteria: empty block lists from our core node, and transaction rejections & removals on all nodes). The rest of the log messages, if topology.json is clean, will be just the advancements of the tip & confirmations of transactions, which provide a visual confirmation that the whole thing is working properly.
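The 3-strikes bookkeeping from the list above could be roughed out in shell along these lines. This is only a sketch under assumptions: the function name count_strikes is hypothetical, and it assumes the first IPv4:port pair on each “connection refused” line is the refused peer, which may not match your node's actual log format.

```shell
#!/usr/bin/env bash
# Sketch: read log lines on stdin and print every peer that shows up in
# "connection refused" messages at least LIMIT times (default 3).
# ASSUMPTION: the first IPv4:port pair on such a line is the refused peer;
# adjust the address pattern to whatever your log lines really contain.

count_strikes() {
    local limit="${1:-3}"
    awk -v limit="$limit" '
        # Case-insensitive match on the error text
        tolower($0) ~ /connection refused/ {
            # Take the first thing that looks like an IPv4 address plus port
            if (match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+/)) {
                count[substr($0, RSTART, RLENGTH)]++
            }
        }
        END {
            for (peer in count) {
                if (count[peer] >= limit) print peer
            }
        }
    ' | sort
}
```

Run as `count_strikes 3 < ~/node.log`, it prints candidates for the shitlist; deciding whether to actually drop a peer from topology.json stays a manual step.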
Since I try to go for yoga in the morning, I set things up so I can run a script from the desktop as soon as I roll out of bed in that direction. It opens a terminal window on each of our core & two relays, tiled on the screen, each showing that filtered view of the log files:
gnome-terminal --geometry=232x23+250+19 -- ssh -t spuser@core "bin/ctailf & bash -l"
gnome-terminal --geometry=232x22+250+506 -- ssh -t spuser@relay "bin/ctailf & bash -l"
gnome-terminal --geometry=232x22+250+967 -- ssh -t spuser@relay2 "bin/ctailf & bash -l"
If anything is a cause for alarm, it will stand out there. I resolved to someday have a script that looks for anomalous messages and sends them by email & SMS through Google Voice, so they reach me on a mobile device, but I was surprised to see how much reliability we got from just the usual cloud Ubuntu provisioning and a clean Linux setup with the new version of the node. I’ve found it’s really hard to kill, even deliberately.
Likewise, you can keep having a look at the logs in the same way whenever convenient, in the knowledge that the thing is likely bulletproof if you’ve taken the right precautions about ports, SSH, security, VPC, etc.
Our visual filter of the log files, which can be adapted according to your favourite means of viewing, currently looks like this:
tail -fn20 ~/node.log | \
grep --line-buffered -v 'UTC] Not leading slot' | \
grep --line-buffered -v 'UTC] fromList \[]' | \
grep --line-buffered -v ',String..TraceMempoolRemoveTx' |\
grep --line-buffered -v ',String..TraceMempoolRejectedTx'
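One possible way to package this filter as a reusable script (something like the bin/ctailf that the desktop launcher calls, though that script's actual contents aren't shown here) is a single grep with several -e patterns. A minimal sketch: the function name filter_node_log is hypothetical, and the second pattern assumes the “empty block list” messages appear in the log as `fromList []`.

```shell
#!/usr/bin/env bash
# Sketch: drop the "superfluous" messages from a node log read on stdin.
# ASSUMPTION: empty block lists are logged as "UTC] fromList []".

filter_node_log() {
    # -v inverts the match; each -e adds one pattern to exclude.
    # --line-buffered keeps output flowing when fed by "tail -f".
    grep --line-buffered -v \
        -e 'UTC] Not leading slot' \
        -e 'UTC] fromList \[]' \
        -e ',String..TraceMempoolRemoveTx' \
        -e ',String..TraceMempoolRejectedTx'
}
```

It would be used as `tail -n 20 -f ~/node.log | filter_node_log`. One grep process instead of four makes little practical difference here; the main gain is that adding or removing a pattern is a one-line change.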
… and this is the current version of our so-far bulletproof systemd script (installed in
ExecStart=/home/spuser/bin/cardano-node run --topology ./config/topology.json --database-path ./db --socket-path ./db/node.socket --port 5250 --config ./config/config.json
# NOTE these options are apparently ignored (systemd, after a slow kill, says "sending sigterm")
# Leaving in there because provided by Cardano specifications, so they may work someday...
# ... in the meantime, relying on reduced timeout (default = 90 seconds, set by systemd itself)
# (choosing this value relevant to the 21 seconds it takes for a full reboot)
# default = 4096 (may not be enough open files for state directory):
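For orientation, here is how a complete unit file built around that ExecStart line might look. This is a sketch, not the author's actual file: only ExecStart and the 2-second timeout come from the text above; the description, user, working directory, restart policy, signal choice, and open-file limit are all assumptions to adapt.

```ini
[Unit]
Description=Cardano node (sketch)
Wants=network-online.target
After=network-online.target

[Service]
# ASSUMPTION: same user as in the ssh commands; the relative paths in
# ExecStart are resolved against WorkingDirectory.
User=spuser
WorkingDirectory=/home/spuser
ExecStart=/home/spuser/bin/cardano-node run --topology ./config/topology.json --database-path ./db --socket-path ./db/node.socket --port 5250 --config ./config/config.json

# ASSUMPTION: restart whatever the exit reason, after a short pause.
Restart=always
RestartSec=5

# ASSUMPTION: SIGINT as the graceful-stop signal; the notes above report
# that the signal options appeared to be ignored in practice.
KillSignal=SIGINT
RestartKillSignal=SIGINT

# Reduced from the 90-second systemd default, as described earlier.
TimeoutStopSec=2

# ASSUMPTION: illustrative value for the open-file limit; pick your own.
LimitNOFILE=32768

[Install]
WantedBy=multi-user.target
```

After editing a unit file, `systemctl daemon-reload` makes systemd pick up the change; the kill-and-reboot torture testing described at the top is still the real proof that it works.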