Pool maintenance routines

Good evening everyone,

I hope this finds you well. I am curious (if people are willing to share) regarding different approaches to maintenance of a running pool.

Do most of you aim at having the routine maintenance automated by scripts? Is it naive to think of doing it manually?

An example: one core node talking to multiple relays (all of them cloud-based) and a “back-up” core that has the same specs as the main one (perhaps hosted locally). The latter only runs when the main core is down.

  1. The relays go down for reboot/update/cleanup etc in sequence, start up again (once per day, each relay);

  2. The core node goes down for maintenance, the backup starts running (daily);

How often do you tend to your systems? What sort of alerts do you set up in your monitoring consoles? How much of your daytime/nighttime is dedicated to maintenance and monitoring?

Any suggestions/comments welcome.

All the best.

Cheers,

A

Here are some things I needed to do to feel comfortable taking my eyes off the stake pool node log files every once in a while, after the initial days of staring at them like that creepy guy in The Matrix:

  • Make browser bookmarks to the general performance metrics for the cloud resources of each node (like DigitalOcean droplet graphs, a bit simple-minded but effective; or the AWS CloudWatch pages, not the detailed paid upgrade but just the free ones) to be sure RAM or CPU use isn’t clipping at the maximum for any of the instances.
  • Install a systemd script that restarts the node in a way that is absolutely bulletproof. Keep trying to kill that node by sending it every signal in the book & see if you can get it to fail to restart (see the rough test sketched after this list). Keep rebooting the node to see if you can make the cardano-node not start.
  • Make sure nothing delays restarting the server itself. Ubuntu, for instance, will impose a 90 second timeout if any systemd service (usual suspect = cardano-node) doesn’t “exit gracefully” when sent a kill signal. You can change that timeout from 90 seconds to 2 seconds safely on an SSD-based system, and then your AWS node or DO droplet can bounce up & down like a basketball: with an SSD root, 20 to 25 seconds for a full reboot including starting cardano-node.
  • Keep reading log files in SimpleView until you can recognise signs of trouble, like troublesome peers in topology.json. I started keeping a file called shitlist in the config directory and adopted a 3-strikes-and-you’re-out approach for repeat appearances of “connection refused” for any relay in the log files. (This manual part of the process will go away in the near future when the promised topology management by the node itself comes about.)
  • Then grep (filter) out the log messages which are superfluous (by our criteria, empty block lists from our core node, and transaction rejections & removals on all nodes). The rest of the log messages, if the topology.json is clean, will be just the advancements of the tip & confirmations of transactions, providing visual confirmation that the whole thing is working properly.
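
A rough way to exercise that restart behaviour (a sketch only, assuming the unit is installed under the cardano.service name shown further down and the process is named cardano-node):

for sig in TERM INT HUP KILL; do
    sudo pkill -$sig -x cardano-node      # kill the running node with this signal
    sleep 30                              # give systemd time to bring it back
    systemctl is-active cardano.service   # should report "active" again each time
done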

Since I try to go for yoga in the morning, I set it up so I can run a script from the desktop as soon as I roll out of bed in that direction. It opens a window onto each of our core & two relays, displaying that filtering of the log files in windows tiled on the screen:

gnome-terminal --geometry=232x23+250+19 -- ssh -t spuser@core "bin/ctailf & bash -l"
gnome-terminal --geometry=232x22+250+506 -- ssh -t spuser@relay "bin/ctailf & bash -l"
gnome-terminal --geometry=232x22+250+967 -- ssh -t spuser@relay2 "bin/ctailf & bash -l"

If anything is a cause for alarm, it will stand out there. I resolved to someday have a script that looks for anomalous messages and sends them by email & SMS through Google Voice, so they notify me via mobile device, but I was surprised to see how much reliability we got from just the usual cloud Ubuntu provisioning and a clean Linux setup with the new version of the node. I’ve found it’s really hard to kill, even deliberately.

Likewise, you can keep having a look at the logs in the same way whenever convenient, also in the knowledge that the thing is likely bulletproof if you’ve taken the right precautions about ports, SSH, security, VPC, etc.

Our visual filter of the log files, which can be adapted according to your favourite means of viewing, currently looks like this:

tail -fn20 ~/node.log | \
        grep --line-buffered -v 'UTC] Not leading slot' | \
        grep --line-buffered -v 'UTC] fromList \[]' |\
        grep --line-buffered -v ',String..TraceMempoolRemoveTx' |\
        grep --line-buffered -v ',String..TraceMempoolRejectedTx'

… and this is the current version of our so-far bulletproof systemd script (installed in /usr/lib/systemd/system/cardano.service)

[Unit]
Description=Cardano Node
# https://www.freedesktop.org/software/systemd/man/systemd.unit.html#StartLimitIntervalSec=interval
StartLimitIntervalSec=0
After=network.target

[Service]
ExecStart=/home/spuser/bin/cardano-node run --topology ./config/topology.json --database-path ./db --socket-path ./db/node.socket --port 5250 --config ./config/config.json
Type=simple
User=spuser
Group=spuser
WorkingDirectory=/home/spuser
SyslogIdentifier=cardano-node
# https://www.freedesktop.org/software/systemd/man/systemd.exec.html#StandardOutput=
StandardOutput=append:/home/spuser/node.log
StandardError=append:/home/spuser/node.log
# https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillSignal=
# NOTE these options are apparently ignored (systemd, after a slow kill, says "sending sigterm")
# Leaving in there because provided by Cardano specifications, so they may work someday...
KillSignal=SIGINT
RestartKillSignal=SIGINT
# ... in the meantime, relying on reduced timeout (default = 90 seconds, set by systemd itself)
# https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStopSec=
# (choosing this value relevant to the 21 seconds it takes for a full reboot)
TimeoutStopSec=2
# default = 4096 (may not be enough open files for state directory):
LimitNOFILE=32768
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
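
After dropping that file into place (or editing it), the usual systemd steps pick it up and arrange for it to start at boot:

sudo systemctl daemon-reload
sudo systemctl enable --now cardano.service
systemctl status cardano.service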

I realise also you may have been thinking of “maintenance” in the traditional sense, e.g. package & kernel upgrades, etc. This is what we have resolved:

  • Whenever we have to upgrade the node (as with the last release, 1.19.0), we’ll do our package upgrades then… as in: cp in the new binary, apt update, apt upgrade, reboot.
  • This is least likely to cause a disruption: the core node has to face 20 to 25 seconds of unavailability from a restarting system, but the nodes often take about 30 seconds to restart anyway, so the reboot isn’t proportionally significant.

Keep in mind apt upgrade or equivalent doesn’t always require rebooting the system… generally only kernel upgrades, the network stack, systemd-related things, or some universal library like libc do. However, Ubuntu kernel updates are likely to be far more frequent than new versions of cardano-node, so we have provisioned for a reboot upon every node software release as a matter of course. That will keep us far enough ahead of Ubuntu security releases, in my opinion.
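
On Ubuntu you don’t have to guess whether the pending upgrades actually want a reboot: the stock update machinery drops a marker file when they do (assuming update-notifier-common is installed, which it normally is on cloud images):

if [ -f /var/run/reboot-required ]; then
    cat /var/run/reboot-required.pkgs 2>/dev/null   # which packages triggered it
    sudo reboot
fi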

hi @COSDpool,

thank you for the detailed contribution! I have had to take a day, so I apologise for my belated reply.

This gives me a lot of food for thought, I have to keep learning! I will try to replicate your suggestions and, if I may, will ask questions along the way. In the meantime, regarding your shitlisting idea:

  • I am still in testnet as you may have gathered;
  • Grafana is what I use to monitor (ATM one of the relays only) via Prometheus;
  • I’ll try to add the refused connections metric to my dashboard and start from there;
  • If you have time, could you please “throw everything”, as you say above, at my node?

The IP is 54.251.68.18:3663. This would give me the opportunity to fix security if needed and also to see on my end whether the node refuses incoming connections (is this correct?)

Thank you again,

A

The only thing I could try to do to that IP address & port number is a DoS, or maybe a DDoS if I had a bot army. Maybe now that you’ve posted that information here, someone else will go for it :stuck_out_tongue_winking_eye: There is some evidence that relay nodes are already being probed:

… but in my measurements over the last week or so it’s only about 300 bogus connections per day: not that different from a typical web server or random failed login attempts. The consensus also with @_ilap is that the node should be able to handle the spurious connections and that no additional manual work is required.

Anyone who wants to verify that figure of 300 could run something like this (assuming they have SimpleView logging… I don’t know if this would work for other logging schemes), which would clump together the bad connections from particular IP addresses:

grep ErrorPolicySuspendPeer node.log | awk '{print $NF}' | sed 's/"//g' | sort | more
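
To actually count the offenders rather than just list them, a small variation on the same pipeline (assuming, as above, that the suspended peer’s address is the last field of the log line):

grep ErrorPolicySuspendPeer node.log | awk '{print $NF}' | sed 's/"//g' | sort | uniq -c | sort -rn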

I have mostly been running that shitlist to keep the log files clear of habitual offenders and thereby make SimpleView logging easier to read, so if you’re only using the Grafana dashboard that’s not an issue for you.

The resilience of the node itself is covered here in this blog topic, including the development of what they’ve been calling the “peer to peer governor” which will automatically manage node topologies, presumably refusing connections from known bad guys:

https://iohk.io/en/blog/posts/2020/07/09/cardanos-path-to-decentralization-by-marcin-szamotulski/

According to this schedule that connection quality management will be coming out “soon” and according to Charles’ last AMA the developers returning in September are working on the peer-to-peer governor as we speak.

Thanks COSDpool,

I have posted something in relation to this in another thread, so answer there if you like/prefer. I am just looking for a metric to track incoming/rejected connections. Do I look for this in cardano-node metrics or others (eg ones exported by node_exporter)?

Your answers are extremely useful for someone like me, thank you for your time!

Cheers,

A

I think I added this as an edit while you were posting that last message, and they crossed each other. If your metrics allow you to flag appearances of ErrorPolicySuspendPeer then I think that would be a good place to start.

Nope, nodes won’t have any information about other peers beyond their IPs. Nodes would maintain three buckets of peers: cold (a few hundred), warm (around 50, with a proven bearer connection), and hot (around 20, with a bearer and running mini-protocols). If a peer behaves dodgily it will simply be disconnected and cleared from the buckets.

thanks, then the only difference between what you are saying & what I said was that the peers labelled as dodgy will be allowed to re-connect. It’s not what I would have expected but I can see the sense in that… e.g. malfunctioning nodes would be able to re-enter the topology after their problems are fixed.

Everybody is free to join and leave whenever they want; the administration overhead would cost much more than the system could gain. Adversaries can change their IP and port, or even use some zombie network. This seems to me to be coming from some fears. Look at Bitcoin: it is online, publicly facing the internet, and that is how a protocol should evolve, like constant fighting on the battlefield. So, do not think too much about it. The threats will not come from some malfunctioning nodes; they will come from other sources. :slight_smile:

I should add in hindsight that some measures should also be taken to back up the node --database-path directory. What we have taken to doing in the meantime since the above was written…

Since we get visual confirmation at least once a day that the core & relays are advancing the tip in tandem, are all showing transactions appearing in the network, and are not choked with errors… at that point we run

nice rm -rf ~/db.old; nice cp -pr ~/db ~/db.old

(as an alias) which takes a quick DB directory snapshot.
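
In .bashrc that could be something like this (dbsnap is just a placeholder name for the alias):

alias dbsnap='nice rm -rf ~/db.old; nice cp -pr ~/db ~/db.old'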

There are means with DigitalOcean “block storage” and AWS EBS to keep this storage independently of the node. But I anticipate that in the kind of software accident that could wreck your copy of the blockchain and/or ledger, the node itself is exactly where you want that backup to be. If the whole node system gets trashed, you’ll have 1 or 2 redundant backups to access via rsync from another node.
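
Recovering that way is just a pull over SSH, along these lines (a sketch only; spuser, relay and the ~/db paths are this thread’s example names; stop the node before restoring):

rsync -az spuser@relay:db.old/ ~/db/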

Another good time to snapshot that DB directory is before any upgrade to a new version of cardano-node.

I would use rsync, much cheaper and more effective.
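
For the local snapshot that would be something like the following; rsync copies only the files that changed since the last run, which is what makes it cheaper:

nice rsync -a --delete ~/db/ ~/db.old/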

hi @COSDpool,

I hope this finds you well. I have finally gotten to the part where I attempt to make cardano-node into a systemd service. I modified the script to suit my setup. However, when I attempt to start the service I get this error:

Failed to start cardano.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files

What I describe above does not happen on my local machine, only on an AWS EC2 instance. Do you know why this happens?

Let me know if you need to see the modified .service file.

EDIT after some local testing: I have noticed that the files within the DB folder (immutable, ledger and volatile) become locked behind root privileges. This seems to prevent running the node as anything BUT a systemd service… I manage to stop it, and it restarts very quickly and well, but if I want to run the node in, say, LiveView through the command line (as I have become accustomed to doing) I cannot. Is this expected?

Thank you in advance for your time,

A

If PolicyKit is getting involved (which you normally see when it steps in to require a sudo user account password to run things as root), you probably have to add the cardano service commands as exceptions in a sudoers.d script.

You can try adding a file (where e.g. spuser is the name of the SPO user account configured in cardano.service):

/etc/sudoers.d/99-spuser (file name is not important)

with the contents

spuser ALL= NOPASSWD: /usr/bin/systemctl stop cardano
spuser ALL= NOPASSWD: /usr/bin/systemctl start cardano
spuser ALL= NOPASSWD: /usr/bin/systemctl disable cardano
spuser ALL= NOPASSWD: /usr/bin/systemctl restart cardano
spuser ALL= NOPASSWD: /usr/bin/systemctl enable cardano
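
It’s worth checking the syntax of that file before closing your root shell, since a broken sudoers fragment can lock you out of sudo entirely:

visudo -c -f /etc/sudoers.d/99-spuser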

p.s. I’ve just seen your EDIT question and the only thing I can suggest is that the root-owned files are artefacts from running cardano-node sometimes as root and sometimes maybe as spuser. I would recommend, once you’re sure cardano-node is able to run as spuser without trouble from systemd, to stop the node & run this as root to rationalise the file ownerships once & for all:

chown -R spuser: ~spuser/db (or wherever your DB is located)

hi @COSDpool,

thank you for the prompt reply. As follows:

"p.s. I’ve just seen your EDIT question and the only thing I can suggest is that the root-owned files are artefacts from running cardano-node sometimes as root and sometimes maybe as spuser . I would recommend, once you’re sure cardano-node is able to run as spuser without trouble from systemd , to stop the node & run this as root to rationalise the file ownerships once & for all:

chown -R spuser: ~spuser/db (or wherever your DB is located) "

I have tried this bit locally, and the service works perfectly now. However, all systemctl commands appear to require a password (except for status). The issue was that I had not specified my user in the unit, so it was creating root-owned files.

As to the rest of your suggestions, I modified the unit files to reflect my local changes also on the AWS instance. I also added the file as specified by you in the sudoers.d directory. I have found a couple of problems:

  • these operations all required me to act as root;
  • the standard user which AWS creates by default when launching a new EC2 instance does not require a password to become root (security issue?);
  • it still will not let me run the service as the default user;
  • when I try to run the service as root (not ideal, I just wanted to see if I could), the necessary libraries are not loaded (I imagine this is because the modified .bashrc file is not available to root, and the .local/bin directory where cardano-node lives is not in the $PATH) (eg I get a code=exited status=127 failed state);
  • systemd also seems not to like StartLimitIntervalSec=0 and RestartKillSignal=SIGINT, so I commented these lines out (however this is not an issue on my local ubuntu for some reason).

I have formed the impression that having Cardano run as a systemd service makes it faster to boot, less prone to being accidentally killed, and able to restart no matter what (eg if the AWS instance were to stop and restart, it would restart the node with it). However, I have a couple of conceptual issues:

  • how (other than querying the node.log file) am I able to interact with the node and observe it on the server? I have become used to watching it run as LiveView and that can no longer happen, or am I wrong? I have installed a monitoring console on my local machine, so I guess that would not be too much of an issue, and AWS have their own (free) monitoring console, so I guess I could add that;

  • before understanding this whole systemd service thing, I was starting and stopping nodes manually. It takes me about a minute to reboot an AWS instance and start the node back up. I guess with a lot of practice I could bring it down to 40 seconds. Most of the time is wasted by the actual reboot (going back to the above). This would not be an issue for the relays, as they can go down in turn, keeping the BP connected. For the BP, 60 seconds per day doesn’t seem too bad; what is your opinion?

Sorry for the lengthy reply, the more I dig the more I find new questions :slight_smile:

Thank you in advance for your time,

A

I think then you also have to add this line with the others, and it can be in the same /etc/sudoers.d/* file:

spuser ALL=(ALL) NOPASSWD:ALL

That’s because the default AWS instance already has such a line added by cloud-init when the instance is created. You should see the corresponding line for the default user here in the same directory: 90-cloud-init-users
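
On a stock Ubuntu AMI that file usually contains something to this effect (ubuntu being the default user there; check your own image rather than copying this verbatim):

# /etc/sudoers.d/90-cloud-init-users, as created by cloud-init
ubuntu ALL=(ALL) NOPASSWD:ALL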

These parameters also work well for me on Ubuntu. For Amazon Linux & others, your mileage may vary. Every time a systemd parameter seemed system dependent, or differed from a cardano-node systemd template I’d seen before, I added a web link to the corresponding documentation page. Here are some other templates for you to pick & choose parameters from: