No luck finding an answer for this on the IOHK GitHub, Telegram, or elsewhere on the forum (my choice of search keywords was probably not very effective), or through trial & error:
TL;DR: we need a way to send a signal, a packet, some JSON, add/delete a file, or anything else that will cause cardano-node to “exit nicely” before termination, so it doesn’t have to go through a huge DB consistency check & cleanup the next time it starts.
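For concreteness, the best we can do today is the plain stop-and-wait below; a minimal sketch assuming cardano-node treats SIGINT (the signal Ctrl-C sends) as its shutdown request. The pidof lookup and the wrapper itself are my own convention, not anything documented:

# Hypothetical graceful-stop wrapper; assumes SIGINT (same as Ctrl-C)
# is the cleanest shutdown request cardano-node currently honours.
pid=$(pidof cardano-node) || exit 1   # find the running node
kill -INT "$pid"                      # ask it to stop; never SIGKILL
while kill -0 "$pid" 2>/dev/null; do  # wait for the process to exit
  sleep 1
done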
cardano-node is working fine for us & syncing all the way to the tip, but I’m restarting it a lot in the course of testing different topologies, logging configurations, fault tolerance, etc. Every time it gets an INT or TERM signal (usually Ctrl-C from the console where I’m running it), it takes nearly 40 minutes (on our 2-CPU cloud host) to regain sync:
a couple of minutes of “warming up” after starting, before it moves past the initial block of messages…
then begins with a message like: Replayed block: slot SlotNo {unSlotNo = 4569140} of At (SlotNo {unSlotNo = 4758000})
and then about 20 minutes of rebuilding the database to get somewhere near the current time, showing “progress” percentages like this, always starting in the mid-90% range (even if the node has been down for less than a minute):
block replay progress (%) = 96.1
… mounting slowly all the way up to 100%
… then displaying a confirmation of some kind: Opened lgr db
(followed by a several minute delay)
and only then moving forward in the usual way, catching up more quickly until it reaches the current tip & finally stays in sync.
I can’t bring the relay online knowing that any change in topology would effectively take the relay down for a period like this. We don’t have an under-provisioned host; it’s not “slow” in any way: we’re just seeing the common consequence of a program with a delicate data structure being “terminated abnormally” rather than “gracefully shut down.”
kill -1 (i.e. SIGHUP) or some other signal is a common UNIX convention for getting a daemon to restart, or sometimes to re-read its config files. (Other UNIX daemons re-read their config files when they get a HUP / hangup signal; as a side note, it would be grand if cardano-node had some means like that of re-reading topology.json.)
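To illustrate the convention: nginx is a well-known daemon that re-reads its configuration on SIGHUP without dropping connections. Nothing in cardano-node’s documentation suggests it handles HUP this way, so the second line below is strictly wishful thinking:

kill -HUP $(pidof nginx)           # nginx re-reads nginx.conf in place
# kill -HUP $(pidof cardano-node)  # wished-for: re-read topology.json (NOT supported)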
I’m now at my wits’ end, since there will definitely be times we want to stop the node deliberately, and we can’t have the database left in a state that forces these long consistency checks every time before the long process of re-syncing begins again. I’m hoping this is some piece of common knowledge that I’ve somehow missed, but if not, could this question please be fielded by one of the developers?
thanks @laplasz, that’s a sensible suggestion that I hadn’t seen, but this command by itself doesn’t do anything (maybe it needs to be passed another argument?):
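For reference, the command in question (named again further down) was:

cardano-cli shelley system stop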
I’m checking the community docs now, but please, anyone else, let us know if there’s another solution; or perhaps one of the devs can tell us how to use the command above.
The above command may be unimplemented, but after about a day of relay syncing we’ve found that a fully synced node doesn’t take as long to restart & catch up again: about 6 minutes in the case below for our 2-CPU relay host. Our last cardano-node kill & restart (we still have no better way than SIGINT / Ctrl-C) got to the fully synced stage with only a brief “block replay” stage:
[2020-08-05 13:35:39.61 UTC] CardanoProtocol
[2020-08-05 13:37:56.68 UTC] Opened vol db
[2020-08-05 13:38:01.71 UTC] Replaying ledger from snapshot DiskSnapshot 73 at (Point 5022120, …
[2020-08-05 13:38:02.00 UTC] Replayed block: slot SlotNo {unSlotNo = 5022140} of At (SlotNo {unSlotNo = 5023560})
[2020-08-05 13:38:06.30 UTC] block replay progress (%) = 100.0
[2020-08-05 13:38:08.33 UTC] Opened lgr db
[2020-08-05 13:41:07.13 UTC] Opened db with immutable tip at (Point 5023560, …
[2020-08-05 13:41:16.60 UTC] Chain extended, new tip: (Point 5068520, …
[2020-08-05 13:41:31.64 UTC] Chain extended, new tip: (Point 5068600, …
[2020-08-05 13:41:51.73 UTC] Chain extended, new tip: (Point 5068620, …
[2020-08-05 13:42:11.60 UTC] Chain extended, new tip: (Point 5068640, …
I would be happy to see more responses; in the meantime I’m concluding that a fully synced node has less to “clean up” after a restart than one which is still catching up with the tip… like our relays were for the first 8 hours or so after launch. This can also be seen in the disk I/O, memory usage, and network traffic of those hosts.
I’d still like to know what cardano-cli shelley system stop is really supposed to do. Perhaps its implementation isn’t a priority, given the incredible deadlines the development team has been facing.
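If anyone wants to poke at it: cardano-cli is built on optparse-applicative, so every subcommand should at least answer --help, which might reveal whether stop expects extra arguments:

cardano-cli shelley system stop --help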
Good point @COSDpool about having a graceful restart mechanism. Restarting the nodes feels scary now: I’m always questioning whether the node will come back up in sync, or whether the DB will be corrupted.
Perhaps there could be some features to keep the nodes up & running without restarting:
A hot-reload feature: making cardano-node capable of reloading certain files, e.g. the topology, without restarting the node. It could reload them automatically every x period, perhaps with a polling parameter you can provide? (A rough workaround is sketched after this list.)
An active-active or active/passive setup: a feature where you can point to a passive node, so that if the active node needs e.g. an update (a newer version), the passive one takes over as active without losing any connections.
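In the absence of a real hot-reload, here is a crude poll-and-restart sketch for the first idea; the file path, interval, and systemd unit name are all assumptions, and a restart still pays the block-replay cost described above, so this is only a stopgap:

# Hypothetical topology poller; path, interval & unit name are assumptions.
TOPOLOGY=/etc/cardano/topology.json
while true; do
  old=$(sha256sum "$TOPOLOGY" | cut -d' ' -f1)
  sleep 3600                          # poll once an hour
  new=$(sha256sum "$TOPOLOGY" | cut -d' ' -f1)
  if [ "$old" != "$new" ]; then
    systemctl restart cardano-node    # still incurs the replay delay
  fi
done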
Here’s the latest I’ve heard (generally agreeing that restarting a synced node, if it’s all the way up, happens in seconds… maybe 2 or 3 minutes if it hadn’t synced all the way)…