Node Restart Time

Sometimes, when I restart my cardano node (I am currently on 8.1.2, but have noticed this on every version) it only takes a few minutes to get back to running, while other times the status remains at “starting” for several hours. Is there a reason for this and a way to ensure it only is down for a short amount of time when it restarts?

1 Like

You probably don’t stop the node the right way, so it has to do a revalidation on startup which can take a long time…

If you run the node as a service, make sure the service config file contains the following:

KillSignal=SIGINT
RestartKillSignal=SIGINT
TimeoutStopSec=120

The first two will ensure the right signal is sent for stopping or restarting the node. The last one will terminate the service forcibly by SIGKIL if it hasn’t stopped before the timeout is reached (which will trigger the revalidation on next startup). Adjust this timeout (in seconds) to what works for you.

More info on systemd.service.

3 Likes

This parameter isn’t required to properly terminate the node with SIGINT and I wouldn’t recommend adding it.

We had it set from the early days of our pool according to what we always thought were reasonable timeout values & that boundary kept being broken as the node evolved… and at each of those increases in proper shutdown time it caused at least one node crash with corrupted Immutable DB (the thing that takes a couple hours to rebuild after a crash).

Without TimeoutStopSec= the node will stop as soon as its DB is properly saved & its P2P connections properly terminated… any systemd settings try to make it stop before that point will leave the DB corrupted. In general @mcvetyty only you can say if you can use this parameter properly in your own node operation & hopefully this will help you make up your mind: https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#TimeoutStopSec=

I would never use this parameter again, especially after P2P was introduced to the Cardano node which, as far as I know, makes its shutdown time unbounded.

2 Likes

Thanks for the reply to both of you… @COSDpool - are you recommending removing the variable entirely? It was there when I first edited the service file a few hours ago and was set to something like 3 or 5, so I increased it to 120 per the suggestion of @brouwerQ, but are you saying having it removed completely is the best way now?

2 Likes

That’s why I said adjust to what works for you.

With that parameter set the node will also stop as soon as its DB is properly saved… It’s not that you need to wait the whole time.

Yes, but that parameter could also help in the unlikely event that the node shutdown goes in an infinite error or something. You can also set it to something really high instead of just ommiting it.

1 Like

??? The documentation you linked explicitly says:

Defaults to DefaultTimeoutStopSec= from the manager configuration file (see systemd-system.conf(5)).

And that is 90s on my system. So, not setting it, will be worse than setting it to something above 90s.

3 Likes

3 or 5 seconds will definitely be too short. In some cases it will be enough though, that’s why you don’t get the problem every time you restart the node.

DO NOT omit it, because the it will be set to whatever your system’s default value is, which might be too short. That’s why I set it in the first place, so I’m sure of what the value is (instead of a default). I’ve never encountered any problems with a value of 120 seconds. Apparently, you can also set it to the value infinity if you don’t want a timeout. See the link to the documentation from my first response.

2 Likes

I have my TimeoutStopSec value set to 300 because my ARM machine is quite slow. Mostly this is long enough that it isn’t hit. But, to my mind, this configuration is a trade-off.

Here are two examples, I just captured. Both are cardano-node relays running on otherwise identical ARM machines which have the equivalent of 4 raspberry pi processor cores and 24G ram each. They are a bit under-powered.

The first node restart hits the 300 second timeout.

When I tell systemd to restart cardano-node, I initially I see this journal message:

Nov 28 07:48:47 relay1 cardano-node[794652]: [relay1:cardano.node.DiffusionInitializationTracer:Critical:7] [2023-11-27 21:48:47.90 UTC] DiffusionErrored user interrupt

Then after 300 seconds the timeout is reached and systemd kills it. The journal prints:

Nov 28 07:53:41 relay1 systemd[1]: cardano-node.service: State ‘stop-sigterm’ timed out. Killing.
… a bunch of kill messages …
Nov 28 07:53:41 relay1 systemd[1]: cardano-node.service: Main process exited, code=killed, status=9/KILL

Let’s see how long it takes to get up running again:

I see the usual journal messages like:

[2023-11-27 21:53:45.42 UTC] Validating chunk no. 5069 out of 5069. Progress: 99.98%
[2023-11-27 21:53:49.09 UTC] Validated chunk no. 5069 out of 5069. Progress: 100.00%
[2023-11-27 21:53:49.10 UTC] Found a valid last location at chunk 5069 with tip 53acfa46df7d0765d08841ad9bca65af412039e27a49a82cf552b26b67c3def4@109510135.
[2023-11-27 21:53:49.11 UTC] Opened imm db with immutable tip at 53acfa46df7d0765d08841ad9bca65af412039e27a49a82cf552b26b67c3def4 at slot 109510135 and chunk 5069
[2023-11-27 21:53:49.11 UTC] Started opening Volatile DB
[2023-11-27 21:53:59.27 UTC] Opened vol db
[2023-11-27 21:53:59.27 UTC] Started opening Ledger DB

Then there is a seemingly worrying message:

[2023-11-27 21:59:36.25 UTC] Invalid snapshot DiskSnapshot {dsNumber = 109510135, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 2007257572 "end of input"))

Then 12 minutes later:

[2023-11-27 22:06:07.59 UTC] Replaying ledger from snapshot at 5023ee18aa49acddbc8b9220cdfc53e31364714d092353c96f9daf498a386cc8 at slot 109505270
[2023-11-27 22:06:10.00 UTC] Replayed block: slot 109505304 out of 109510135. Progress: 0.70%
[2023-11-27 22:06:13.30 UTC] Opened lgr db
[2023-11-27 22:06:13.30 UTC] Started initial chain selection
[2023-11-27 22:06:14.76 UTC] Pushing ledger state for block 651d091f5a57f422141bb3d5dec1ca738ace773f682590e11ff347810bf7d3a5 at slot 109510141. Progress: 0.00%

Finally:

[2023-11-27 22:09:33.54 UTC] Pushing ledger state for block 2cbe44e8[140/9235] 7dec06b775508ed7cf7f703674dcb74d507e937509b554 at slot 109555421. Progress: 100.00%
[2023-11-27 22:09:33.54 UTC] Valid candidate at tip 651d091f5a57f422141bb3d5dec1ca738ace773f682590e11ff347810bf7d3a5 at slot 109510141                        
[2023-11-27 22:09:33.54 UTC] Initial chain selected                           
[2023-11-27 22:09:33.55 UTC] Opened db with immutable tip at 53acfa46df7d0765d08841ad9bca65af412039e27a49a82cf552b26b67c3def4 at slot 109510135 and tip 2cbe44e8b10e4a1e9a7dec06b775508ed7cf7f703674dcb74d507e937509b554 at slot 109555421    
[2023-11-27 22:09:33.55 UTC] Will terminate upon reaching NoShutdown
[2023-11-27 22:09:33.55 UTC] CreateSystemdSocketForSnocketPath (LocalAddress "/run/cardano/mainnet-node.socket")
[2023-11-27 22:09:33.55 UTC] CreatedLocalSocket (LocalAddress "/run/cardano/mainnet-ode.socket")
[2023-11-27 22:09:33.55 UTC] ConfiguringLocalSocket (LocalAddress "/run/cardano/mainnet-node.socket") (FileDescriptor 32)
[2023-11-27 22:09:33.55 UTC] ListeningLocalSocket (LocalAddress "/run/cardano/mainnet-node.socket") (FileDescriptor 32)
[2023-11-27 22:09:33.55 UTC] LocalSocketUp (LocalAddress "/run/cardano/mainnet-node.socket") (FileDescriptor 32)

So this ARM relay took nearly 21 minutes to get back up and running. Though 5 minutes of that was cardano-node sitting in some loop before systemd eventually killed it. So we could say that it really took 16 minutes to restart. It apparently had to rebuild more of the ledger from an earlier snapshot.

The second node restart doesn’t hit the timeout:

Nov 28 08:17:28 relay2 cardano-node[810478]: Shutting down..                    
Nov 28 08:17:28 relay2 cardano-node[810478]: cardano-node: DiffusionError user interrupt                                                                        
Nov 28 08:17:37 relay2 systemd[1]: cardano-node.service: Deactivated successfully.                                                                              
Nov 28 08:17:37 relay2 systemd[1]: Stopped cardano-node.service - Cardano Node. 
Nov 28 08:17:37 relay2 systemd[1]: cardano-node.service: Consumed 1d 30min 45.31 0s CPU time.
Nov 28 08:17:37 relay2 systemd[1]: Started cardano-node.service - Cardano Node.

Nb. Local time is UTC+10.

And, after 10 minutes it is up and running:

[2023-11-27 22:27:43.92 UTC] Pushing ledger state for block 7aa6425753580988dd
7983a056ce2883600660a73b171c9e1c2be7b36fcd6c22 at slot 109557133. Progress: 100.00%                                                                             
[2023-11-27 22:27:43.92 UTC] Valid candidate at tip b8941561c9701e971930f8b5546a8e55bc6356f560bbd4647dce236e52fba429 at slot 109512082                        
[2023-11-27 22:27:43.92 UTC] Initial chain selected                           
[2023-11-27 22:27:43.92 UTC] Opened db with immutable tip at 924e3ffc456a6e328bdea56b4b4879fb92113f81f178775ec682c8b659e72648 at slot 109512049 and tip 7aa6425753580988dd7983a056ce2883600660a73b171c9e1c2be7b36fcd6c22 at slot 109557133
[2023-11-27 22:27:43.93 UTC] Will terminate upon reaching NoShutdown
[2023-11-27 22:27:43.93 UTC] CreateSystemdSocketForSnocketPath (LocalAddress "/run/cardano/mainnet-node.socket")
[2023-11-27 22:27:43.93 UTC] CreatedLocalSocket (LocalAddress "/run/cardano/mainnet-node.socket")

So if I take out the 5 minutes of systemd waiting to kill the seemingly endlessly looping cardano-node process, then the relay that hit the timeout only took around 6 minutes longer to restart. This is why I think the timeout value is a trade-off. For example, if I set it to 10 minutes and my slow ARM relay takes say 9 minutes of that to save it’s ledger properly, or whatever it is trying to do, then the total restart time will be longer than if I let systemd kill it after 5 minutes.

I don’t understand enough about what cardano-node is trying to do when the 300s timeout is reached on my ARM machines. Maybe it successfully saves part of the ledger state properly before systemd kills it? In other words, I could be worse off if I set the systemd TimeoutStopSec value to less than 300s because maybe that would result in a complete rebuild of the ledger taking several hours??? Who knows?

3 Likes

By the way, if you leave RestartKillSignal unset then it will default to the value of KillSignal. Systemd man page for RestartKillSignal says:

Not set by default, and the value of KillSignal= is used.

3 Likes