Node problem following bad KES rotation, fixed but log: HandshakeError (VersionMismatch)

krista · 22 November 2024 17:43

Hubris. After running our node for a couple of years now, I’ve managed to break something and would like to ask for some assistance - I have not been able to start the node. It appears to start but log entries indicate a problem and I’m rather stuck.

This followed a few days after a smooth update to 10.1.1. With no issues from the update, it was time to rotate the KES keys (13 days before expiration). I can’t count the number of successful rotations I’ve done, and since the original node setup was based on the Coincashew guides, I decided to try the FRADA script, which is where I went astray - my fault, not the script’s.

I ran the script twice - had not realized an error in the first pass and then accidentally re-ran it. Ouch - it bumped the node counter which I reset and verified (cardano-cli conway query kes-period-info --mainnet --op-cert-file node.cert).

The node remains “starting” and the log entries are not helpful. I can see there is a problem but do not understand the error, log excerpt below. Relays and BP are 10.1.1. I’ve run through a series of checks but am now concerned I’ll make things worse.

Would appreciate any assistance or guidance that you could offer.

BP log:

[bone:cardano.node.ConnectionManager:Info:5048] [2024-11-22 16:59:52.68 UTC] TrConnectionHandler (ConnectionId {localAddress = 127.0.0.1:6000, remoteAddress = 127.0.0.1:52494}) (TrHandshakeServerError (HandshakeProtocolError (HandshakeError (VersionMismatch [NodeToNodeV_13,NodeToNodeV_14] ))))

[bone:cardano.node.ConnectionManager:Info:5047] [2024-11-22 16:59:52.68 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 0, unidirectionalConns = 0, inboundConns = 0, outboundConns = 0})

[bone:cardano.node.LeadershipCheck:Info:4729] [2024-11-22 16:59:53.00 UTC] {"chainDensity":4.8154343e-2,"credentials":"Cardano","delegMapSize":1314018,"kind":"TraceStartLeadershipCheck","slot":140728502,"utxoSize":11160220}

[bone:cardano.node.Forge:Error:4729] [2024-11-22 16:59:53.02 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.40728502e8)]))]

[bone:cardano.node.LeadershipCheck:Info:4729] [2024-11-22 16:59:54.00 UTC] {"chainDensity":4.8154343e-2,"credentials":"Cardano","delegMapSize":1314018,"kind":"TraceStartLeadershipCheck","slot":140728503,"utxoSize":11160220}

[bone:cardano.node.Forge:Error:4729] [2024-11-22 16:59:54.03 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.40728503e8)]))]
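
For anyone reading along, here is a small Python sketch of how one might pull the two endpoints and the version list out of a VersionMismatch line like the one above. The regex and the function name `parse_version_mismatch` are my own (hypothetical, not part of cardano-node), and it only assumes the line layout seen in this excerpt.

```python
import re

# Hypothetical helper: matches the ConnectionId endpoints and the first
# bracketed list after "VersionMismatch" in a cardano-node log line.
HANDSHAKE_RE = re.compile(
    r"localAddress = (?P<local>[\d.]+:\d+), "
    r"remoteAddress = (?P<remote>[\d.]+:\d+)"
    r".*?VersionMismatch \[(?P<versions>[^\]]*)\]"
)


def parse_version_mismatch(line):
    """Return endpoints and version names, or None if the line does not match."""
    match = HANDSHAKE_RE.search(line)
    if match is None:
        return None
    versions = [v.strip() for v in match.group("versions").split(",") if v.strip()]
    return {
        "local": match.group("local"),
        "remote": match.group("remote"),
        "versions": versions,
        # True when the peer is on the loopback interface (same host).
        "remote_is_loopback": match.group("remote").startswith("127."),
    }


# Sample taken from the BP log excerpt in this thread.
SAMPLE = (
    "TrConnectionHandler (ConnectionId {localAddress = 127.0.0.1:6000, "
    "remoteAddress = 127.0.0.1:52494}) (TrHandshakeServerError "
    "(HandshakeProtocolError (HandshakeError (VersionMismatch "
    "[NodeToNodeV_13,NodeToNodeV_14] ))))"
)

if __name__ == "__main__":
    print(parse_version_mismatch(SAMPLE))
```

In this excerpt both addresses are on 127.0.0.1, so the peer that failed the handshake is a process on the same host as the BP, not a remote relay.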

jeremyisme · 22 November 2024 18:14

My only suggestion is that there’s a version that’s different somewhere, but as you said, they’re all running 10.1.1. Did you update the config for all the nodes when you were upgrading?

krista · 22 November 2024 19:21

After a few hours, the node switches to validating chunks.

krista · 22 November 2024 19:29

Upgraded from 9.1.0 to 10.1.1 - so the Conway changes were already in place for this upgrade and I made no changes to any of the config files. There was a change at compile time for HsOpenSSL since I’m on Ubuntu 22, and I’m pretty sure it was in place for each relay and the BP.

Wondering if I should try to downgrade back to 9.1, but again, making things worse or more convoluted is a risk, so I’d like to avoid that step if possible.

jeremyisme · 22 November 2024 19:50

Ah, I think you just need to leave it to replay. There’s a replay for 10.1.1 that you’ll need to leave to finish. It took mine quite a few hours.

I also think 10.1.2 now compiles without the SSL switch on Ubuntu 22.

kirael · 22 November 2024 23:44

Hello there

What type of error did you encounter while running the script?

krista · 23 November 2024 13:26

@jeremyisme Gave the replay another 15 hrs but the node never started. It continued the cycle: the errors above, then replay, then repeat, with each phase lasting an hour or two. Then, 2 hrs ago, I restored the BP back to 9.1. We’re off the schedule for this session so I have a little flexibility to get it running. However, 9.1 is not working either - the errors look similar. I’ve used the spot check (Config File Check | Cardano SPOT Check) information to validate config files but with 9.1 not working…

@kirael Thank you so much for the script - it is a great contribution and much appreciated. The error was mine - I ran 2.0, twice. I did not realize the initial kes period query failed so that null was carried forward.

At this point - organize cold keys and start from scratch, nuke it from orbit?

Current log output.

jeremyisme · 23 November 2024 18:40

What does your gLiveView show? Is your node synced and just having trouble connecting to the other nodes? Or is it struggling to make connections?

krista · 24 November 2024 00:59

Interesting - I’d been focused on the log but gLiveView no longer shows ‘starting’ but reflects ‘syncing’ with KES expired. The relays remain ‘starting’. No network changes or upgrades to the hosts but your point re: connection trouble seems on point.

2 mins after this screen cap, gLiveView rolled over - status is no longer ‘syncing’ but back to ‘starting’ - see bottom pic

jeremyisme · 24 November 2024 17:37

So the core is rolling back to “starting”, or just the relays?

Your KES key has expired too. Did you set the right StartingKESPeriod?
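
As a rough sanity check on the period arithmetic, here is a minimal Python sketch. It assumes the usual mainnet Shelley genesis values (`slotsPerKESPeriod` = 129600, `maxKESEvolutions` = 62), so please confirm them against your own shelley-genesis.json. The function names are just illustrative.

```python
# Assumed mainnet Shelley genesis parameters; verify in shelley-genesis.json.
SLOTS_PER_KES_PERIOD = 129_600
MAX_KES_EVOLUTIONS = 62


def current_kes_period(slot, slots_per_kes_period=SLOTS_PER_KES_PERIOD):
    """KES period that contains the given absolute slot (integer division)."""
    return slot // slots_per_kes_period


def remaining_kes_periods(start_period, slot,
                          slots_per_kes_period=SLOTS_PER_KES_PERIOD,
                          max_evolutions=MAX_KES_EVOLUTIONS):
    """Periods left before an op cert issued at start_period expires.

    Zero or a negative number means the KES key has already expired.
    """
    expiry_period = start_period + max_evolutions
    return expiry_period - current_kes_period(slot, slots_per_kes_period)


if __name__ == "__main__":
    slot = 140_728_502  # slot from the BP log earlier in this thread
    print(current_kes_period(slot))  # 1085 under the assumed parameters
```

So under those assumptions, an op cert issued around the time of that log would want a starting period of 1085, and anything issued 62 or more periods earlier would show as expired.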

krista · 24 November 2024 21:14

Correction - I did reset the StartingKESPeriod for the rotation.

Core and relays are ‘starting’. In respective logs, each on their own host, they’re replaying the ledger. I restored the 9.1 binaries and restarted - I understand if I’d broken configs that the core might still have problems that I can’t yet solve but the relays haven’t been touched, aside from the binary restore (to my prev compiled version 9.1).

Nov 24 16:10:16 host01 cardano-node[201490]: [relay01:cardano.node.ChainDB:Info:5] [2024-11-24 21:10:16.54 UTC] Replayed block: slot 98128747 out of 140518402. Progress: 69.83%
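
If it helps while waiting, here is a rough Python sketch for estimating how long a replay has left from two of these "Replayed block" lines. The regex and function names are hypothetical, it assumes the log layout shown above, and it assumes a constant replay rate in slots per second, which is only an approximation since the rate can vary over the chain.

```python
import re
from datetime import datetime, timedelta

# Matches the UTC timestamp plus the slot and target slot of a replay line.
REPLAY_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ UTC\] "
    r"Replayed block: slot (?P<slot>\d+) out of (?P<tip>\d+)"
)


def parse_replay_line(line):
    """Return (timestamp, slot, target_slot), or None if the line does not match."""
    match = REPLAY_RE.search(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    return ts, int(match.group("slot")), int(match.group("tip"))


def progress_percent(slot, tip):
    """Replay progress as a percentage of the target slot."""
    return 100.0 * slot / tip


def estimate_remaining(earlier, later):
    """Estimate remaining replay time from two parsed samples (linear rate)."""
    t1, slot1, _ = earlier
    t2, slot2, tip = later
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0 or slot2 <= slot1:
        raise ValueError("samples must be in order and show forward progress")
    slots_per_second = (slot2 - slot1) / elapsed
    return timedelta(seconds=(tip - slot2) / slots_per_second)


# Relay log line from this thread.
SAMPLE = (
    "[relay01:cardano.node.ChainDB:Info:5] [2024-11-24 21:10:16.54 UTC] "
    "Replayed block: slot 98128747 out of 140518402. Progress: 69.83%"
)

if __name__ == "__main__":
    _, slot, tip = parse_replay_line(SAMPLE)
    print(f"{progress_percent(slot, tip):.2f}%")  # 69.83%
```

Feeding `estimate_remaining` two lines taken a few minutes apart gives a ballpark figure, nothing more.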

Advice or guidance if I have to go back to the beginning (short of registration, etc)? I’m at a loss and hate letting down our delegators with this down-time.

jeremyisme · 24 November 2024 21:27

Have a look into bootstrapping via Mithril. That might save you some replay time. I haven’t tackled it myself but others say it works well for them. That might help get through the starting phase faster so you can troubleshoot other things.

Have you followed a guide for your setup?

krista · 24 November 2024 22:13

My teammates had asked about mithril but I haven’t had the chance to look into it. Funny - I was worried about risking downtime with it.

Hybrid Coincashew after starting with custom scripts.

jeremyisme · 25 November 2024 02:22

Let me know once the relay has finished its replay. I think you should get one of those back up and running first, then focus on the BP.

I run a Coincashew setup and haven’t had an issue like yours, but I think it might be a combination of config changes. There was a recent config.json change that required a min version, which might be causing a conflict.

I assume you have been running a 9.1 version without any trouble recently?

krista · 25 November 2024 21:51

Good advice, thanks. Replay under way - nearly there.

Yes - all was well under 9.1. I made that upgrade in Sept.

krista · 1 December 2024 14:34

Thanks team - got it. We’re back up - wanted to be sure before reporting success. In the end, I deleted the db and let it run. It hurt being down for so long, waiting and watching, hands off. Appreciate the help, thank you.
