Node problem following bad KES rotation, fixed but log: HandshakeError (VersionMismatch)

Hubris. After running our node for a couple of years now, I’ve managed to break something and would like to ask for some assistance - I have not been able to start the node. It appears to start but log entries indicate a problem and I’m rather stuck.

This followed a few days after a smooth update to 10.1.1. With no issues after the update, it was time to rotate the KES key (13 days before expiration). I can’t count the number of successful rotations I’ve done, and since the original node setup was based on the Coincashew guides, I decided to try the FRADA script, which is where I went astray - my fault, not the script.

I ran the script twice - had not realized an error in the first pass and then accidentally re-ran it. Ouch - it bumped the node counter which I reset and verified (cardano-cli conway query kes-period-info --mainnet --op-cert-file node.cert).
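For anyone re-checking a rotation by hand, the arithmetic behind the KES period is short. The sketch below is only an illustration: it assumes the mainnet Shelley genesis values slotsPerKESPeriod = 129600 and maxKESEvolutions = 62 (confirm them in your own genesis file), the function names are made up for this example, and the counter check encodes the rule as it is usually described (a rotated certificate's counter is one higher than the last counter seen on chain). The output of kes-period-info remains the authority.

```python
# Assumed mainnet Shelley genesis values; check your own genesis file.
SLOTS_PER_KES_PERIOD = 129_600
MAX_KES_EVOLUTIONS = 62


def kes_period_for_slot(slot: int) -> int:
    """KES period that contains the given absolute slot number."""
    return slot // SLOTS_PER_KES_PERIOD


def periods_remaining(start_period: int, current_slot: int) -> int:
    """KES periods left before a certificate issued at start_period expires."""
    expiry_period = start_period + MAX_KES_EVOLUTIONS
    return expiry_period - kes_period_for_slot(current_slot)


def expected_next_counter(on_chain_counter: int) -> int:
    """Hypothetical helper: counter to use for the next rotation (assumed rule)."""
    return on_chain_counter + 1


if __name__ == "__main__":
    slot = 140_728_502  # slot taken from the BP log in this thread
    print("current KES period:", kes_period_for_slot(slot))
```

Running the script twice is exactly the case the counter helper is meant to catch: the second pass produces a counter two higher than the on-chain value instead of one.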

The node remains “starting” and the log entries are not helpful. I can see there is a problem but do not understand the error; a log excerpt is below. Relays and BP are on 10.1.1. I’ve run through a series of checks but am now concerned I’ll make things worse.

Would appreciate any assistance or guidance that you could offer.

BP log:

[bone:cardano.node.ConnectionManager:Info:5048] [2024-11-22 16:59:52.68 UTC] TrConnectionHandler (ConnectionId {localAddress = 127.0.0.1:6000, remoteAddress = 127.0.0.1:52494}) (TrHandshakeServerError (HandshakeProtocolError (HandshakeError (VersionMismatch [NodeToNodeV_13,NodeToNodeV_14]))))

[bone:cardano.node.ConnectionManager:Info:5047] [2024-11-22 16:59:52.68 UTC] TrConnectionManagerCounters (ConnectionManagerCounters {fullDuplexConns = 0, duplexConns = 0, unidirectionalConns = 0, inboundConns = 0, outboundConns = 0})

[bone:cardano.node.LeadershipCheck:Info:4729] [2024-11-22 16:59:53.00 UTC] {"chainDensity":4.8154343e-2,"credentials":"Cardano","delegMapSize":1314018,"kind":"TraceStartLeadershipCheck","slot":140728502,"utxoSize":11160220}

[bone:cardano.node.Forge:Error:4729] [2024-11-22 16:59:53.02 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.40728502e8)]))]

[bone:cardano.node.LeadershipCheck:Info:4729] [2024-11-22 16:59:54.00 UTC] {"chainDensity":4.8154343e-2,"credentials":"Cardano","delegMapSize":1314018,"kind":"TraceStartLeadershipCheck","slot":140728503,"utxoSize":11160220}

[bone:cardano.node.Forge:Error:4729] [2024-11-22 16:59:54.03 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.40728503e8)]))]
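When reading a handshake error like the first line above, it can help to pull out who was talking to whom. The following is a rough sketch, not an official parser: the regular expressions are written against the one sample line in this thread and the function name is made up. It reports the two addresses, the versions listed in the message, and whether the remote end is a loopback address. In this excerpt the remote address is 127.0.0.1, which suggests (but does not prove) that the rejected connection came from a process on the BP host itself rather than from a relay.

```python
import ipaddress
import re
from typing import Optional

# Patterns fitted to the sample line from this thread; other node versions
# may format the trace differently.
ADDR_RE = re.compile(r"localAddress = (\S+?), remoteAddress = (\S+?)\}")
VERSIONS_RE = re.compile(r"VersionMismatch \[([^\]]*)\]")


def parse_version_mismatch(line: str) -> Optional[dict]:
    """Return addresses and listed versions, or None if the line does not match."""
    addr = ADDR_RE.search(line)
    vers = VERSIONS_RE.search(line)
    if addr is None or vers is None:
        return None
    local, remote = addr.group(1), addr.group(2)
    host = remote.rsplit(":", 1)[0].strip("[]")
    try:
        loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        loopback = False  # not a literal IP address
    versions = [v.strip() for v in vers.group(1).split(",") if v.strip()]
    return {
        "local": local,
        "remote": remote,
        "remote_is_loopback": loopback,
        "versions": versions,
    }


SAMPLE = (
    "[bone:cardano.node.ConnectionManager:Info:5048] [2024-11-22 16:59:52.68 UTC] "
    "TrConnectionHandler (ConnectionId {localAddress = 127.0.0.1:6000, "
    "remoteAddress = 127.0.0.1:52494}) (TrHandshakeServerError "
    "(HandshakeProtocolError (HandshakeError "
    "(VersionMismatch [NodeToNodeV_13,NodeToNodeV_14]))))"
)

if __name__ == "__main__":
    print(parse_version_mismatch(SAMPLE))
```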

My only suggestion is that there’s a version that’s different somewhere, but as you said, they’re all running 10.1.1. Did you update the config for all the nodes when you were upgrading?

After a few hours, the node switches to validating chunks

Upgraded from 9.1.0 to 10.1.1 - so the Conway changes were already in place for this upgrade and I made no changes to any of the config files. There was a compile-time change for HsOpenSSL since I’m on Ubuntu 22, and I’m pretty sure it was in place for each relay and the BP.

Wondering if I should try to downgrade back to 9.1, but again, making things worse or more convoluted is a risk, so I’d like to avoid that step if possible.

Ah, I think you just need to leave it to replay. There’s a replay for 10.1.1 that you’ll need to leave to finish. It took mine quite a few hours.

I also think 10.1.2 compiles without the ssl switch on 22 now.

Hello there :slight_smile:

What type of error did you encounter while running the script?

@jeremyisme Gave the replay another 15 hrs but it never started. It continued the cycle of errors above, then replay, then repeat, with each period lasting an hour or two. Then 2 hrs ago, I restored the BP back to 9.1. We’re off the schedule for this session so I have a little flexibility to get it running. However, 9.1 is not working either - the errors look similar. I’ve used the SPOT check information (Config File Check | Cardano SPOT Check) to validate the config files, but with 9.1 not working…

@kirael Thank you so much for the script - it is a great contribution and much appreciated. The error was mine - I ran 2.0, twice. I did not realize the initial kes period query failed so that null was carried forward.

At this point - organize cold keys and start from scratch, nuke it from orbit?

Current log output.

What does your gLiveView show? Is your node synced and just having trouble connecting to the other nodes? Or is it struggling to make connections?

Interesting - I’d been focused on the log but gLiveView no longer shows ‘starting’ but reflects ‘syncing’ with KES expired. The relays remain ‘starting’. No network changes or upgrades to the hosts but your point re: connection trouble seems on point.

2 mins after this screen cap, gLiveView rolled over - status is no longer ‘syncing’ but back to ‘starting’ - see bottom pic.

So the core is rolling back to “starting”, or just the relays?

Your KES key has expired too. Did you set the right StartingKESPeriod?

I did reset the StartingKESPeriod for the rotation, correction.

Core and relays are ‘starting’. In respective logs, each on their own host, they’re replaying the ledger. I restored the 9.1 binaries and restarted - I understand if I’d broken configs that the core might still have problems that I can’t yet solve but the relays haven’t been touched, aside from the binary restore (to my prev compiled version 9.1).

Nov 24 16:10:16 host01 cardano-node[201490]: [relay01:cardano.node.ChainDB:Info:5] [2024-11-24 21:10:16.54 UTC] Replayed block: slot 98128747 out of 140518402. Progress: 69.83%
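To keep an eye on a replay without staring at the log, the progress lines can be parsed directly. This is a small sketch under stated assumptions: the pattern is fitted to the single "Replayed block" line quoted above, the function names are invented for the example, and the time estimate assumes a constant replay rate between two samples, which is rough because replay speed varies across the chain.

```python
import re
from typing import Optional

# Pattern fitted to the "Replayed block" line quoted in this thread.
REPLAY_RE = re.compile(r"Replayed block: slot (\d+) out of (\d+)")


def replay_progress(line: str) -> Optional[float]:
    """Percentage of the replay completed, or None if the line does not match."""
    match = REPLAY_RE.search(line)
    if match is None:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    return 100.0 * done / total


def eta_seconds(slot_a: int, time_a: float, slot_b: int, time_b: float,
                target_slot: int) -> Optional[float]:
    """Rough seconds remaining, extrapolating linearly from two samples."""
    elapsed = time_b - time_a
    advanced = slot_b - slot_a
    if elapsed <= 0 or advanced <= 0:
        return None  # cannot estimate a rate from these samples
    rate = advanced / elapsed  # slots replayed per second
    return (target_slot - slot_b) / rate


SAMPLE = (
    "Nov 24 16:10:16 host01 cardano-node[201490]: [relay01:cardano.node.ChainDB:Info:5] "
    "[2024-11-24 21:10:16.54 UTC] Replayed block: slot 98128747 out of 140518402. "
    "Progress: 69.83%"
)

if __name__ == "__main__":
    print(f"{replay_progress(SAMPLE):.2f}%")
```

The computed percentage for the sample line agrees with the 69.83% that the node itself reports.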

Advice or guidance if I have to go back to the beginning (short of registration, etc)? I’m at a loss and hate letting down our delegators with this downtime.

Have a look into bootstrapping via Mithril. That might save you some replay time. I haven’t tackled it myself but others say it works well for them. That might help get through the starting phase faster so you can troubleshoot other things.

Have you followed a guide for your setup?

My teammates had asked about Mithril but I haven’t had the chance to look into it. Funny - I was worried about risking downtime with it.

Hybrid Coincashew after starting with custom scripts.

Let me know once the relay has finished its replay. I think you should get one of those back up and running, then focus on the BP.

I run a Coincashew setup and haven’t had an issue like yours, but I think it might be a combination of config changes. There was a recent config.json change that required a min version, and that might be causing a conflict.

I assume you have been running a 9.1 version without any trouble recently?

Good advice, thanks. Replay under way - nearly there.

Yes - all was well under 9.1. I made that upgrade in Sept.

Thanks team - got it. We’re back up - wanted to be sure before reporting success. In the end, I deleted the db and let it run. Hurt being down for so long waiting and watching, hands off. Appreciate the help, thank you.
