KES update/rollback tutorial for cntools install?


I upgraded a BP last week. Everything seemed to go well; gLiveView reported blocks were being produced. However, my cncli was out of date, so I proceeded to upgrade that. I used -c to upgrade the necessary files. Everything seemed to go fine.

For cntools I had to force the leaderlog to update, which I did. Now, my KES says it’s expired. Ok, so I went to look at the kes-period-info output and the end interval looked fine:

✓ Operational certificate's KES period is within the correct KES period interval
✓ The operational certificate counter agrees with the node protocol state counter
"qKesCurrentKesPeriod": 561,
"qKesEndKesInterval": 622,
"qKesKesKeyExpiry": null,
"qKesMaxKESEvolutions": 62,
"qKesNodeStateOperationalCertificateNumber": 12,
"qKesOnDiskOperationalCertificateNumber": 13,
"qKesRemainingSlotsInKesPeriod": 7887916,
"qKesSlotsPerKesPeriod": 129600,
"qKesStartKesInterval": 560

However, my on-disk and node-state cert numbers are now off. I was reading about this before and found a CoinCashew post on how to adjust the counter, but it didn’t work. Does anyone know of a different post/docs for Guild/cntools users? I’ll keep digging. Thanks.


Have you already minted a block after you refreshed your certificate? If not, then the numbers are correct. Once you have produced a block, the on-disk number will match the node number.
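For anyone hitting this later, that rule can be checked mechanically against the kes-period-info JSON. A minimal sketch using sed on the sample values from the output above (the /tmp/kes.json file and the extraction are illustrative, not part of cntools):

```shell
# Sample values copied from the kes-period-info output above.
cat > /tmp/kes.json <<'EOF'
{
  "qKesNodeStateOperationalCertificateNumber": 12,
  "qKesOnDiskOperationalCertificateNumber": 13
}
EOF
# Pull out the two counters.
node_n=$(sed -n 's/.*NodeStateOperationalCertificateNumber": \([0-9]*\).*/\1/p' /tmp/kes.json)
disk_n=$(sed -n 's/.*OnDiskOperationalCertificateNumber": \([0-9]*\).*/\1/p' /tmp/kes.json)
# on-disk = node-state + 1 is expected until the new cert mints a block.
if [ "$disk_n" -eq $((node_n + 1)) ]; then
  echo "counters consistent: new cert has not minted a block yet"
fi
```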


Ok, first check on → blocks (search for your pool and go to the blocks section) what OP number you used for your last blocks… then go to cntools → Pool → Rotate… it will ask for the counter number… enter the right value (one higher than the last one used for blocks) and press enter… then restart the node and wait for the node to start as a BP…
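If you ever need to set the counter by hand outside of cntools, cardano-cli has a node new-counter subcommand for this. A hedged sketch (the file names and the value 12 are assumptions; the command is only printed here, not executed):

```shell
# The next certificate's counter should be one higher than the counter of
# the last block actually minted on-chain (look it up on an explorer).
LAST_ONCHAIN=12
NEXT=$((LAST_ONCHAIN + 1))
# The command you would run (printed rather than executed in this sketch):
echo "cardano-cli node new-counter --cold-verification-key-file cold.vkey --counter-value $NEXT --operational-certificate-issue-counter-file cold.counter"
```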


I thought we did. But when I restarted the node - after the cncli update - it now only boots in relay mode.

It’s weird, we’ve never had these issues before and I’m just trying to figure out where the problem is.

It also looks like the cncli leaderlog output is counting down. I dunno if there’s some bug in there somewhere, but I have another terminal opened monitoring it and this is what it’s showing:

CNCLI still syncing [99.9144 %], checking again in 60s
CNCLI still syncing [99.9143 %], checking again in 60s
CNCLI still syncing [99.9143 %], checking again in 60s
CNCLI still syncing [99.9142 %], checking again in 60s
CNCLI still syncing [99.9141 %], checking again in 60s

So, the update is ticking down. Very weird. It’s been doing this since yesterday. I’ve been digging around the cncli and other github projects and nothing jumps out at me.

I did check the script again and verified that all the pool env vars are set (they have been since we stood the node up), but it looks like the node is not starting with the right switches for BP mode - the op cert and various signing-key files are not being picked up correctly.

I guess for now I’ll just dig in and start it manually (everything’s been service based since we launched the node and it’s always worked) and see what I find.
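One quick way to confirm which mode the node actually launched in is to look for the block-producer flags on the running process. A minimal sketch, assuming a Linux ps and the standard cardano-node run flags:

```shell
# A BP is started with --shelley-kes-key, --shelley-vrf-key and
# --shelley-operational-certificate; a relay's command line has none of them.
cmdline=$(ps -eo args | grep '[c]ardano-node' || true)
case "$cmdline" in
  *--shelley-operational-certificate*) mode="BP mode" ;;
  *)                                   mode="relay mode (no op cert flag found)" ;;
esac
echo "$mode"
```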


Ok, can you share the gLiveView output?
Also, does the output of journalctl -e -f -u cnode show you any errors? Check the messages carefully.


For some reason, the script wasn’t seeing the proper env vars for the pool and it was starting in relay mode. I just manually started it on the command line with the BP switches and now it’s back up in Core mode. However, it’s saying no blocks produced; when I took a screenshot of gLiveView at the beginning of the epoch, it reported we had 2 blocks produced.

Actually, let me back up on that. We had the Leader and Adopted for 2. Now both of those are zero. Sounds like we missed confirming them. :frowning:

The KES rotation tho now shows up correctly in gLiveView. I still don’t know why the script doesn’t see those vars - they haven’t changed since we set up the node. I checked cexplorer and our last block produced used op number 12. So, I think the key rotation is ok - gLiveView currently shows it expiring on 12/27/2022.

I suppose I now have to comb through the logs to try to find out what happened. I suspect it was something in the cncli leaderlog, but I’m not sure yet and, to be honest, that project is not something I’ve spent a lot of time digging into to figure out how it works in detail.

Anyway, first order of business is to figure out why it is not starting in BP mode, and then I’ll dig through the logs. I will post an update when/if I find out what the issue is. Thanks for the replies. :+1:


Nope, just reset the counter


Ok. Yah, like I said, that part of the system isn’t super familiar to me. :slightly_frowning_face: Thanks.


Oh, and, does this cncli message output look valid?

CNCLI still syncing [99.9120 %], checking again in 60s
CNCLI still syncing [99.9119 %], checking again in 60s
CNCLI still syncing [99.9118 %], checking again in 60s
CNCLI still syncing [99.9117 %], checking again in 60s
CNCLI still syncing [99.9117 %], checking again in 60s
CNCLI still syncing [99.9115 %], checking again in 60s
CNCLI still syncing [99.9114 %], checking again in 60s

I mean, the sync going down? When I left it last night, it was in the high 80% range. This morning, in the high 90’s - as you can see. I’m guessing it’s doing the right thing, but there is just a bug in how it’s displaying progress.
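For what it’s worth, a percentage like that can tick down without anything being wrong: if the chain tip advances faster than the local sync position during the interval, the ratio shrinks even though sync is progressing. A toy illustration with made-up slot numbers (not cncli’s actual internals):

```shell
# Both the local position and the tip advance between checks, but the tip
# advances faster, so position/tip drops. All numbers here are invented.
p1=$(awk 'BEGIN { printf "%.4f", 100 * 129000000 / 129110000 }')
p2=$(awk 'BEGIN { printf "%.4f", 100 * 129000050 / 129110200 }')
echo "check 1: $p1 %"
echo "check 2: $p2 %"
```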


Type cncli --version

cncli 5.1.2

That was the latest I put us on yesterday. Used and forced the cncli upgrade.


Do you have enough RAM?


Yah. We’ve had this BP in a 16GB machine for a while. Ever since that was the requirement.

From gLiveView:

│- NODE RESOURCE USAGE ------------------------------------------------│
│ CPU (sys)  : 3.78%     Mem (Live) : 4.5G      GC Minor   : 8143      │
│ Mem (RSS)  : 6.6G      Mem (Heap) : 6.6G      GC Major   : 19        │

It’s very possible that cncli is killing the node; it takes more RAM to calculate. You can check with top after you run the command.

You can set up a swap file in case you didn’t.
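For reference, adding or enlarging a swap file is only a few commands. A sketch assuming root access and /swapfile as the path (adjust the size and path to taste):

```shell
# create and enable a 16G swap file, then make it persist across reboots
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```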


Yah, we set up swap on all our instances, so that’s covered. At this point, I guess I’ll just have to wait until cncli says it’s done syncing. Not sure when that will be tho since it’s counting down, but I’ll assume it’s actually counting up :slight_smile: and will wait for it to complete.

And I’ll continue to dig, of course, to see more what cncli does to the other parts of the node. Thanks again for your help.


The cncli leaderlog command is still counting down :slight_smile: but the node looks like it’s back to the counts we had before.


Just wanted to drop that update… I’m still spelunking.


Alright, so no blocks for epoch 365. :slightly_frowning_face: I’m trying to figure out why but I’m guessing the KES didn’t rotate correctly.

366 has us with 2 blocks, but they’re not confirmed. Cncli is still doing its thing, so I’m not sure what’s going on with that. I still don’t fully know why it’s counting backwards. I put a message out to the cntools Telegram channel since it looked like some other people were having issues too.

For now, this is what gLiveView says:


At this point, I’m just reading and trying to figure out where the disconnect is. Hopefully we get a confirmed block or two.

Hi all, just want to make sure about KES expiry date

When I use command
cardano-cli query kes-period-info --mainnet --op-cert-file op.cert

it shows this

✓ Operational certificate’s KES period is within the correct KES period interval
✓ The operational certificate counter agrees with the node protocol state counter

    "qKesCurrentKesPeriod": 563,
    "qKesEndKesInterval": 625,
    "qKesKesKeyExpiry": null,      <----should this be null?
    "qKesMaxKESEvolutions": 62,
    "qKesNodeStateOperationalCertificateNumber": 6,
    "qKesOnDiskOperationalCertificateNumber": 7,
    "qKesRemainingSlotsInKesPeriod": 7930409,
    "qKesSlotsPerKesPeriod": 129600,
    "qKesStartKesInterval": 563

My last block was minted with operational certificate number 6 and the on-disk number is 7, so it should be fine, but I’m not sure why the expiry date shows ‘null’? It shows correctly in gLiveView / cntools.
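On the interval numbers themselves: they are internally consistent, since the end interval is just the start interval plus the maximum KES evolutions, and on mainnet one slot is one second, so the remaining-slots figure translates directly into time left. A quick sanity check on the values above:

```shell
# Values copied from the kes-period-info output above; 86400 seconds per day.
START=563
MAX_EVOLUTIONS=62
END=$((START + MAX_EVOLUTIONS))   # should match qKesEndKesInterval
REMAINING=7930409                 # qKesRemainingSlotsInKesPeriod
DAYS=$((REMAINING / 86400))       # ~1 slot per second on mainnet
echo "end interval: $END, roughly $DAYS days of KES validity left"
```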

The issue I was having, it seems, was with cntools.

First problem was we were out of date. So, I had to rebuild everything to the latest just to get gLiveView and other utils working again.

Next problem I had was the leaderlog was not updating. I kept seeing countdowns go negative after I wiped out the sqlite db. I had to do a lot of digging and (of course) it was something simple.

We have all our nodes on 16GB instances. I’m not sure why we set it up this way, but all our swaps were 4GB. When running all the services, we’d eventually crash the node. I upped the swap on our BP and relays to 16GB. (I don’t get why we need this much memory, but that’s for another day of research.) After doing this and rebooting, everything seemed happier. After we minted a block, our KES issue resolved itself and the cert numbers are now in line. I’ve verified this as well on

I’ve noticed that swap usage will dip into almost 8GB, and while I don’t like using swap at all, I’m just glad everything’s stable again. We’re producing blocks again and everything looks good now. I also applied the settings to my env found here:

Good luck and I hope this might help someone else. To be clear, in our case it was: make sure all cntools utils are up to date, and make sure you have 16GB RAM (min) and 16GB swap.

Hello. I am not sure about the cntools options.
I had no blocks in the past period (3 months) and now I have rotated the keys. In cntools it is presented as follows. Is that right?
When I use the rotate keys option, it has never asked me anything.