Renew pool op.cert before KESPeriod expires

The current ff testnet Genesis defines

"epochLength": 21600,
"slotsPerKESPeriod": 3600,
"slotLength": 1,
"maxKESEvolutions": 120,

This means an epoch is 21600 slots long.
As slotLength (the slot duration) is 1 second, one epoch lasts 6 hours and we have 4 epochs every day.

slotsPerKESPeriod is set to 3600, which means one KES period lasts 1 hour and there are 6 KES periods in every epoch.

Finally, maxKESEvolutions = 120 means that 120 hours (5 days) after its starting KES period, the pool's op.cert must be renewed.
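
The arithmetic above can be checked with a few lines of shell. This is only a sketch: the four values are hard-coded from the genesis excerpt, and the derived variable names are made up for this example, not something cardano-cli provides.

```shell
#!/bin/sh
# Values copied from the genesis excerpt above (hard-coded for illustration).
epochLength=21600
slotsPerKESPeriod=3600
slotLength=1
maxKESEvolutions=120

# Derived values; the variable names are hypothetical, chosen for this example.
epochHours=$(( epochLength * slotLength / 3600 ))
epochsPerDay=$(( 24 / epochHours ))
kesPeriodsPerEpoch=$(( epochLength / slotsPerKESPeriod ))
kesPeriodSeconds=$(( slotsPerKESPeriod * slotLength ))
opcertLifetimeHours=$(( maxKESEvolutions * kesPeriodSeconds / 3600 ))
opcertLifetimeDays=$(( opcertLifetimeHours / 24 ))

echo "Epoch length:          ${epochHours} h (${epochsPerDay} epochs per day)"
echo "KES periods per epoch: ${kesPeriodsPerEpoch} (${kesPeriodSeconds} s each)"
echo "op.cert lifetime:      ${opcertLifetimeHours} h (${opcertLifetimeDays} days)"
```

If the genesis parameters change, only the four values at the top need to be updated.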

So, as one KES period lasts 1 hour, we count the hours since genesis:

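UTCoffset=0   # hours to subtract in case your server does not run in UTC
startTime=$(jq -r .startTime files/genesis.json)
# whole hours since genesis = current KES period (3600 slots of 1 second each)
KESperiods=$(( ($(date +%s) - $(date +%s --date "$startTime")) / 3600 ))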

Then we generate a new op.cert and set --kes-period to one period (one hour) before the current KES period (hours since genesis):

cardano-cli shelley node issue-op-cert \
	--hot-kes-verification-key-file $kesvkey \
	--cold-signing-key-file $coldskey \
	--operational-certificate-issue-counter $coldcounter \
	--kes-period $((${KESperiods}-${UTCoffset}-1)) \
	--out-file $opcert

Now we put this in front of the node startup script, so that every node start generates a fresh op.cert valid for 5 more days.

Note: this is not a permanent solution. First, because for security reasons the cold keys should not be accessible on the node like this. Second, because we expect to soon be able to inject new op.certs through CLI commands without restarting the node.


^^^ Edited the original post to optionally also apply a timezone offset, in case your server does not run in the UTC timezone.


I would not advise this as a solution. The real reason we set such a short KES period is for us to build practices and tooling to manage the cold keys.

Get a protected USB, get a cold laptop/computer and build some healthy habits :wink:


Right.
But the first step (with no clear documentation available) is to figure out how to find and set the right KES period.
Then the clear next step is to develop a remote request that receives the offline-calculated new certificate and restarts the node.


Posted on TG, but thought it would be ok to post it here too:

Ok, I got my head around the KES Evolution process.

They are using MMM, a forward-secure, tree-like digital signature scheme. The evolving key signatures are only valid for a limited number of time periods, currently set to 120 in fnf.

The time period is based on slots, and it is set to 3600 slots.

It starts with 0 (I think) and increases by one in each time/KES period, i.e. every 3600 slots.
But some log message to check this would be nice.

The verifying key is constant during these 120 time periods, but the signing key evolves incrementally in each (n-th) time/KES period from the initial KES signing key, which is loaded when the node is started, together with the operational certificate that contains the start kesPeriod.

So, how it works: when we create an ops cert that contains the KES verifying key and we specify the kesPeriod (for example 67, from echo $(( 241587 / 3600 ))), it simply means that the operational certificate is valid only from this KES period until period 67+120-1 = 186.

And if we restart the node at any time between those kesPeriods (67..186), that won't be an issue: we have the original KES signing key and the ops certificate that contains the start kesPeriod (67), so we can evolve the n-th signing key from the original one using the difference between the current kesPeriod and the start one. E.g. n = 157 - 67 = 90th.
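
The numbers in this example can be reproduced with a short shell sketch. The variable names are my own and the values are the ones from the post; it only illustrates the arithmetic of the validity window as described here, not how the node itself checks a certificate.

```shell
#!/bin/sh
# Example values taken from the post above.
slotNo=241587             # slot at which the op cert is issued
slotsPerKESPeriod=3600
maxKESEvolutions=120
currentKESPeriod=157      # period in which the node is (re)started

# First and last KES period covered by the op cert.
startKESPeriod=$(( slotNo / slotsPerKESPeriod ))
lastKESPeriod=$(( startKESPeriod + maxKESEvolutions - 1 ))

# How many times the initial KES signing key must be evolved.
evolution=$(( currentKESPeriod - startKESPeriod ))

if [ "$currentKESPeriod" -ge "$startKESPeriod" ] && [ "$currentKESPeriod" -le "$lastKESPeriod" ]; then
  status="valid"
else
  status="not valid"
fi

echo "op cert window: ${startKESPeriod}..${lastKESPeriod}"
echo "period ${currentKESPeriod}: ${status}, evolution n = ${evolution}"
```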


Also, it does not matter if I create a new KES key pair and an opcert signed by the cold key for the 172nd kesPeriod: I can restart the node (or submit the opcert and KES signing key online through an API) any time between that 172nd period and the previous certificate's last period, the 186th. The node will happily sign blocks with the new KES keys, and the old ones will be revoked, as the cold counter is bigger in the new cert than in the old one.

However, you could try to trick the system by using the same counter. I am not sure how it would behave, but if the protocol recognises it as a protocol violation, then that pool should be punished very hard.


We should establish best practices for cold key generation, handling and storage.

I made a little script for the automatic KES period calculation and opcert issuance. It works together with the KES generator script…

#!/bin/bash

#load variables from common.sh
#       socket          Path to the node.socket (also exports socket to CARDANO_NODE_SOCKET_PATH)
#       genesisfile     Path to the genesis.json
#       magicparam      TestnetMagic parameter
source "$(dirname "$0")"/00_common.sh

if [[ ! $1 == "" ]]; then addrName=$1; else echo "ERROR - Usage: $0 <name>"; exit 2; fi

#grab the next issue number from the counter file
nextKESnumber=$(cat ${addrName}.node.counter | awk 'match($0,/Next certificate issue number: [0-9]+/) {print substr($0, RSTART+31,RLENGTH-31)}')
nextKESnumber=$(printf "%03d" ${nextKESnumber})  #to get a nice 3 digit output

#grab the latest generated KES number
latestKESnumber=$(cat ${addrName}.kes.counter)

if [[ ! "${nextKESnumber}" == "${latestKESnumber}" ]]; then echo -e "\e[0mERROR - Please generate new KES Keys first ...\e[0m"; exit 2; fi

echo
echo -e "\e[0mIssue a new Node operational certificate using KES-vKey \e[32m${addrName}.kes-${latestKESnumber}.vkey\e[0m and Cold-sKey \e[32m${addrName}.node.skey\e[0m:"
echo

#calculating current KES period
startTimeGenesis=$(cat ${genesisfile} | jq -r .startTime)
startTimeSec=$(date --date=${startTimeGenesis} +%s)	#in seconds (UTC)
currentTimeSec=$(date -u +%s)				#in seconds (UTC)
slotsPerKESPeriod=$(cat ${genesisfile} | jq -r .slotsPerKESPeriod)
slotLength=$(cat ${genesisfile} | jq -r .slotLength)
currentKESperiod=$(( (${currentTimeSec}-${startTimeSec}) / (${slotsPerKESPeriod}*${slotLength}) ))  #integer division: rounds down to the period we are currently in

#Calculating Expire KES Period and Date/Time
maxKESEvolutions=$(cat ${genesisfile} | jq -r .maxKESEvolutions)
expiresKESperiod=$(( ${currentKESperiod} + ${maxKESEvolutions} ))
expireTimeSec=$(( ${startTimeSec} + ( ${slotLength} * ${expiresKESperiod} * ${slotsPerKESPeriod} ) ))
expireDate=$(date --date=@${expireTimeSec})

echo -e "Latest KES-Keys and OpCert with KES-Counter#: ${latestKESnumber}\nValid starting with KES-Period: ${currentKESperiod}\nExpire after KES-Period: ${expiresKESperiod}\nExpire after Date: ${expireDate}\n" > ${addrName}.kes.expire

echo -e "\e[0mCurrent KES period:\e[32m ${currentKESperiod}\e[90m"
echo

${cardanocli} shelley node issue-op-cert --hot-kes-verification-key-file ${addrName}.kes-${latestKESnumber}.vkey --cold-signing-key-file ${addrName}.node.skey --operational-certificate-issue-counter ${addrName}.node.counter --kes-period ${currentKESperiod} --out-file ${addrName}.node-${latestKESnumber}.opcert

echo
echo -e "\e[0mNode operational certificate:\e[32m ${addrName}.node-${latestKESnumber}.opcert \e[90m"
cat ${addrName}.node-${latestKESnumber}.opcert
echo

echo
echo -e "\e[0mUpdated Operational Certificate Issue Counter:\e[32m ${addrName}.node.counter \e[90m"
cat ${addrName}.node.counter
echo

echo
echo -e "\e[0mUpdated Expire date file:\e[32m ${addrName}.kes.expire \e[90m"
cat ${addrName}.kes.expire
echo


echo -e "\e[0mNew \e[32m${addrName}.kes-${latestKESnumber}.skey\e[0m and \e[32m${addrName}.node-${latestKESnumber}.opcert\e[0m files ready for upload to the server."
echo

Looking at the chain density, it looks like a lot of pools did not renew their KES keys.

Everyone, please make sure your KES operational certificate is renewed and loaded, and that your node is minting blocks!
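
One way to keep an eye on this is to compute how many KES periods an opcert still has left. The sketch below is an assumption-laden example: kes_periods_left is a hypothetical helper, not part of cardano-cli, and it assumes you noted the --kes-period value used when the opcert was issued. The current time is passed in as an argument so the calculation is easy to test.

```shell
#!/bin/sh
# kes_periods_left GENESIS_START_SEC NOW_SEC SLOTS_PER_KES_PERIOD SLOT_LENGTH MAX_KES_EVOLUTIONS OPCERT_START_PERIOD
# Prints the number of KES periods (including the current one) for which the
# opcert can still be used; 0 or less means it has expired.
kes_periods_left() {
  current=$(( ($2 - $1) / ($3 * $4) ))
  echo $(( $6 + $5 - current ))
}

# Illustration with made-up timestamps: genesis at t=1000000 s, and "now"
# falling inside KES period 157, for an opcert issued with --kes-period 67.
genesisStart=1000000
now=$(( genesisStart + 157 * 3600 + 10 ))
left=$(kes_periods_left "$genesisStart" "$now" 3600 1 120 67)

if [ "$left" -le 0 ]; then
  echo "opcert expired, renew it now"
else
  echo "opcert usable for ${left} more KES period(s)"
fi
```

On a real system, the first two arguments would come from the genesis startTime and from date -u +%s, in the same way as the script above calculates startTimeSec and currentTimeSec.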


Remark/question

What was unclear to me: do we have to generate a new KES key pair when we are about to rotate the validity period of the previous KES key?

As it turned out, you can keep the same KES key, provided you change the starting period. It works, and after some thought, it makes sense. What matters is the new --kes-period and that node.skey signs it.

But security-wise, is there a valid reason not to do it like that? We discussed it a bit with Marek and I am not sure how to proceed. If KES.skey is compromised, it means your server has been breached somehow anyway. So the incident might occur again and again …

Any blind spot in my reasoning?
Cheers

It depends on your use case. I can imagine an enterprise-grade stake pool that is just a preformatted black box with the keys on it and no physical access, where the only thing not on the read-only file system is the opcert, on a USB key that is picked up on boot. Just an example.

Because if you think about it, using the same channel to transfer your KES keys as well as your opcert is just another angle of attack.

But in most cases, I guess it is safe to generate new KES keys too. Again, depends on individual use case.

You do not have to.

But you can fire up a new standby node in seconds/hours with a new node cold counter, and the other, compromised pool is demolished as soon as your new one starts. Also, you would just lose some blocks and therefore some rewards.

Also, opcert signing needs cold.skey, therefore the cert must be transferred anyway. In that case I would be much happier with new KES keys than with reusing the old ones.

The cold key is all about preventing theft of the pool's identity.
