[CRAB] botched KES key rotation incident report

I’m sorry to report that there has been an incident with the CRAB Cardano stake pool: when rotating the KES key a few days ago, I did it incorrectly, meaning that 4 blocks from the current Epoch 221 were missed. Here follows an incident report, explaining the context, what happened, how it was discovered, the effects, and the remediations put in place to decrease the chance of this happening in the future. I’m publishing this information both in the interests of transparency to members of the CRAB stake pool, and to help other Cardano stake pool operators to avoid making similar mistakes.

Thanks @tiredpixel for your documentation of the experience & benevolently coming forward for the benefit of the community :heart_eyes:

Thank you for your kind words. :slight_smile:

It’s great that we can come here to help one another as a community of SPOs (and aspiring SPOs) as well as help the Cardano network. :slight_smile:

Great post - very informative!
Hopefully software will improve with time, and this will not be as significant an issue in the future. I’m hoping those at IOG realize the value that SPOs contribute to the network and will make an effort on this front to improve the safety of stake pool operations. Then, of course, once voting comes “online,” maybe we can vote to put a lot of financial resources into making a safer environment for SPOs via software dev. I still don’t know exactly the mechanics of voting and how funds will be disbursed, but that’s one avenue SPOs can likely pursue to improve conditions for themselves.

Yes, it’s very complicated and things break down easily. I spent about 1.5 days trying to ssh over Wi-Fi at home between two Linux machines. The ssh daemons often crash, or become disabled…could be a safety feature built in…I don’t know…configuring ports in ssh_config and sshd_config was a pain, I had to configure /etc/hosts as well, and set up both ssh clients and daemons on both machines to be safe…then you have to learn that CLI ssh options take precedence over local user settings in the ~/.ssh/config file, which in turn take precedence over global settings in /etc/ssh. Then there’s the issue of setting up ufw on both so as not to mess up communication.
What’s odd is that my relay says the sshd service doesn’t exist: when I run the command "service ssh status | grep act", it says it doesn’t exist (it doesn’t even say it’s inactive), yet there is the service, listening on the correct port and still allowing me to make connections to it. So complicated…but I did finally have success today…got my ticker up and running on testnet…It can be very frustrating to run a stake pool, but as the cliche goes, if you have to go on a thousand-mile journey, don’t get distracted by counting the number of steps taken…just go with the flow…it’s a new learning experience, and if we come out of this OK, and Cardano succeeds, we will have already gained so much experience that we’ll be at a huge advantage over those who want to join much later.
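
For anyone else tripping over the ssh precedence rule mentioned above, here is a minimal Python sketch of the idea: the first source that sets an option wins, checked in the order command line, user config, system config. It is a simplification (real OpenSSH also handles Host blocks and case-insensitive keywords), and the function name `resolve_option` and all the sample settings are made up for illustration.

```python
# Sketch of OpenSSH-style option resolution: the first source that sets an
# option wins, checked in the order command line, user config, system config.
# Simplified: no Host/Match blocks, no case-insensitive keyword handling.

def resolve_option(name, cli_options, user_config, system_config, default=None):
    """Return (value, source) for the effective setting of `name`."""
    sources = [
        ("command line", cli_options),
        ("~/.ssh/config", user_config),
        ("/etc/ssh/ssh_config", system_config),
    ]
    for label, options in sources:
        if name in options:
            return options[name], label
    return default, "built-in default"


# Hypothetical settings for a relay host, for illustration only.
cli = {"Port": "2222"}
user = {"Port": "2200", "User": "spo"}
system = {"Port": "22", "User": "root", "Compression": "no"}

port = resolve_option("Port", cli, user, system)
login = resolve_option("User", cli, user, system)
compression = resolve_option("Compression", cli, user, system)
```

On a real machine, `ssh -G <host>` should print the configuration the client would actually use for that host, which is probably the quicker way to check which setting won.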

Thanks! :slight_smile: Although I’m a little confused—what was the relevance of voting? Or do you mean that we’ll be able to vote on things like leadership schedules—or failing that, at least errors-on-boot if keys and certificates don’t match, etc., or repeated errors if a major fault in the block-producing node is discovered (owing to my mistake with the KES key rotation, for example)? :slight_smile:

I haven’t looked at what areas voting covers, but I’m hoping that through it we can contribute funds to get more dev support for creating a safer environment for SPOs and an easier interface (making the software more user-friendly) for SPOs interacting with the Cardano blockchain and ledger.

Yes, so even the smallest errors throw up red flags for SPOs. Maybe there is a way to run a stake pool so that only errors are printed, and you get messaged about them immediately…I don’t think it would be that difficult to add that kind of support.
Have a complete list of errors and a corresponding list of possible causes for each error from most likely to least, so you minimize the amount of time hunting down the error.

I see—thanks for the clarification. :slight_smile: I think it’s actually already possible to do what you’re suggesting with regards to errors, by changing the log level and deviating from the default config. I admit I haven’t done so so far, though (although I was definitely thinking about it when conducting the post-mortem on this incident…). But definitely agree that it would be better if even the smallest errors threw up red flags for SPOs; my expectation from most other software would be to raise such a red flag immediately on boot if such a mismatch were detected (I think it could be in this case? but I haven’t dug into it deeply)—or at the very least to go into a crash-reboot loop if such a critical failure were detected (although that’s admittedly more Erlangish or even disposable-Docker-container in philosophy :slight_smile: ).
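
As a rough illustration of the "only errors, and get told about them" idea, here is a minimal Python sketch that filters JSON log lines by severity. It assumes the node is writing JSON-formatted logs with the severity in a `sev` field and that the severity names below match; both are assumptions to check against your own log output. The sample lines are made up, and actually sending a notification is left out.

```python
import json

# Severity names in increasing order; assumed to match the node's log output.
SEVERITIES = ["Debug", "Info", "Notice", "Warning",
              "Error", "Critical", "Alert", "Emergency"]


def is_at_least(severity, threshold="Error"):
    """True if `severity` ranks at or above `threshold`; unknown names never match."""
    if severity not in SEVERITIES:
        return False
    return SEVERITIES.index(severity) >= SEVERITIES.index(threshold)


def filter_errors(lines, threshold="Error"):
    """Yield parsed JSON log entries whose severity is at or above `threshold`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(entry, dict) and is_at_least(entry.get("sev"), threshold):
            yield entry


# Made-up log lines for illustration; real messages will differ.
sample = [
    '{"sev": "Info", "msg": "chain extended"}',
    '{"sev": "Error", "msg": "example failure"}',
    'not json at all',
    '{"sev": "Critical", "msg": "example critical failure"}',
]

alerts = [entry["msg"] for entry in filter_errors(sample)]
```

In practice the `sample` list would be replaced by lines read from the node's log file or journal, and each yielded entry passed to whatever alerting you already use.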

This is such an important issue not only to SPOs, but to delegators as well, and ultimately for the success of the Cardano blockchain. Delegators and SPOs are not going to stay with Cardano long if producing blocks is such an arduous task. I remember mining cryptos…it was a much easier turn-key solution. A staking pool is a whole new level of complexity that I didn’t expect, but I guess that’s the trade-off for all the security guarantees and features Cardano is trying to implement. But then again, mining has had a significant head start over staking. Unfortunately, even the first staking coin, Peercoin, went downhill. I didn’t expect that…I thought it would succeed, not as much as Bitcoin, but at least succeed as the first of its kind. Given as much time as BTC has had, maybe SPO difficulties will become a thing of the past for Cardano.

Yes. But of course, there are other staking coins, too, as you rightly observe. And even if all these (and so many other) issues are resolved after the same time as Bitcoin has had, where will those SPOs and similar who have engaged deeply with the cryptocurrency be? Will they still be donating time, money, and skills to such, patiently waiting and hoping, and gradually helping to bring it to fruition? Or will they consider things like their short- or even medium-term ROI (not just ROA stake, and not just money and hosting costs, but also the huge amount of time donated), and decide to call it a day? I was thinking about this issue a few days ago; it seems to me that on the current trajectory, many of the experienced SPOs may leave (or at least, those with transferable skills, learning on the ‘job’, who have proven their knowledge in related technologies, just not yet with Cardano), leaving only the very lucky or extremely well-funded pools remaining, whilst there’s still plenty of excitement about Cardano in general to keep new stake pools joining, but without the battle-tested knowledge earned by those who have been operating the software and contributing long enough. :expressionless:

Agreed, only the biggest might remain, and we’ll have a situation analogous to BTC mining pools. But anyway, I’m sticking with Cardano because it’s the most advanced crypto foundationally, having taken 5 years of scientific research to prove its feasibility on paper. I don’t think there will be another coin like it for a long time, at least in the US, because US legislators are thinking of introducing a new law allowing companies up to 3 years to produce a working version of a public blockchain. Cardano had 5, so it’s very unlikely that future blockchains will achieve similar results and give Cardano a run for its money if the typical sufficiently uninformed and corrupt politicians get their way with further restrictions on new cryptos. Not financial advice - just my opinion.

Very interesting. I didn’t know that about the US legislation. Here in Europe, things are generally more balanced at present—although of course that could change very quickly, and many aspects of cryptocurrencies in general remain without specific regulation (even with regards to which accounting model to use when trading cryptocurrency assets).

That reminds me of an article I read earlier today…I thought London was pro-crypto, being one of the main financial hubs of the world, but then this article came out today: https://www.reuters.com/article/britain-cryptoassets-ban/uk-watchdog-bans-bitcoin-based-products-for-retail-investors-idUSL8N2GX2SU

Last year, the US gov (Congress) introduced 32 crypto-related bills, 78% of which focus on negative aspects of cryptos; i.e., bills which place further and further restrictions on cryptos.


Full article: https://www.forbes.com/sites/jasonbrett/2020/04/28/congress-has-introduced-32-crypto-and-blockchain-bills-for-consideration-in-2019-2020/#70c0fbcf1d61

NONE OF THESE BILLS FOCUS ON SUPPORTING PUBLIC CRYPTOS!
That basically tells you they only care about the US dollar global financial hegemony and maintaining it via blockchain in a centralized legacy fashion - they are going to fall flat on their faces, and the dollar will lose its standing if the US keeps putting up roadblocks to public cryptos that facilitate the transition of the legacy system to a modern, cryptographic one…if they don’t adopt this tech, some other major country will run away with this tech and use it the way it was intended to be used.

Thanks! Posting your mistake will help other SPOs avoid making the same one.

Can you verify exactly what needs to be rotated before KES expires? I believe all that needs to be done is generate a new KES key pair (kes.vkey & kes.skey), then use those to make a new node.cert. The vrf.skey can remain untouched on the node; only kes.skey and node.cert need replacing.

You are correct. After a new KES key pair is generated, and that in turn used to create a new node operational certificate, the only files requiring updating on the node itself are kes.skey and node.cert.

It indeed isn’t necessary to update vrf.skey or regenerate that (the regeneration for that is even more complex, from what I understand, and requires notice periods of some epochs within the network; it can’t simply be replaced without losing blocks).
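
Since a new operational certificate is issued against a starting KES period, it may help to see the arithmetic for working out the current period and how long a certificate has left. This is only a sketch in Python: the two constants are what I understand the mainnet Shelley genesis values (`slotsPerKESPeriod`, `maxKESEvolutions`) to be, so please check them against your own genesis file, and the slot and period numbers in the example are made up.

```python
# Assumed mainnet Shelley genesis values; verify against your own genesis file.
SLOTS_PER_KES_PERIOD = 129600
MAX_KES_EVOLUTIONS = 62

# Per the discussion above, only these two files change on the node after a
# rotation; vrf.skey stays as it is.
FILES_TO_REPLACE_ON_NODE = ("kes.skey", "node.cert")


def current_kes_period(tip_slot, slots_per_kes_period=SLOTS_PER_KES_PERIOD):
    """KES period that contains the given absolute slot number."""
    return tip_slot // slots_per_kes_period


def kes_periods_remaining(cert_start_period, tip_slot,
                          slots_per_kes_period=SLOTS_PER_KES_PERIOD,
                          max_evolutions=MAX_KES_EVOLUTIONS):
    """Periods left (counting the current one) before the KES key expires."""
    expiry_period = cert_start_period + max_evolutions
    return expiry_period - current_kes_period(tip_slot, slots_per_kes_period)


# Made-up numbers: chain tip at slot 10,000,000, certificate issued at period 70.
period_now = current_kes_period(10_000_000)
remaining = kes_periods_remaining(70, 10_000_000)
```

The tip slot would normally come from querying the node, and a result of zero or less from `kes_periods_remaining` would mean the key has already expired and rotation is overdue.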

That’s power. :clap: