CIP-135 Disaster Recovery Plan

Kevin_Hammond · 18 September 2024 16:33

We have proposed an updated disaster recovery plan as CIP-135. The pull request is linked
below. Comments are welcome, of course.

https://github.com/cardano-foundation/CIPs/pull/893

Neo_Spank · 19 September 2024 12:34

Hello @Kevin_Hammond

“Identify a small group of block producing nodes that will be used to recover the chain.”
Any chance this can be automated or randomly picked by the protocol?
Maybe use propagation speed plus stake as a qualifier?

Kevin_Hammond · 20 September 2024 20:02

Good question. The group will need to have sufficient block production assigned over a certain period of time, and operators will need to cooperate to deal with the recovery at short notice (perhaps over non-standard network connections), so I can’t see that this can be automated, unfortunately.

What I’d anticipate would be a pool of SPOs who would answer an emergency callout, and who would then work together to recover the chain. One criterion might be geographical proximity/connectivity.

Neo_Spank · 20 September 2024 21:44

What about a continuous “on-call” list?

Block-producing SPOs that are willing to share contact info and have the technical skills can be placed in a group, from which a percentage is selected by the protocol as the on-call disaster response team each epoch.

This way it is never the same set of SPOs, and selection can be partially randomized. It also makes for faster recovery, since the first step is already taken, and if someone on the on-call team is unreachable there is a list of available SPOs that can step in.

Let’s say that out of 1000 block-producing SPOs, 100 sign up for on-call. The protocol assigns each a number from 1 to 100 each epoch. The on-call team for each epoch is the group numbered 1 to 20. If some SPOs don’t respond within a set time, the next on the list (number 21) takes their spot.
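A minimal sketch of that rotation, done off-chain for illustration. Everything here is an assumption rather than part of CIP-135: the function names, the pool IDs, and the `seed_material` string (standing in for some publicly verifiable per-epoch value) are all hypothetical.

```python
import hashlib
import random


def epoch_roster(volunteers, epoch, seed_material="example-public-seed"):
    """Return the volunteers in a pseudo-random order that is fixed per epoch.

    seed_material is a hypothetical stand-in for a public per-epoch value;
    anyone who knows it and the epoch number can reproduce the same order.
    """
    digest = hashlib.sha256(f"{seed_material}:{epoch}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    roster = sorted(volunteers)  # canonical starting order before shuffling
    rng.shuffle(roster)
    return roster


def on_call_team(roster, team_size, unreachable=frozenset()):
    """Take the first team_size reachable pools from the roster.

    Pools in `unreachable` are skipped, so the next pool on the list
    (e.g. number 21) moves up to fill the gap.
    """
    team = []
    for pool in roster:
        if pool in unreachable:
            continue
        team.append(pool)
        if len(team) == team_size:
            break
    return team


# Example: 100 volunteers, 20 on call, the first-ranked pool does not respond.
volunteers = [f"POOL{i:03d}" for i in range(1, 101)]
roster = epoch_roster(volunteers, epoch=510)
team = on_call_team(roster, team_size=20, unreachable={roster[0]})
```

Because the order is derived from a hash of public inputs, every participant can recompute the roster independently and check who is on call, without anyone having to publish or maintain the list by hand.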

At least this way we automate and add transparency, instead of hoping SPOs will just “find” each other when it happens. It is not fully automated, but I feel we should try to automate as much as we can.

Kevin_Hammond · 22 September 2024 19:58

Yes, an on-call list would be a sensible way to proceed. The comms channels really need to be set up in advance (e.g. via a Discord channel). If there is an issue that triggers Disaster Recovery, everyone will need to act quickly and in a coordinated way.

I expect you don’t really need the Cardano protocol to select responders (and that would also need some development). If you wanted randomisation of call-outs, there are alerting systems that might be able to do that (e.g. PagerDuty).

In practice, Disaster Recovery should be an extremely rare event, of course.

Neo_Spank · 22 September 2024 21:48

Definitely not a “need”. It may actually be way more efficient to have something like this 100% off-chain.

However, having it on-chain (even partially) will be way more transparent and inclusive.
If just ‘some group’ is selected, people will feel it’s done and will not check, participate or disseminate knowledge, as we’ve seen with so many off-chain projects.

If it’s on-chain with clear standards, we can have many SPOs aim to be part of it. We could run testnet scenarios that SPOs must pass to qualify, thus keeping many engaged and continuously learning.

We can have CF delegate some ₳ to all listed pools, and to stay on the list you have to successfully pass a testnet recovery scenario at least once every 36 epochs (as an example).
In every epoch where the protocol selects an SPO to be on-call, CF can delegate double the ₳ amount, as an incentive to keep up to date with new scenarios.
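As a rough sketch of how those two rules could be checked, under assumptions of my own: the 36-epoch window is the example figure from above, while `BASE_DELEGATION_ADA`, the `ListedPool` record, and reading “at least once every 36 epochs” as “last pass no more than 36 epochs ago” are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

MAX_EPOCHS_SINCE_DRILL = 36      # example window taken from the post
BASE_DELEGATION_ADA = 1_000_000  # hypothetical amount, for illustration only


@dataclass
class ListedPool:
    pool_id: str
    last_passed_drill_epoch: int  # epoch of the last passed testnet scenario


def is_eligible(pool, current_epoch):
    """A pool stays listed if its last passed drill is recent enough."""
    return current_epoch - pool.last_passed_drill_epoch <= MAX_EPOCHS_SINCE_DRILL


def delegation_for(pool, current_epoch, on_call_ids):
    """Base delegation for listed pools, doubled while the pool is on call."""
    if not is_eligible(pool, current_epoch):
        return 0
    if pool.pool_id in on_call_ids:
        return 2 * BASE_DELEGATION_ADA
    return BASE_DELEGATION_ADA


# Example: one current pool on call, one current pool off call, one lapsed pool.
pools = [
    ListedPool("POOL001", last_passed_drill_epoch=500),
    ListedPool("POOL002", last_passed_drill_epoch=490),
    ListedPool("POOL003", last_passed_drill_epoch=460),
]
amounts = {p.pool_id: delegation_for(p, 510, {"POOL001"}) for p in pools}
```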

This would create an open and competitive atmosphere and allow anyone to learn how the disaster recovery process works. Maybe even add to it or find bugs.

Keeping it all off-chain shrinks the pool of involved people and raises questions about centralization and favoritism. Also, we don’t want to be another “e-mail joe at… in case of a crash” chain.

Disaster should be rare. Knowledge base access, inclusivity, transparency and fair access should be a constant.

(The following has nothing to do with you, just a bit of a rant.)
I think that we have entered the stage where people think that “minimum viable” is acceptable as a temporary solution or a better-than-nothing solution.
This is very dangerous thinking. For a system that is supposed to run uninterrupted and forever, “minimums” can change very quickly. Terra-Luna was minimum viable, until it wasn’t.

Since we are in a governance phase, we know that changes won’t be able to be implemented as quickly as before. So minimum can never be viable.

What we build and add to the protocol must be above minimum when it comes to decentralization, inclusivity and transparency. I know that this may go against efficiency in development and implementation in some cases. However, if we just start expanding Cardano into an off-chain distributed system instead of a decentralized system, this can cause irreversible centralization, which is a disaster we cannot recover from.

Kevin_Hammond · 23 September 2024 11:12

Agreed. One reason for updating the plan and publishing it as a CIP was to kick off the discussion about the right support processes, and how to ensure community involvement. From responses to other incidents, we’ve seen that the Cardano community is highly self-organising and responsive, which is great! This just needs to be capitalised on.
