CIP-135 Disaster Recovery Plan

We have proposed an updated disaster recovery plan as CIP-135. The link to the pull request can
be found here. Comments are welcomed, of course.

https://github.com/cardano-foundation/CIPs/pull/893


Hello @Kevin_Hammond

“Identify a small group of block producing nodes that will be used to recover the chain.”
Any chance this can be automated or randomly picked by the protocol?
Maybe use speed of propagation + stake as a qualifier?

Good question. The group will need to have sufficient block production assigned over a certain period of time, and operators will need to cooperate to deal with the recovery at short notice (perhaps over non-standard network connections), so I can’t see that this can be automated, unfortunately.

What I’d anticipate would be a pool of SPOs who would answer an emergency callout, and who would then work together to recover the chain. One criterion might be geographical proximity/connectivity.


What about a continuous “on-call” list?

Block producing SPOs that are willing to share contact info and have technical skills can be placed in a group, out of which a percentage is selected by the protocol as on-call disaster response each epoch.

This way it is never the same set of SPOs and selection can be partially randomized. It also makes for faster recovery, since the first step is already taken, and if someone in the on-call team is unreachable there is a list of available SPOs that can step in.

Let’s say out of 1000 block producing SPOs, 100 sign up for on-call. The protocol assigns each a number from 1 to 100 each epoch. The on-call team for each epoch is the group numbered 1 to 20. If some SPOs don’t respond within n time, then the next on the list (number 21) takes their spot.

At least we automate and add transparency this way instead of hoping SPOs will just “find” each other when it happens. Not fully automated, however I feel we should try to automate as much as we can.
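To make the rotation concrete, here is a minimal sketch of the numbering idea above, assuming the per-epoch order comes from hashing a shared seed together with the epoch number and pool id. The function names, the `seed` parameter, and the use of SHA-256 are illustrative assumptions, not anything specified in CIP-135 or the Cardano protocol; the same logic could equally run off-chain.

```python
import hashlib


def epoch_ranking(pool_ids, epoch, seed="example-seed"):
    """Return the signed-up pools in a deterministic pseudo-random order for one epoch.

    `seed` is a hypothetical stand-in for whatever shared randomness would be used.
    """
    def sort_key(pool_id):
        data = f"{seed}:{epoch}:{pool_id}".encode()
        return hashlib.sha256(data).hexdigest()

    return sorted(pool_ids, key=sort_key)


def on_call_team(ranking, team_size=20, unresponsive=frozenset()):
    """Walk the ranking in order, skipping unresponsive pools, until the team is full."""
    team = []
    for pool_id in ranking:
        if pool_id not in unresponsive:
            team.append(pool_id)
        if len(team) == team_size:
            break
    return team


# Example: 100 sign-ups, positions 1 to 20 are on call for the epoch.
pools = [f"pool{i}" for i in range(1, 101)]
ranking = epoch_ranking(pools, epoch=500)
team = on_call_team(ranking)

# If the pool in position 1 does not respond, position 21 joins the team.
fallback_team = on_call_team(ranking, unresponsive={ranking[0]})
```

Because the order depends only on the seed, the epoch, and the pool ids, anyone can recompute and check the list, which is the transparency property being asked for.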

Yes, an on-call list would be a sensible way to proceed. The comms channels really need to be set up in advance (e.g. via a Discord channel) - if there is an issue that triggers Disaster Recovery, everyone will need to act quickly and in a coordinated way.

You don’t really need the Cardano protocol to select responders, I expect (and that would also need some development) - if you wanted randomisation of call-outs, there are alerting systems that might be able to do that (e.g. PagerDuty).

In practice, Disaster Recovery should be an extremely rare event, of course.


Definitely not a “need”. It may actually be way more efficient to have something like this 100% off-chain.

However, having it on-chain (even partially) will be way more transparent and inclusive.
If just ‘some group’ is selected, people will feel it’s done and will not check, participate or disseminate knowledge, as we’ve seen with so many off-chain projects.

If it’s on-chain with clear standards we can have many SPOs aim to be part of it. We could run testnet scenarios which would be mandatory for SPOs to qualify, thus keeping many engaged and continuously learning.

We can have CF delegate some ₳ to all listed pools, and to stay on the list you have to successfully pass a testnet recovery scenario at least once every 36 epochs (as an example).
In every epoch where the protocol selects an SPO to be on-call, CF can delegate double the ₳ amount, as an incentive to keep up to date with new scenarios.
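As a rough sketch of how these two rules could be checked, assuming “at least once every 36 epochs” means the most recent successful drill is no more than 36 epochs old. The `Pool` record, the function names, and the base amount are hypothetical and only illustrate the rule, not any existing CF delegation process.

```python
from dataclasses import dataclass

QUALIFY_WINDOW = 36  # epochs, taken from the example figure above


@dataclass
class Pool:
    pool_id: str
    last_passed_epoch: int  # epoch of the most recent successful testnet recovery drill


def is_listed(pool, current_epoch, window=QUALIFY_WINDOW):
    """A pool stays listed while its last successful drill is within the window."""
    return current_epoch - pool.last_passed_epoch <= window


def delegation_for(pool, current_epoch, on_call_ids, base_amount):
    """Listed pools get the base amount, doubled in epochs where they are on call."""
    if not is_listed(pool, current_epoch):
        return 0
    if pool.pool_id in on_call_ids:
        return 2 * base_amount
    return base_amount


# Example: a pool that passed a drill in epoch 100, checked in epoch 120.
pool = Pool("pool1", last_passed_epoch=100)
on_call_amount = delegation_for(pool, 120, {"pool1"}, base_amount=1000)
standby_amount = delegation_for(pool, 120, set(), base_amount=1000)
```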

This would create an open and competitive atmosphere and allow anyone to learn how the disaster recovery process works. Maybe even add to it or find bugs.

Keeping it all off-chain shrinks the pool of involved people and raises questions about centralization and favoritism. Also, we don’t want to be another “e-mail joe at… in case of a crash” chain.

Disaster should be rare. Knowledge base access, inclusivity, transparency and fair access should be a constant.

(The following has nothing to do with you, just a bit of a rant :slightly_smiling_face:)
I think that we have entered the stage where people think that “minimum viable” is acceptable as a temporary solution or a better-than-nothing solution.
This is very dangerous thinking. For a system that is supposed to run uninterrupted and forever, “minimums” can change very quickly. Terra-Luna was minimum viable, until it wasn’t :man_shrugging:

Since we are in a governance phase, we know that changes won’t be able to be implemented as quickly as before. So minimum can never be viable.

What we build and add to the protocol must be above minimum when it comes to decentralization, inclusivity and transparency. I know that this may go against efficiency in development and implementation in some cases. However, if we just start expanding Cardano into an off-chain distributed system instead of a decentralized system, this can cause irreversible centralization, which is a disaster we cannot recover from.


Agreed. One reason for updating the plan and publishing it as a CIP was to kick off the discussion about the right support processes, and how to ensure community involvement. From responses to other incidents, we’ve seen that the Cardano community is highly self-organising and responsive, which is great! This just needs to be capitalised on.
