We have proposed an updated disaster recovery plan as CIP-135. The link to the pull request can
be found here. Comments are welcome, of course.
Hello @Kevin_Hammond
“Identify a small group of block producing nodes that will be used to recover the chain.”
Any chance this can be automated or randomly picked by the protocol?
Maybe use speed of propagation + stake as a qualifier?
Good question. The group will need to have sufficient block production assigned over a certain period of time, and operators will need to cooperate to deal with the recovery at short notice (perhaps over non-standard network connections), so I can’t see that this can be automated, unfortunately.
What I’d anticipate would be a pool of SPOs who would answer an emergency callout, and who would then work together to recover the chain. One criterion might be geographical proximity/connectivity.
What about a continuous “on-call” list?
Block-producing SPOs that are willing to share contact info and have the technical skills can be placed in a group, out of which a percentage is selected by the protocol as the on-call disaster response team each epoch.
This way it is never the same set of SPOs and selection can be partially randomized. It also makes for faster recovery, since the first step is already taken, and if someone on the on-call team is unreachable there is a list of available SPOs that can step in.
Let’s say out of 1000 block-producing SPOs 100 sign up for on-call. The protocol assigns each a number from 1 to 100 each epoch. The on-call team for each epoch is the group numbered 1 to 20. If some SPOs don’t respond within n time, then the next on the list (number 21) takes their spot.
At least we automate and add transparency this way instead of hoping SPOs will just “find” each other when it happens. Not fully automated, however I feel we should try to automate as much as we can.
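As a rough illustration of the rotation described above, here is a minimal Python sketch. It is not part of CIP-135. The function names, the `responded` callback, and the use of a hash of the epoch number as the randomness source are all assumptions made for the example; an on-chain version would presumably draw on the protocol's own randomness instead.

```python
import hashlib
import random
from typing import Callable, List


def rank_volunteers(pool_ids: List[str], epoch: int) -> List[str]:
    """Return the volunteers in a pseudo-random order that is reproducible per epoch."""
    # Assumption: the seed is derived from the epoch number alone, so anyone
    # can recompute the same ordering. This is a stand-in for protocol randomness.
    digest = hashlib.sha256(f"epoch-{epoch}".encode()).digest()
    seed = int.from_bytes(digest, "big")
    ranked = sorted(pool_ids)  # canonical order before shuffling
    random.Random(seed).shuffle(ranked)
    return ranked


def on_call_team(
    ranked: List[str],
    team_size: int,
    responded: Callable[[str], bool],
) -> List[str]:
    """Walk down the ranked list and keep the first `team_size` pools that respond."""
    team: List[str] = []
    for pool in ranked:
        if len(team) == team_size:
            break
        # An unresponsive pool is skipped, so the next number on the list steps in.
        if responded(pool):
            team.append(pool)
    return team
```

With 100 volunteers and a team size of 20, two unreachable pools among the first 20 would simply pull in numbers 21 and 22 from the same ranked list.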
Yes, an on-call list would be a sensible way to proceed. The comms channels really need to be set up in advance (e.g. via a Discord channel) - if there is an issue that triggers Disaster Recovery, everyone will need to act quickly and in a coordinated way.
You don’t really need the Cardano protocol to select responders, I expect (and that would also need some development) - if you wanted randomisation of call-outs, there are alerting systems that might be able to do that (e.g. PagerDuty).
In practice, Disaster Recovery should be an extremely rare event, of course.
Definitely not a “need”. It may actually be way more efficient to have something like this 100% off-chain.
However, having it on-chain (even partially) will be way more transparent and inclusive.
If just “some group” is selected, people will feel it’s done and will not check, participate or disseminate knowledge, as we’ve seen with so many off-chain projects.
If it’s on-chain with clear standards, we can have many SPOs aim to be part of it. We could run testnet scenarios which would be mandatory for SPOs to qualify, thus keeping many engaged and continuously learning.
We can have CF delegate some ₳ to all listed pools, and to stay on the list you have to successfully pass a testnet recovery scenario at least once every 36 epochs (as an example).
Every epoch where the protocol selects an SPO to be on-call, CF can delegate double the ₳ amount, as an incentive to keep up to date with new scenarios.
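To make the incentive rule concrete, here is a small Python sketch under stated assumptions. The names `ListedPool`, `is_eligible` and `delegation_for` are hypothetical, the 36-epoch window is only the example figure given above, and treating a pass exactly 36 epochs ago as still valid is a boundary choice made for this illustration.

```python
from dataclasses import dataclass

REQUALIFY_WINDOW_EPOCHS = 36  # example value from the post, not a fixed parameter
ON_CALL_MULTIPLIER = 2        # "double the ₳ amount" while on-call


@dataclass
class ListedPool:
    pool_id: str
    last_passed_epoch: int  # epoch of the most recent successful testnet recovery scenario


def is_eligible(pool: ListedPool, current_epoch: int) -> bool:
    """A pool stays listed if it passed a scenario within the requalification window."""
    # Assumption: a pass exactly REQUALIFY_WINDOW_EPOCHS ago still counts.
    return current_epoch - pool.last_passed_epoch <= REQUALIFY_WINDOW_EPOCHS


def delegation_for(pool: ListedPool, current_epoch: int, base_amount: int, on_call: bool) -> int:
    """Return the amount delegated to the pool this epoch (0 if it has dropped off the list)."""
    if not is_eligible(pool, current_epoch):
        return 0
    return base_amount * ON_CALL_MULTIPLIER if on_call else base_amount
```

A pool that last passed in epoch 100 would receive the base amount through epoch 136, double that in any of those epochs where it is on-call, and nothing from epoch 137 until it passes again.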
This would create an open and competitive atmosphere and allow anyone to learn how the disaster recovery process works. Maybe even add to it or find bugs.
Keeping it all off-chain shrinks the pool of involved people and raises questions about centralization and favoritism. Also, we don’t want to be another “e-mail joe at… in case of a crash” chain.
Disaster should be rare. Knowledge base access, inclusivity, transparency and fair access should be a constant.
(The following has nothing to do with you, just a bit of a rant.)
I think that we have entered the stage where people think that “minimum viable” is acceptable as a temporary solution or a better-than-nothing solution.
This is very dangerous thinking. For a system that is supposed to run uninterrupted and forever, “minimums” can change very quickly. Terra-Luna was minimum viable, until it wasn’t.
Since we are in a governance phase, we know that changes won’t be able to be implemented as quickly as before. So minimum can never be viable.
What we build and add to the protocol must be above minimum when it comes to decentralization, inclusivity and transparency. I know that this may go against efficiency in development and implementation in some cases. However, if we just start expanding Cardano into an off-chain distributed system instead of a decentralized system, this can cause irreversible centralization, which is a disaster we cannot recover from.
Agreed. One reason for updating the plan and publishing it as a CIP was to kick off the discussion about the right support processes, and how to ensure community involvement. From responses to other incidents, we’ve seen that the Cardano community is highly self-organising and responsive, which is great! This just needs to be capitalised on.