Relays out of sync: HeaderProtocolError > ChainTransitionError > VRFLeaderValueTooBig

Our relays lost sync after the epoch boundary and are now flooded with the error message below. As far as I can tell it appears for every node in our topology, at 10-second intervals:

[relay-ny:cardano.node.DnsSubscription:Error:53946]
[2020-08-19 08:15:50.04 UTC]
[String "Application Exception: 54.201.36.182:6000 HeaderError
  (At (Block {blockPointSlot = SlotNo {unSlotNo = 6221024}, blockPointHash = ...}))
  (HeaderProtocolError (HardForkValidationErrFromEra S (Z (WrapValidationErr
    {unwrapValidationErr = ChainTransitionError [OverlayFailure (VRFLeaderValueTooBig
      (OutputVRF {getOutputVRFBytes = \"..."}) (3334337931 % 4383606254572)
      (ActiveSlotCoeff {unActiveSlotVal = UnsafeUnitInterval (1 % 20), unActiveSlotLog = ...})
    )]}))))
  (Tip (SlotNo {unSlotNo = 6230299}) ... (BlockNo {unBlockNo = 4576507}))
  (Tip (SlotNo {unSlotNo = 6258650}) ... (BlockNo {unBlockNo = 4578327}))",
 String "SubscriptionTrace",
 String "\"node1.stakelove.com\""]

I can’t debug anything like this so I hope one of the other SPOs can tie this to an observed solution, or a developer can make a recommendation about the software.

The relays have logged only a few “new tip” messages since the boundary, with the last one coming about 8 hours ago. We’ve restarted both relays, including a reboot of the second one in case it was a network-layer problem.

We did upgrade briefly to 1.18.1 but rolled back a couple of days before the boundary. From what the developers were saying, I thought we were advised to stay back on 1.18.0 because of potential issues like this. If our relays are not syncing then we can’t wait for the newer version announced for next week… are there any recommendations?

Just popped the question into Telegram & no response there after 20 minutes: https://t.me/CardanoStakePoolWorkgroup/332907

Hi Robert,

It looks like the chain is corrupted from the edge case associated with the use of 1.18.1. Please compare your error with the original GitHub issue here:

I suggest stopping your relays and wiping the database:

cd
cd cardano-node
rm -R db

Then restart the relay and let it sync back up to the tip.

FROG

Thanks @ADAfrog, I thought it would come to that, since we did accumulate about 24 hours’ worth of DB data under 1.18.1. I’ll reference the GitHub issue as soon as we’re rebuilding and will post any other observations here :sunglasses:

p.s. based on the comments in the issue above (now closed) and the related issue I’m also rebuilding DB on our core node, since I cannot be sure that the ledger corruption in the relays hasn’t been passed on to our core (it was running 1.18.1 during the same period our relays were).

I can see @ADAfrog why you don’t want to be the first one to upgrade to new node releases. I can rebuild our core now with impunity since we’re not eligible for block election till next epoch, but otherwise this would have been a real dilemma. :fearful:

For anyone who finds themselves in the same situation we did: the word on GitHub is that deleting only the (corrupted) ledger folder is enough. The node then just has to rebuild the ledger from genesis, without re-syncing the entire chain (which is what is taking forever for us today).
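A minimal sketch of that ledger-only cleanup, in case it helps. The helper name wipe_ledger is made up for illustration, and it assumes the ledger folder sits at db/ledger inside the node database directory; check your own layout and stop the node before running anything like this.

```shell
# Hypothetical helper: remove only the ledger folder under a node database
# directory, leaving the rest of the database (the synced chain) in place.
# ASSUMPTION: the ledger folder is "$db_dir/ledger"; verify on your own node.
wipe_ledger() {
  db_dir="$1"
  if [ -d "$db_dir/ledger" ]; then
    rm -r -- "$db_dir/ledger"
    echo "removed $db_dir/ledger"
  else
    echo "no ledger folder under $db_dir" >&2
    return 1
  fi
}
```

With the node stopped, this would be called as wipe_ledger "$HOME/cardano-node/db" (matching the paths in FROG's commands above), followed by a restart of the node.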

IMPORTANT UPDATE -

It looks like using 1.18.1 in the epoch with d=1 will not cause future problems - but if you have ever used a node on 1.18.1 with d<1, you should absolutely rebuild your database from scratch.

Hey ADAfrog! Can you please explain what d=1 means?

Hi Daniel,

“d” is the decentralization parameter, and will be decreased until community stake pools are minting 100% of blocks.

d = 1 implies 100% of all blocks are minted by federated nodes, and 0% minted by community stake pool operators.

d is currently set at 0.74 I believe, meaning 26% of the blocks are minted by community stake pool operators.
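The arithmetic is just (1 - d) × 100. As a rough sketch (pool_share is a made-up helper name, not part of any Cardano tooling):

```shell
# Hypothetical helper: given the decentralization parameter d (between 0 and 1),
# print the percentage of blocks minted by community stake pools, i.e. (1 - d) * 100.
pool_share() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", (1 - d) * 100 }'
}

pool_share 0.74   # share for the current value of d mentioned above
pool_share 1      # fully federated
```

The first call prints 26 and the second prints 0, matching the figures above.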

I hope this helps.

Your friend, FROG