Relay node ran fine until it got stuck at one slot (invalid block)

Hi
I'm running a Cardano relay node in P2P mode, and it ran for a long time without problems. I recently upgraded to 10.1.4 and it continued to work well until a few days ago, when it got stuck at a particular block and started logging many errors of the following form:

2025-03-08T20:37:16.357204+01:00 SERVER cardano-node[555]: #033[34m[SERVER:cardano.node.InboundGovernor:Info:183]#033[0m [2025-03-08 19:37:16.35 UTC] TrMuxErrored (ConnectionId {localAddress = MYIP:3001, remoteAddress = 34.92.222.93:1338}) (InvalidBlock (At (Block {blockPointSlot = SlotNo 149690156, blockPointHash = 0c28ad6a7b445475d8c4961b0bdd8e7af41b94d25ee467a08f1b0fd02a5c30e7})) 0bc7ce20b6b5fc34a51300eadad64bb050db73b712632c4711be5d1cc9f511a4 (ValidationError (ExtValidationErrorLedger (HardForkLedgerErrorFromEra S (S (S (S (S (S (Z (WrapLedgerErr {unwrapLedgerErr = BBodyError (BlockTransitionError (LedgersFailure (LedgerFailure (ConwayWdrlNotDelegatedToDRep (KeyHash {unKeyHash = "0521608ad5f86f6856beeafda4d77b2371c1521c117ae5955a79af27"} :| ))) :| ))})))))))))))

2025-03-08T20:37:16.357387+01:00 lstcr5 cardano-node[555]: #033[34m[SERVER:cardano.node.InboundGovernor:Info:183]#033[0m [2025-03-08 19:37:16.35 UTC] TrInboundGovernorCounters (InboundGovernorCounters {coldPeersRemote = 1, idlePeersRemote = 5, warmPeersRemote = 2, hotPeersRemote = 1})

2025-03-08T20:37:16.357481+01:00 SERVER cardano-node[555]: #033[34m[SERVER:cardano.node.PeerSelectionCounters:Info:187]#033[0m [2025-03-08 19:37:16.35 UTC] PeerSelectionView {viewRootPeers = 60, viewKnownPeers = 45, viewAvailableToConnectPeers = 13, viewColdPeersPromotions = 4, viewEstablishedPeers = 9, viewWarmPeersDemotions = 0, viewWarmPeersPromotions = 0, viewActivePeers = 2, viewActivePeersDemotions = 0, viewKnownBigLedgerPeers = 15, viewAvailableToConnectBigLedgerPeers = 4, viewColdBigLedgerPeersPromotions = 4, viewEstablishedBigLedgerPeers = 0, viewWarmBigLedgerPeersDemotions = 0, viewWarmBigLedgerPeersPromotions = 0, viewActiveBigLedgerPeers = 0, viewActiveBigLedgerPeersDemotions = 0, viewKnownLocalRootPeers = 1, viewAvailableToConnectLocalRootPeers = 1, viewColdLocalRootPeersPromotions = 0, viewEstablishedLocalRootPeers = 1, viewWarmLocalRootPeersPromotions = 0, viewActiveLocalRootPeers = 1, viewActiveLocalRootPeersDemotions = 0, viewKnownNonRootPeers = 0, viewColdNonRootPeersPromotions = 0, viewEstablishedNonRootPeers = 0, viewWarmNonRootPeersDemotions = 0, viewWarmNonRootPeersPromotions = 0, viewActiveNonRootPeers = 0, viewActiveNonRootPeersDemotions = 0, viewKnownBootstrapPeers = 0, viewColdBootstrapPeersPromotions = 0, viewEstablishedBootstrapPeers = 0, viewWarmBootstrapPeersDemotions = 0, viewWarmBootstrapPeersPromotions = 0, viewActiveBootstrapPeers = 0, viewActiveBootstrapPeersDemotions = 0}

2025-03-08T20:37:16.357595+01:00 SERVER cardano-node[555]: #033[34m[SERVER:cardano.node.PeerSelection:Info:187]#033[0m [2025-03-08 19:37:16.35 UTC] TracePromoteWarmDone 2 2 95.217.220.29:3001

2025-03-08T20:37:16.368972+01:00 SERVER cardano-node[555]: #033[34m[SERVER:cardano.node.ConnectionManager:Info:6604]#033[0m [2025-03-08 19:37:16.36 UTC] TrConnectionHandler (ConnectionId {localAddress = MYIP:3001, remoteAddress = 34.92.222.93:1338}) (TrConnectionHandlerError OutboundError (InvalidBlock (At (Block {blockPointSlot = SlotNo 149690156, blockPointHash = 0c28ad6a7b445475d8c4961b0bdd8e7af41b94d25ee467a08f1b0fd02a5c30e7})) 0bc7ce20b6b5fc34a51300eadad64bb050db73b712632c4711be5d1cc9f511a4 (ValidationError (ExtValidationErrorLedger (HardForkLedgerErrorFromEra S (S (S (S (S (S (Z (WrapLedgerErr {unwrapLedgerErr = BBodyError (BlockTransitionError (LedgersFailure (LedgerFailure (ConwayWdrlNotDelegatedToDRep (KeyHash {unKeyHash = "0521608ad5f86f6856beeafda4d77b2371c1521c117ae5955a79af27"} :| ))) :| ))}))))))))))) ShutdownPeer)
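For anyone wanting to check whether every rejection in their own log points at the same block, here is a minimal Python sketch that extracts the peer, slot and block hash from InvalidBlock entries shaped like the ones above. The regular expression is an assumption based only on these sample lines, and the function names are made up for illustration; other node versions or log formats may need a different pattern.

```python
import re
from collections import Counter

# Pattern derived from the sample log lines in this post (an assumption,
# not an official log grammar).
INVALID_BLOCK = re.compile(
    r"remoteAddress = (?P<peer>[^}\s]+)\}.*?"
    r"InvalidBlock \(At \(Block \{blockPointSlot = SlotNo (?P<slot>\d+), "
    r"blockPointHash = (?P<hash>[0-9a-f]+)\}\)\)"
)


def invalid_blocks(lines):
    """Yield (peer, slot, block_hash) for each line that reports an InvalidBlock."""
    for line in lines:
        match = INVALID_BLOCK.search(line)
        if match:
            yield match.group("peer"), int(match.group("slot")), match.group("hash")


def summarize(lines):
    """Count reports per (slot, hash) and collect the peers that served each block."""
    counts = Counter()
    peers = {}
    for peer, slot, block_hash in invalid_blocks(lines):
        key = (slot, block_hash)
        counts[key] += 1
        peers.setdefault(key, set()).add(peer)
    return counts, peers
```

Feeding it the lines of a log file (for example `summarize(open(path))`, with `path` pointing at wherever your node logs) gives one counter entry per distinct rejected block, so a single entry means all peers are being dropped over the same block.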

All of these errors occur at slot 149690156. The command cardano-cli query tip --mainnet returns:
{
    "block": 11567820,
    "epoch": 544,
    "era": "Conway",
    "hash": "aa71bda8b09756ef1535696c7e6ffcdc864d511547566700f4445945219e9271",
    "slot": 149690099,
    "slotInEpoch": 45299,
    "slotsToEpochEnd": 386701,
    "syncProgress": "99.91"
}

and it stays stuck there.
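As a quick sanity check on these numbers, the following Python sketch compares the tip output with the slot of the rejected block. The function name and report keys are made up for illustration; the epoch length of 432000 slots is the mainnet value.

```python
import json

SLOTS_PER_EPOCH = 432_000  # mainnet epoch length in slots


def tip_report(tip_json, invalid_slot):
    """Compare `cardano-cli query tip` output with the slot of a rejected block."""
    tip = json.loads(tip_json)
    return {
        # How many slots the local tip sits before the rejected block.
        "slots_before_invalid_block": invalid_slot - tip["slot"],
        # slotInEpoch and slotsToEpochEnd should add up to one full epoch.
        "epoch_fields_consistent": (
            tip["slotInEpoch"] + tip["slotsToEpochEnd"] == SLOTS_PER_EPOCH
        ),
    }


tip_output = """{
    "block": 11567820, "epoch": 544, "era": "Conway",
    "slot": 149690099, "slotInEpoch": 45299, "slotsToEpochEnd": 386701,
    "syncProgress": "99.91"
}"""
report = tip_report(tip_output, invalid_slot=149690156)
print(report)
```

With the values above, the tip sits 57 slots before the rejected block, which is consistent with the node having stopped just short of that block and refusing to go past it.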

There are also many more log entries than when the node was running fine, but I think they come from the P2P-connected peers that serve my node the block it considers invalid; my node always demotes these peers to cold after the invalid block message. So in gLiveView there are always some cold, warm and hot peers, but the numbers change every few seconds. Only the hot count always stays at 2. There are always around 5-10 incoming and some 30-40 outgoing peers.

My Cardano DB partition has 23 GB free (not much, but it should be sufficient for now). The machine has 32 GB of RAM and a much better CPU than the minimum requirement, and I made sure my config files are exactly identical to those in Mainnet - The Cardano Operations Book to rule out config errors.

What would be the next steps to resolve this problem? Is it possible to roll the DB back to before this slot/block and see whether it goes through the next time? Or is the only option to delete the whole DB and resync from scratch (or from a snapshot), with no way to delete just part of it?

Thanks for your help!

Hi @ogre5000 I seem to be having the same problem. Did you find a resolution?

@s2cicf I wasn't able to find the real cause of this problem even after quite a bit of research on the internet, so here is the workaround I used to resolve it:

At the time this happened, I had 2 relays (one online and one offline for maintenance) and 1 block producer, all running the most recent mainnet release of cardano-node, 10.1.4. So I fired up my offline relay again and it synced to 100% without this problem. I then took it offline again, copied its db folder to the relay with the problem, and TADA, this relay worked again too. But the problem had also reached my block producer, which was no longer able to connect to these working and synced relays. Unfortunately I don't remember the exact error messages. Probably these were the errors, along with connection errors from trying to establish hot connections with the relays and instantly going back to cold:

Mar 10 11:13:07 SERVER2 cardano-node[2098655]: #033[31m[lstcr4:cardano.node.Forge:Error:209]#033[0m [2025-03-10 10:13:07.05 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.50035296e8)]))]
Mar 10 11:13:08 SERVER2 cardano-node[2098655]: #033[31m[lstcr4:cardano.node.Forge:Error:209]#033[0m [2025-03-10 10:13:08.06 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "TraceNoLedgerView"),("slot",Number 1.50035297e8)]))]

So I did the same on the block producer: I took one relay and the BP offline, deleted the BP's old db folder, copied the relay's db folder over, and started the relay and the BP again. Since then, everything has worked well.
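The db swap described above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: both nodes are already stopped, the healthy db folder is reachable as a local path (for example after transferring it from the other server), and the function name, the `keep_backup` option and the `.broken` suffix are made up for this example.

```python
import shutil
from pathlib import Path


def replace_db(synced_db: Path, stuck_db: Path, keep_backup: bool = False):
    """Replace a stuck node's db folder with a copy of a healthy one.

    Assumes both cardano-node instances are stopped before this runs.
    Returns the backup path if one was kept, otherwise None.
    """
    if not synced_db.is_dir():
        raise FileNotFoundError(f"no synced db folder at {synced_db}")

    backup = None
    if stuck_db.exists():
        if keep_backup:
            # Keeping the old db needs enough free disk space for two copies.
            backup = stuck_db.with_name(stuck_db.name + ".broken")
            stuck_db.rename(backup)
        else:
            # Matches the workaround above: delete the old db folder first.
            shutil.rmtree(stuck_db)

    shutil.copytree(synced_db, stuck_db)
    return backup
```

Deleting first is the default here because it matches what was done above and because a partition with little free space cannot hold two copies of the chain db; `keep_backup=True` is only sensible when there is room for both.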

Just out of curiosity, did this happen to you at the exact same slot? InvalidBlock (At (Block {blockPointSlot = SlotNo 149690156