Preview Stakepool stopped producing blocks on November 20

Hi folks, I could really use some help with my preview stakepool that suddenly started failing.

I’ve been running a stakepool for a little over 2 years with over 40 mainnet blocks. My setup has a duplicate pool running in both preprod and preview networks for testing/stability.

I host from home on a single IP address. All traffic is routed through Traefik where it gets load balanced between 2 internal cardano relays. These are then connected to a single block producer node.

Recently I noticed that my preview nodes have gone a bit crazy and stopped producing blocks. The pool was producing blocks daily in preview and now it hasn’t produced any for over 10 days. By “crazy” I mean the number of connected peers fluctuates rapidly every couple of minutes and the percentage of late blocks has been rising steadily.

In Grafana all my charts are completely smooth and normal until November 20, when my peer counts start fluctuating a lot and the number of late blocks received starts rising consistently (and is still rising). My block producer usually shows a very steady flat line of 2 duplex connections (to my 2 internal relays), but since this issue started it regularly drops to 1 or 0 duplex connections for a second every 3 minutes or so.

Also, when I plot the block/slot number over time, the graph for my preview node looks like a series of “steps”, with vertical and horizontal lines showing the block/slot suddenly jumping and then freezing for a long time. The same metric on my mainnet or preprod nodes shows a very smooth rising curve. It seems like my preview node is just getting brief intermittent connectivity.
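For anyone who wants to quantify the “steps” instead of eyeballing them, here is a minimal sketch. It assumes you can export (unix time, slot number) samples from whatever metrics store you use; the function name `find_stalls` and the 120-second threshold are made up for illustration and are not part of any Cardano tooling.

```python
from typing import List, Tuple


def find_stalls(
    samples: List[Tuple[float, int]],
    min_stall_seconds: float = 120.0,  # hypothetical threshold, tune to taste
) -> List[Tuple[float, float]]:
    """Return (start, end) time ranges during which the slot number did not advance.

    `samples` is a time-ordered list of (unix_time, slot_number) pairs.
    """
    stalls: List[Tuple[float, float]] = []
    if not samples:
        return stalls

    # stall_start: time of the first sample seen at the current slot value
    # last_time:   time of the most recent sample processed
    stall_start, last_slot = samples[0]
    last_time = stall_start

    for t, slot in samples[1:]:
        if slot > last_slot:
            # The tip moved, so the flat stretch (if any) ended at the previous sample.
            if last_time - stall_start >= min_stall_seconds:
                stalls.append((stall_start, last_time))
            stall_start = t
            last_slot = slot
        last_time = t

    # A flat stretch that is still ongoing at the end of the data.
    if last_time - stall_start >= min_stall_seconds:
        stalls.append((stall_start, last_time))
    return stalls


if __name__ == "__main__":
    # Synthetic data: the slot is frozen for three minutes, then jumps.
    demo = [(0, 100), (60, 100), (120, 100), (180, 100), (240, 160), (300, 220)]
    for start, end in find_stalls(demo):
        print(f"slot did not advance from t={start} to t={end}")
```

Running the same function over samples from a healthy node should return few or no ranges, which gives a rough numeric version of the smooth curve versus steps comparison.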

I can’t see any obvious error in the logs at all from either the relays or the block producer. Just tons and tons of logspam showing peer connection counts, but no error. Just an occasional socket error from a bad peer but nothing unexpected or very frequent.

The most recent sequence of events on the node is something like this:

  1. November 16 - rotated KES for stakepool. It was running on version 10.5.1
  2. The pool seemed to run smoothly and continue producing blocks until shortly after midnight on November 20
  3. November 20 - stopped producing blocks, percentage of late blocks starts gradually climbing, peer connection numbers fluctuate wildly
  4. November 25 - bump to version 10.5.3
  5. November 26 - bump to version 10.6.0
  6. December 2 - bump to version 10.6.1 (heads up: whoever is tagging pre-releases seems to have mis-tagged this version, which is apparently the recommended version for preview despite not being tagged as a pre-release?)

With the version bumps I also had to update my config to include the peer-snapshot.json to eliminate some warnings in the logs.

Despite the version upgrades and no errors in the logs there still seems to be something wildly wrong with my stakepool.

Thankfully the same issue has not surfaced in preprod or mainnet; it is only my preview stakepool having this issue. But I would like to fix it before the next major release.

Any thoughts on what might be causing the connection between block producer and relay to flutter every ~3 minutes? Any thoughts on what might be causing my connection to public peers to be fluctuating a lot? Is there any tracing I can turn on to get better log messages?

This issue really has me stuck.

You have missed the fork incident that happened on 20th November on Preview and on 21st November on Mainnet: https://intersectmbo.org/news/incident-report-network-partition-analysis-and-resolution-strategy

Your Mainnet nodes probably switched to the correct fork once it managed to overtake the bad fork and all nodes rolled back to the good one.

But on Preview, the fork was already beyond the point where rollbacks are allowed, so your nodes stayed on the bad fork.

You would have to truncate the database to before the fork. The command that was shared in the IOG Discord for that was:

db-truncater --db /path/to/cardano_db --truncate-after-slot 96900000 cardano --config /path/to/config.json
rm -rf /path/to/cardano_db/{ledger,volatile,clean}
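
If you would rather review both steps before running anything destructive, here is a small sketch that only builds the command and the list of paths to delete, without executing either. The helper name `truncation_plan` is made up; the flags, the slot value, and the placeholder paths are exactly the ones quoted above. Note that `{ledger,volatile,clean}` is shell brace expansion, so the `rm -rf` line targets three separate paths.

```python
import shlex
from pathlib import PurePosixPath
from typing import Dict, List


def truncation_plan(db_dir: str, config_path: str, slot: int) -> Dict[str, List[str]]:
    """Build (but do not run) the two recovery steps quoted in this thread.

    Hypothetical helper: it only mirrors the command line shared above and
    assumes a POSIX host. Nothing is executed or deleted here.
    """
    db = PurePosixPath(db_dir)
    argv = [
        "db-truncater",
        "--db", str(db),
        "--truncate-after-slot", str(slot),
        "cardano",
        "--config", config_path,
    ]
    # What the shell expands /path/{ledger,volatile,clean} into.
    remove = [str(db / name) for name in ("ledger", "volatile", "clean")]
    return {"argv": argv, "remove": remove}


if __name__ == "__main__":
    # Placeholder paths, same as in the quoted command.
    plan = truncation_plan("/path/to/cardano_db", "/path/to/config.json", 96900000)
    print("would run:   ", shlex.join(plan["argv"]))
    for path in plan["remove"]:
        print("would remove:", path)
```

Printing the plan first makes it easier to spot a wrong database path before `rm -rf` gets anywhere near it; stopping the node and taking a copy of the database directory beforehand is a sensible extra precaution.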

Also, a lot of SPOs seem to have needed to set ConsensusMode in their config.json back from GenesisMode to PraosMode.
I don’t know if 10.6.1 is a fixed version (and if a non-fixed one is still at risk of getting on the wrong fork). Official recommendation still seems to be to use 10.5.3.
At this point, it’s maybe easier to set the nodes up again from scratch.
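
For the ConsensusMode change mentioned above, here is a minimal sketch of doing it with a script rather than by hand. It assumes your config.json is plain JSON with a top-level `ConsensusMode` key; the function name `set_consensus_mode` and the `.bak` backup naming are my own choices, not part of cardano-node.

```python
import json
import shutil
from pathlib import Path
from typing import Optional

# The two values named in this thread; other values are rejected here on purpose.
ALLOWED_MODES = ("PraosMode", "GenesisMode")


def set_consensus_mode(config_path: str, mode: str = "PraosMode") -> Optional[str]:
    """Set ConsensusMode in a node config file and return the previous value.

    Hypothetical helper. Writes a `<name>.bak` copy before changing anything.
    The file is rewritten with 2-space indentation, so formatting may change.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unexpected mode {mode!r}, expected one of {ALLOWED_MODES}")

    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("ConsensusMode")  # None if the key is absent

    if previous == mode:
        return previous  # already set, leave the file untouched

    shutil.copy2(path, path.with_name(path.name + ".bak"))
    config["ConsensusMode"] = mode
    path.write_text(json.dumps(config, indent=2) + "\n")
    return previous
```

Usage would be something like `set_consensus_mode("/path/to/config.json")`, followed by a node restart so the new setting is picked up.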

Also, you should probably join some SPO communication channels to get news like this more timely.

Thanks @HeptaSean, I suspected it was related to that incident but missed the finer details around how to recover on preview. I did spend some time on Google but all the results just talked about mainnet (or were tons of SEO spam for random unrelated crypto).

Can you recommend some good communication channels for these types of updates? I’ve honestly struggled to find any way of staying updated. I don’t really have time to be checking in daily on forums so I could use something like a summary newsletter or blog, but the Intersect newsletter leaves much to be desired and the blog posts are far too generic and not specific enough to running a stakepool.

@HeptaSean I also just noticed I don’t seem to be in the “IOG Discord”. Where can I find the invite/details for that channel? I am in the Intersect MBO Discord but I just checked and I’m surprised to see there doesn’t seem to be an SPO channel there :confused:

Hope this works: https://discord.gg/JaMSRpXXYE

You’d have to choose that you are an SPO in the #start-here channel. Then, a whole category of SPO-related channels opens. One of them is also #spo-announcements which should have the most important stuff covered.

Thank you, I don’t want to get my hopes too high but this does seem like the exact channel I need and have been looking for. Thank you again @HeptaSean

Confirming charts look much better now and I am producing blocks again in preview. Switching to Praos made the connections much more stable again; looking forward to Genesis though.

I attempted a clean refresh (wipe out the data directory) of my nodes when I was troubleshooting with 10.6.0 but that didn’t fix the issue at the time.

I have now rolled back to 10.5.3, set consensus mode to Praos, and then reset my nodes again.

Charts look healthy and match preprod/mainnet now.