Hi folks, I could really use some help with my preview stakepool that suddenly started failing.
I’ve been running a stakepool for a little over 2 years with over 40 mainnet blocks. My setup has a duplicate pool running in both preprod and preview networks for testing/stability.
I host from home on a single IP address. All traffic is routed through Traefik, which load balances it between 2 internal Cardano relays. The relays are then connected to a single block producer node.
Recently I noticed that my preview nodes have gone a bit crazy and stopped producing blocks. The pool was producing blocks daily in preview, and now it hasn’t produced any for over 10 days. By “crazy” I mean that the number of connected peers fluctuates rapidly every couple of minutes and the percentage of late blocks has been rising steadily.
In Grafana all my charts are completely smooth and normal until November 20, when the peer count starts fluctuating a lot and the number of late blocks received starts rising consistently (it is still rising). My block producer usually shows a very steady flat line of 2 duplex connections (to my 2 internal relays), but since this issue started it regularly drops to 1 or 0 duplex connections for a second every 3 minutes or so.
When I plot the block/slot number over time on my preview node, the graph looks like a series of “steps”, with vertical and horizontal lines showing the block/slot suddenly jumping and then freezing for a long time. The same metric on my mainnet or preprod nodes shows a very smooth rising curve. It seems like my preview node is only getting brief, intermittent connectivity.
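To put numbers on the “steps” rather than eyeballing the graph, here is a minimal sketch of how the flat stretches could be measured. It assumes the slot metric has been exported from Grafana as `(timestamp_seconds, slot)` pairs; the sample format, the function name `find_stalls`, and the 60-second threshold are all my own placeholders, not anything from cardano-node.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (unix timestamp in seconds, slot number): assumed export format


def find_stalls(samples: List[Sample], min_stall_seconds: float = 60.0) -> List[Tuple[float, float, int]]:
    """Return (start_ts, end_ts, slot) for every stretch where the slot
    number did not advance for at least min_stall_seconds."""
    stalls: List[Tuple[float, float, int]] = []
    if not samples:
        return stalls

    stall_start, last_slot = samples[0]
    prev_ts = stall_start
    for ts, slot in samples[1:]:
        if slot > last_slot:
            # The slot advanced: close the flat stretch that just ended.
            if prev_ts - stall_start >= min_stall_seconds:
                stalls.append((stall_start, prev_ts, last_slot))
            stall_start, last_slot = ts, slot
        prev_ts = ts

    # Account for a stall that is still ongoing at the end of the data.
    if prev_ts - stall_start >= min_stall_seconds:
        stalls.append((stall_start, prev_ts, last_slot))
    return stalls


if __name__ == "__main__":
    # Made-up data: frozen at slot 100 for 90 s, then a sudden jump.
    preview = [(0, 100), (30, 100), (60, 100), (90, 100), (120, 250), (150, 251)]
    for start, end, slot in find_stalls(preview):
        print(f"stalled at slot {slot} from t={start} to t={end}")
```

Running the same function over the mainnet and preprod exports should return an empty list if those curves really are smooth, which would make the comparison with preview concrete.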
I can’t see any obvious errors in the logs from either the relays or the block producer. There is a lot of log spam showing peer connection counts, but no errors apart from an occasional socket error from a bad peer, which is neither unexpected nor very frequent.
The most recent sequence of events on the node is something like this:
- November 16 - rotated KES for the stakepool. It was running version 10.5.1
- The pool seemed to run smoothly and kept producing blocks until shortly after midnight on November 20
- November 20 - stopped producing blocks, percentage of late blocks started gradually climbing, peer connection numbers began fluctuating wildly
- November 25 - bumped to version 10.5.3
- November 26 - bumped to version 10.6.0
- December 2 - bumped to version 10.6.1 (heads up to whoever tags pre-releases: the tagging for this version looks wrong, since it is apparently the recommended version for preview despite not being tagged as a pre-release?)
With the version bumps I also had to update my config to include peer-snapshot.json, which eliminated some warnings in the logs.
Despite the version upgrades and no errors in the logs there still seems to be something wildly wrong with my stakepool.
Thankfully the same issue has not surfaced in preprod or mainnet; only my preview stakepool is affected. But I would like to fix it before the next major release.
Any thoughts on what might be causing the connection between block producer and relay to flutter every ~3 minutes? Any thoughts on what might be causing my connection to public peers to be fluctuating a lot? Is there any tracing I can turn on to get better log messages?
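In case it helps anyone match the flutter against a timer or timeout setting, here is a rough sketch of how I would pin down the “every ~3 minutes” figure. It assumes the duplex connection count is exported from Grafana as `(timestamp_seconds, count)` pairs; the function names and the sample data are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def drop_times(samples: List[Tuple[float, int]], expected: int = 2) -> List[float]:
    """Timestamps at which the connection count first falls below `expected`."""
    drops: List[float] = []
    prev_ok = True
    for ts, count in samples:
        ok = count >= expected
        if prev_ok and not ok:
            # Record only the start of each drop, not every low sample.
            drops.append(ts)
        prev_ok = ok
    return drops


def median_drop_interval(samples: List[Tuple[float, int]], expected: int = 2) -> Optional[float]:
    """Median number of seconds between the starts of consecutive drops."""
    drops = drop_times(samples, expected)
    gaps = [later - earlier for earlier, later in zip(drops, drops[1:])]
    return median(gaps) if gaps else None


if __name__ == "__main__":
    # Made-up data: the count dips below 2 at t=120, t=300 and t=480.
    duplex = [(0, 2), (60, 2), (120, 1), (180, 2), (240, 2),
              (300, 0), (360, 0), (420, 2), (480, 1)]
    print(median_drop_interval(duplex))
```

A very regular interval would point towards something periodic (a timer, a health check, a connection timeout), while a noisy one would look more like genuine network instability.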
This issue really has me stuck.