I’ve had my eye on this for several weeks without coming up with any further ideas, so I’m hoping some analytically oriented SPOs and maybe a node developer can provide some insight & more things to check…
Our pool has two identically configured relays running on identical cloud instances, but one of them is running well within its specifications and the other is exceeding them. The normal relay is on the same data centre subnet as our block producer, in North America, and the stressed relay is in a similar data centre in Asia, where network connectivity appears 3-4x better than in the USA.
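(To put a rough number on that connectivity claim, the kind of measurement I mean is round-trip time from each relay to a few peers they have in common; the hostnames below are placeholders:)

```bash
# Placeholders -- substitute a few peers that both relays connect to
for HOST in relay-a.example.org relay-b.example.org; do
  ping -c 20 -q "$HOST"   # compare average RTT and packet loss from each relay
done
```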
We’re on node version 1.24.2 with the default config.json, nothing extra besides the node, producing all scheduled blocks. The normal relay gradually allocates memory until physical memory is nearly used up (I guess as an optimisation), but the stressed node keeps grabbing more memory beyond that point until it cuts substantially into the swap space.
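In case it helps anyone check our observations, this is roughly how we could track the node’s resident versus swapped-out memory over time (a minimal sketch for a Linux host, assuming the process is named cardano-node; the output file name is just an example):

```bash
#!/usr/bin/env bash
# Sketch: log the node's resident memory (VmRSS) and swapped-out memory (VmSwap)
# every 5 minutes. Assumes a Linux host and a process named "cardano-node";
# "node-memory.log" is only an example output file.
PID=$(pgrep -xo cardano-node) || { echo "cardano-node not running" >&2; exit 1; }
while sleep 300; do
  {
    printf '%s ' "$(date -u +%FT%TZ)"
    # VmRSS / VmSwap are reported in kB in /proc/<pid>/status
    awk '/^VmRSS:|^VmSwap:/ {printf "%s %s %s  ", $1, $2, $3}' "/proc/$PID/status"
    echo
  } >> node-memory.log
done
```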
Once the disk I/O in our performance graphs on the stressed node suggests paging activity, we can see our load average often stays above 1, while the “normal” node stays in the 0.5 range. At this point I generally feel like restarting the stressed relay, after which it takes about 2 or 3 days for the problem to build up again.
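For what it’s worth, this is how we could confirm that the elevated load really is paging rather than raw CPU work (a sketch; sar needs the sysstat package installed):

```bash
# si/so are pages swapped in/out per second; sustained non-zero values together
# with a high "wa" (I/O wait) column point at swap thrashing rather than CPU load.
vmstat 60 10        # one sample per minute for 10 minutes
# With sysstat installed, sar reports paging (-B) and swapping (-W) statistics:
sar -B -W 1 5
```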
As far as I can tell our nodes are identical except for the asymmetry in our local topology. Both relays have 10 relays other than our own in their topology files, and netstat -46 shows roughly the same number of network peers as well as the same balance between incoming & outgoing connections.
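For anyone wanting to reproduce that check, an equivalent with ss (a sketch assuming the relay listens on the common port 3001; adjust for your configuration):

```bash
PORT=3001   # assumed listening port -- substitute whatever your relay uses
# Inbound peers: established TCP connections arriving at our listening port
ss -tn "( sport = :$PORT )" | grep -c ESTAB
# Outbound peers: connections we opened (assumes the peers also listen on $PORT)
ss -tn "( dport = :$PORT )" | grep -c ESTAB
```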
Our transaction processing is robust, with the Forge data for both relays showing one successful transaction added every 6 or 7 seconds. The only difference (EDIT: that I could see at the time) is the number of bad transactions, so that’s been my hypothesis about what’s been dragging down the stressed relay: the extra CPU load of evaluating the invalid transactions.
Since I haven’t seen this data available in Forge, we gathered it from our logfiles (we use
... | nice grep 'BadInputsUTxO"),("error",String "The transaction contains inputs that do not exist in the UTxO set' | wc -l
… for periods since our 1.24.2 upgrade on 09 December. In all that time we had 1.2 million bad Tx on our normal relay but 2.0 million on our stressed relay (65% higher). Over the last month the disparity in bad-Tx frequency has gradually been shrinking: across all of yesterday (Jan 09) it was only 31% higher, not as dramatic but still noticeable. Through all these periods the frequency of accepted Tx is the same.
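In case anyone wants to break those figures down per day rather than as one grand total, something like this sketch is what we have in mind (the log path is a placeholder for wherever your JSON scribe writes, and the “at” timestamp field is assumed from our log format):

```bash
#!/usr/bin/env bash
# Sketch: daily BadInputsUTxO counts from the node's JSON log, so the two relays
# can be compared over time. LOG is a placeholder path -- point it at your own
# file -- and we assume each line carries an "at":"YYYY-MM-DDT..." timestamp.
LOG=/var/log/cardano/node.json
nice grep 'BadInputsUTxO' "$LOG" \
  | sed -n 's/.*"at":"\([0-9-]\{10\}\).*/\1/p' \
  | sort | uniq -c
```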
So my questions are:
1. Is this difference in bad Tx between the two relays enough to create an observable difference in their CPU & memory usage? If not, what else might be causing it, & what specific tests might we use?
2. Whether or not topology is handled by the forthcoming “peer-to-peer governor” in the near future, does (or will) anyone have any level of control over which hosts can submit bad transactions?
3. Does anything about our topology or geographical distribution make the remote node more of a target for bad transactions?
4. Could the topology & location alone account for our uneven load distribution, without any consideration of the bad Tx?
Both our “normal” and our stressed relay are 2-vCPU Ubuntu instances with 4 GB of RAM. Any suggestions that we simply throw more memory and CPU at the stressed relay, or upgrade them both, may be justifiable but are off topic in this case, since we need to understand how to analyse the disparity in our current setup.