Bad transactions (or outdated nodes): effect on performance?

I’ve had my eye on this for several weeks without coming up with any further ideas, so I’m hoping some analytically oriented SPOs, and maybe a node developer, can provide some insight & more things to check…

Our pool has two identically configured relays running on identical cloud instances, but one of them is running well within its specifications while the other is exceeding them. The normal relay is on the same data centre subnet as our block producer, in North America, and the stressed relay is in a similar data centre in Asia, where network connectivity appears 3-4x better than in the USA.

We’re on node version 1.24.2 with the default config.json, nothing extra besides the node, producing all scheduled blocks. The normal relay gradually allocates memory until physical memory is nearly used up (I guess as an optimisation), but the stressed node keeps grabbing more memory beyond that until it cuts substantially into the swap space.

Once the disk I/O in our performance graphs on the stressed node suggests paging activity, we can see its load average often stays above 1 while the “normal” node stays in the 0 to 0.5 range. At that point I generally restart the stressed relay, after which it takes about 2 or 3 days for the problem to build up again.
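
For anyone who wants to confirm the paging directly rather than inferring it from disk I/O graphs, the standard tools are enough; a minimal sketch:

    # si/so columns are pages swapped in/out per second; sustained non-zero
    # values here line up with the disk I/O spikes we see in our graphs
    vmstat 5
    # current swap usage at a glance
    free -h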

As far as I can tell our nodes are identical except for the asymmetry in our local topology. Both relays have 10 relays other than our own in topology.json and netstat -46 shows roughly the same number of network peers as well as the same balance between incoming & outgoing connections.
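
For reference, this is roughly the check we run. A sketch only: it counts every established TCP socket on the host (close enough on a box running nothing but the node), and 3001 is our relay port, which you would adjust to your own.

    # split established connections into inbound (local port is our listening
    # port) and outbound (local port is ephemeral)
    netstat -46 -tn | awk '$6 == "ESTABLISHED" {
        if ($4 ~ /:3001$/) inbound++; else outbound++
    }
    END { print "inbound:", inbound + 0, "outbound:", outbound + 0 }'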

Our transaction processing is robust, with the Forge data for both relays showing one successful transaction added every 6 or 7 seconds. The only difference (EDIT: that I could see at the time) is the number of bad transactions, so that’s been my hypothesis about what’s been dragging down the stressed relay: the extra CPU load of evaluating the invalid transactions.

Since I haven’t seen this data exposed in Forge, we gathered it from our logfiles (we use SimpleView):

... | nice grep 'BadInputsUTxO"),("error",String "The transaction contains inputs that do not exist in the UTxO set' | wc -l

… for periods since our 1.24.2 upgrade on 09 December. In all that time we had 1.2 million bad Tx on our normal relay but 2.0 million on our stressed relay (65% higher). Over the last month the disparity in bad Tx frequency has gradually been shrinking: over all of yesterday (Jan 09) it was only 31% higher, which is not as dramatic but still noticeable. Through all these periods the frequency of accepted Tx has been the same on both relays.
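
For anyone wanting to reproduce the comparison, this is roughly the shape of the full pipeline. A sketch only: the log path below is hypothetical, and it assumes (as in our SimpleView text logs) that each line starts with a YYYY-MM-DD timestamp.

    # count BadInputsUTxO rejections per day in a text-format log
    nice grep 'BadInputsUTxO' /var/log/cardano/node.log \
        | cut -c1-10 \
        | sort | uniq -c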

So my questions are:

  1. Is this difference in bad Tx between the two relays enough to create an observable difference in their CPU & memory usage? If not, what else might be causing it & what specific tests might we use?

  2. Whether topology stays manually configured or is handled by the forthcoming “peer-to-peer governor” in the near future, do (or will) operators have any level of control over which hosts can submit bad transactions?

  3. Does anything about our topology or geographical distribution make the remote node more of a target for bad transactions?

  4. Could the topology & location alone account for our uneven load distribution, without any consideration of the bad Tx?

Both our “normal” and our stressed relay are 2-vCPU Ubuntu instances with 4 GB of RAM. Any suggestions that we simply throw more memory and CPU at the stressed relay, or upgrade them both, may be justifiable but are off topic here, since we need to understand how to analyse the disparity in our current setup. :sunglasses:

Maybe you hit this bug

Anyway, can you see what process is consuming your resources?

It looks like that could be it. I saw the HardForkEncoderDisabledEra messages early on and decided simply to filter them out of our view of the logfiles, mostly because of the limited participation that happened here:

Just as we measured back then, our “stressed” relay today has 4x as many HardForkEncoderDisabledEra bad-connection reports as the “normal” one, which is a much greater disparity than for the bad Tx. Interestingly, that also implies the nodes that didn’t upgrade before the “deadline” still haven’t upgraded a month later, and therefore maybe never will.

Though we can’t use it as-is, because our logging (SimpleView) is text-based and the blocking script (BlockNodesViaFail2Ban.md) requires JSON, at least others are reporting that this workaround fixes the load problem. If needed I could rewrite the script to get the IPs to ban from the SimpleView logfile.
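
If I do end up adapting it, the core would only be something like this. A sketch, not the actual BlockNodesViaFail2Ban.md script: the log path is hypothetical, and it assumes the peer’s address appears somewhere on the same log line.

    # list the peers most often reported with HardForkEncoderDisabledEra;
    # the regex just grabs anything on those lines that looks like an IPv4 address
    nice grep 'HardForkEncoderDisabledEra' /var/log/cardano/node.log \
        | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
        | sort | uniq -c | sort -rn | head -20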

But before doing that… there must be a reason why the new node still accepts connections from the pre-1.24.2 nodes. It would have been simple for the 1.24.2 nodes to drop those connections immediately, and if the node weren’t doing anything at all with the old peers, they wouldn’t lead to increased CPU and memory use. Therefore some information must still be transferred (enough to keep eating memory), which might serve some valid purpose… and if so, banning the old nodes would block that.

@_ilap has had some sensible things to say on similar subjects so I would ask if he has some insight about this.

topologyUpdater should have a blacklist of nodes which are not working properly…

We may be getting some progress on the GitHub issue above, following up on the revised hypothesis that allowing connections from pre-Allegra nodes causes something that looks like a constant-rate memory leak.

Our graph of memory use, with the smooth diagonal edges (prior to memory clipping just below 100%, when it begins disk swapping), looks exactly like the graph in the GitHub issue:

[Screenshot from 2021-01-11 19-39-21: memory usage graph]

The point at which swapping becomes apparent in the disk I/O is generally when we restart the node, but the node development team has asked us (in this comment) to provide stats of incoming & outgoing connections so they can help diagnose the problem, and that has to be captured while the problem is still happening (i.e. before the restart).
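
In case anyone else gets asked for the same data, this is roughly the snapshot we plan to grab just before the next restart (a sketch: the output path is arbitrary and 3001 is our relay port):

    # one-shot snapshot taken while the problem is still happening
    {
        date
        uptime                                   # load averages
        free -h                                  # memory and swap
        netstat -46 -tn | grep -c ESTABLISHED    # total established TCP connections
        netstat -46 -tn | grep ':3001 '          # detail of connections on the relay port
    } >> /tmp/relay-snapshot.txt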

I have the same problem on one of my relays, which led me to implement leak detection and an automatic restart of that node when used memory reaches a certain threshold. You should never wait for swapping to begin before restarting your node: swapping may cause I/O to stall, and your node may time out on queries and appear dead to other peers.
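
Roughly the shape of what I mean; a simplified sketch, not my exact script, with the 85% threshold and the cardano-node.service unit name as placeholders to adapt:

    #!/usr/bin/env bash
    # Restart the node service when used memory (excluding cache) crosses a
    # threshold, so we act before the box starts swapping. Run from root's cron.
    THRESHOLD=85   # percent of physical RAM

    used_pct=$(free | awk '/^Mem:/ { printf "%d", $3 / $2 * 100 }')

    if [ "$used_pct" -ge "$THRESHOLD" ]; then
        logger "relay watchdog: memory at ${used_pct}%, restarting cardano-node"
        systemctl restart cardano-node.service
    fi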

A note about the “bad Tx”: from what I’ve noticed, these Tx are not actually bad; they are just received several times and, of course, only the first one is valid. This might be a logging concurrency problem; I think I saw a GitHub issue about it some time ago. So the more peers a node is connected to, the more “bad Tx” it will receive; it might even be seen as a sign of good peer connectivity…

thanks, I hadn’t thought of that possibility… :sunglasses: