[IN PROGRESS] CPU load at maximum since 1.25.1

Hello there,

Since 1.25.1, my relays use a lot of CPU constantly.

I had this issue on 1.24.x because of the well-known “HardForkEncoderDisabledEra” issue.
But that is not the case here, as the logs don’t contain that error. In fact, there is no error at all.

The only thing I can see that is bugging me is the spam of “BadInputsUTxO” logs like this one:

{"at":"2021-02-06T18:36:10.08Z","env":"1.25.1:9a733","ns":["cardano.node.Mempool"],
"data":{"tx":{"txid":"txid: TxId {_unTxId ="11b27c665310dc29765054f1f4aa3cf8087d2c4aa5d618be98ac84824effb82d\"}"},
"kind":"TraceMempoolRejectedTx","mempoolSize":{"numTxs":6,"bytes":3075},"err":
{"kind":"BadInputsUTxO","error":"The transaction contains inputs that do not exist in the UTxO set.","consumed":0,"badInputs":
[["c31b3d8a75d542466c8f9e87b2121868a17358d18e4213f0a48f267848d0fcd9",0]],
"produced":77411931261}},"app":],"msg":"","pid":"3290056","loc":null,"host":"ubuntu","sev":"Info","thread":"346"}

Doing a tail -f on each log displays this line every second. It’s not a different “BadInputsUTxO” log on each line, but the very same one.
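To confirm it really is the same transaction being rejected over and over, one quick check is to count rejections per txid. A minimal sketch in Python, assuming log lines shaped like the TraceMempoolRejectedTx entry above (the sample lines and the txid “abc” are made up for illustration):

```python
import json
from collections import Counter

# Made-up sample lines in the shape of the TraceMempoolRejectedTx entry;
# in practice you would read them from the node's log file.
lines = [
    '{"ns":["cardano.node.Mempool"],"data":{"tx":{"txid":"abc"},'
    '"kind":"TraceMempoolRejectedTx","err":{"kind":"BadInputsUTxO"}}}',
    '{"ns":["cardano.node.Mempool"],"data":{"tx":{"txid":"abc"},'
    '"kind":"TraceMempoolRejectedTx","err":{"kind":"BadInputsUTxO"}}}',
]

rejections = Counter()
for line in lines:
    data = json.loads(line).get("data", {})
    if data.get("kind") == "TraceMempoolRejectedTx":
        # Count how many times each txid was rejected by the mempool
        rejections[data["tx"]["txid"]] += 1

print(rejections.most_common(3))
```

If one txid dominates the counts, the relay really is re-validating the same duplicate over and over.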

Has anyone experienced this recently? I haven’t found anything on GitHub about a similar issue.

Have a good weekend.

2 Likes

Problem solved by disabling “Trace Mempool” in the config file.
CPU is now performing as it should.

EDIT : To clarify, disable Trace Mempool only on the relays. Keep it at “true” on the block-producer to be able to monitor transactions.
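For reference, this is the shape of the change in the node configuration file (e.g. mainnet-config.json; the filename may differ in your setup). Only the relays’ copies get this value; the block-producer keeps true:

```json
{
  "TraceMempool": false
}
```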

3 Likes

My main question: if you want to keep your relays running this way, how would you generally know the relay was accepting transactions at all?

I considered this as a somewhat drastic measure for one of our two relays. Unlike the other, it has peers who like to submit about 12 copies of each transaction. The first in each of those groups succeeds, and of course all the following copies fail with invalid UTxOs.

So it certainly keeps the load average down on the affected relay to turn off TraceMempool… presumably by saving the CPU load of having to mathematically invalidate the remaining ~11 transactions… but then unfortunately we have no confirmation the original, valid transaction is going through either. :confounded:

I turn TraceMempool off for the relays only, not for the block-producer.

That’s how I keep track of the growing transaction numbers without CPU issues.

Do we have to change this parameter? Is it official?
I also see 15-20% more CPU after the upgrade.

Maybe I will leave just one relay with the “false” option for a few days, to check how it affects operation.

Hello, @tsipou

You don’t “have to” change this parameter. I did it because my relays were consuming a lot of CPU because of these bad transactions.

Like any other Trace, it’s there to log more data, but I assume it doesn’t change the core behavior of relays, as my tip is always the highest and I still see the transaction count growing on the block-producer.
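For example, a minimal way to pull the mempool size out of such a trace line on the block-producer (field names taken from the sample entry earlier in the thread; the line itself is abbreviated for illustration):

```python
import json

# Abbreviated log line following the mempoolSize shape quoted earlier.
line = ('{"ns":["cardano.node.Mempool"],"data":{"kind":"TraceMempoolRejectedTx",'
        '"mempoolSize":{"numTxs":6,"bytes":3075}}}')

# Extract the mempool size reported by the trace entry
size = json.loads(line)["data"]["mempoolSize"]
print(f'mempool holds {size["numTxs"]} txs ({size["bytes"]} bytes)')
```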

Here is the definition of the tracers:

1 Like

OK, I see your point and case.
Thank you for responding.

If I understand this correctly, is someone sending redundant transactions to the blockchain? Is there any chance this is malicious? Almost like a DDoS?

I’m seeing this too. I’ve decided to upgrade our infrastructure to handle it for now. We alarm on transactions on our relays; it’s one of the main checks we have to make sure we are in good health. It’s rough for us to lose this visibility by turning the Mempool trace to false.


1 Like

If I understand this correctly, is someone sending redundant transactions to the blockchain? Is there any chance this is malicious? Almost like a DDoS?

It seems so. Saying it’s like a DDoS is a step I won’t take, but it can be quite expensive in terms of resources.

It’s rough for us to lose this visibility by turning Mempool trace to false.

As I said and did, you can keep the Mempool trace set to “true” on the block-producer only. This way, you can still keep track of your growing transaction count without needing to upgrade your infra because of this spam.

2 Likes

I’ve posed this question (definitely not “solved” yet) officially here:

Excellent!

I couldn’t have said it better, as I’m not fluent in English, but it describes the issue with clarity and concision.

I’m looking forward to hearing from the devs.

EDIT : I’ve unchecked my previous “solved” reply and changed the post title to reflect the current situation.

1 Like

Apparently, it’s going to be addressed:

2 Likes

Thanks for filing this, guys. I have the same issue since the hard fork.
For now I have mitigated it by assigning more resources to the relay nodes.

It’s important to start the node with the flags +RTS -N4 -RTS for 4 cores (or +RTS -N8 -RTS for 8); otherwise it will stick at utilizing no more than 2 cores in parallel.
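As a sketch (the file paths here are placeholders for your own), a start-up line with those RTS options would look like:

```shell
# +RTS -N4 -RTS asks the GHC runtime for 4 capabilities (cores).
cardano-node run \
  --config ./mainnet-config.json \
  --topology ./mainnet-topology.json \
  --database-path ./db \
  --socket-path ./node.socket \
  +RTS -N4 -RTS
```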

It would be great to get some insights into what the “TraceMempool” flag in the config actually does, and what the effect of turning it off is. The difference between on and off in CPU usage is about a factor of 8 in my case (~5% when off => ~40% when on), and it results in ~10k log messages an hour.
I cannot believe that mere logging results in this much CPU usage.

If the point of the setting is not about the logging of transactions, but about the actual settling/validating of them, there are some other questions:

  1. Why does the node attempt to validate the same transaction multiple times (logically resulting in everything after the first attempt failing)?
  2. What is the added value of a node with “TraceMempool” off, if it results in it no longer validating the transactions? Is it enough to do this only on BP nodes? What is the effect on the network if we turned this trace off on all relays?

Hello,

I think the link from my previous reply gives the answer to your first point.

For the second point, all the Traces are used to log what’s happening on your relays/block-producer. Turning them on/off only affects the information you want to display or save to a text file.
This doesn’t affect the core behaviour of the node.
I turned it ON only on the block-producer and was still able to create blocks.

Hey,

I see what you are saying, but as I mentioned in my previous message, it is hard for me to believe that just the logging of ~10k lines will cost ~35% of my CPU capacity (that would be some crappy coding if it did)… so there must be more going on! I simply want to understand what that is, and what the effect is of turning it off, not just for my BP node but for the network as a whole.

I think that this commit explains the problem: Fix performance regression in shelley-ma · input-output-hk/cardano-ledger-specs@149c8ec · GitHub

It’s just that enabling tracing of mempool data implies creating a log entry detailing the problem, listing the UTxOs involved in a very inefficient way. This was fixed for Shelley in this PR but removed when they switched to the Mary era…
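A toy Python illustration (not the real Haskell code from the linked commit) of why the rendering path matters: serialising a large structure on every rejection costs far more than logging a constant-size summary, and that per-rejection cost multiplies across thousands of duplicate transactions an hour. The sizes and shapes here are invented:

```python
import json
import timeit

# Invented stand-in for a large ledger-side structure.
big_state = {f"utxo{i}": i for i in range(50_000)}

def log_expensive():
    # Naive path: render the whole large structure on every rejection.
    return len(json.dumps(big_state))

def log_cheap():
    # Fixed path: log only the offending input, a constant-size payload.
    return len(json.dumps({"badInputs": ["c31b3d...#0"]}))

slow = timeit.timeit(log_expensive, number=20)
fast = timeit.timeit(log_cheap, number=20)
print(f"expensive: {slow:.3f}s, cheap: {fast:.3f}s")
```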

1 Like

Thanks! That explains a lot. So the issue should be gone again in the next version.

This should be fixed in 1.26.0, tagged today:

  • Fixed a performance regression in the Allegra and Mary eras. This resulted in
    node slowdown when validating failed transactions. (#2144)
7 Likes

I am having the CPU problem on the block-producer node. I need to upgrade (2 CPUs, 8 GB RAM), but is there any harm in setting it to false on the block-producer?

Hello @wones0120

There is no harm in setting it to false. It is just there to output information; it doesn’t change your block-producer’s behaviour.

The only downside is that you won’t be able to see if you’re processing transactions.