[IN PROGRESS] CPU load at maximum since 1.25.1

Hello there,

Since 1.25.1, my relays use a lot of CPU constantly.

I had this issue on 1.24.x because of the well-known “HardForkEncoderDisabledEra” issue.
But that is not the case here, as the logs don’t contain that error. In fact, there is no error at all.

The only thing I can see that is bugging me is the spam of “BadInputsUTxO” logs like this one:

{"at":"2021-02-06T18:36:10.08Z","env":"1.25.1:9a733","ns":["cardano.node.Mempool"],
"data":{"tx":{"txid":"txid: TxId {_unTxId ="11b27c665310dc29765054f1f4aa3cf8087d2c4aa5d618be98ac84824effb82d\"}"},
"kind":"TraceMempoolRejectedTx","mempoolSize":{"numTxs":6,"bytes":3075},"err":
{"kind":"BadInputsUTxO","error":"The transaction contains inputs that do not exist in the UTxO set.","consumed":0,"badInputs":
[["c31b3d8a75d542466c8f9e87b2121868a17358d18e4213f0a48f267848d0fcd9",0]],
"produced":77411931261}},"app":],"msg":"","pid":"3290056","loc":null,"host":"ubuntu","sev":"Info","thread":"346"}

Doing a tail -f on each log displays this line every second. It’s not a different “BadInputsUTxO” log on each line, but the very same one.
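To confirm it really is the same transaction being rejected over and over, one quick check is to count rejections per txid. A minimal sketch in Python, assuming log lines shaped like the TraceMempoolRejectedTx entry above (the sample lines and the txid “abc” are made up for illustration):

```python
import json
from collections import Counter

# Made-up sample lines in the shape of the TraceMempoolRejectedTx entry;
# in practice you would read them from the node's log file.
lines = [
    '{"ns":["cardano.node.Mempool"],"data":{"tx":{"txid":"abc"},'
    '"kind":"TraceMempoolRejectedTx","err":{"kind":"BadInputsUTxO"}}}',
    '{"ns":["cardano.node.Mempool"],"data":{"tx":{"txid":"abc"},'
    '"kind":"TraceMempoolRejectedTx","err":{"kind":"BadInputsUTxO"}}}',
]

rejections = Counter()
for line in lines:
    data = json.loads(line).get("data", {})
    if data.get("kind") == "TraceMempoolRejectedTx":
        # Count how many times each txid was rejected by the mempool
        rejections[data["tx"]["txid"]] += 1

print(rejections.most_common(3))
```

If one txid dominates the counts, the relay really is re-validating the same duplicate over and over.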

Has anyone experienced this recently? I haven’t found anything on GitHub about a similar issue.

Have a good weekend.

2 Likes

Problem solved by disabling “Trace Mempool” in the config file.
CPU is now performing as it should.

EDIT : To clarify, disable Trace Mempool only on the relays. Keep it at “true” on the block-producer to be able to monitor transactions.
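For reference, this is the shape of the change in the node configuration file (e.g. mainnet-config.json; the filename may differ in your setup). Only the relays’ copies get this value; the block-producer keeps true:

```json
{
  "TraceMempool": false
}
```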

3 Likes

My main question: if you want to keep your relays running this way, how would you generally know the relay was accepting transactions at all?

I considered this as a somewhat drastic measure for one of our two relays. Unlike the other, it has peers who like to submit about 12 copies of each transaction. The first in each of those groups succeeds, and of course all the following copies fail with invalid UTxOs.

So it certainly keeps the load average down on the affected relay to turn off TraceMempool… presumably by saving the CPU load of having to mathematically invalidate the remaining ~11 transactions… but then unfortunately we have no confirmation the original, valid transaction is going through either. :confounded:

I turn TraceMempool off for the relays only, not for the block-producer.

That’s how I keep track of the growing transaction numbers without CPU issues.

Do we have to change this parameter? Is it official?
I also see 15-20% more CPU after the upgrade.

Maybe I will leave just one relay with the “false” option for a few days, to check how it affects operation.

Hello, @tsipou

You don’t “have to” change this parameter. I did it because my relays were consuming a lot of CPU because of these bad transactions.

Like any other Trace, it’s there to log more data, but I assume it doesn’t change the core behavior of relays, as my tip is always the highest and I still see the transaction count growing on the block-producer.
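For example, a minimal way to pull the mempool size out of such a trace line on the block-producer (field names taken from the sample entry earlier in the thread; the line itself is abbreviated for illustration):

```python
import json

# Abbreviated log line following the mempoolSize shape quoted earlier.
line = ('{"ns":["cardano.node.Mempool"],"data":{"kind":"TraceMempoolRejectedTx",'
        '"mempoolSize":{"numTxs":6,"bytes":3075}}}')

# Extract the mempool size reported by the trace entry
size = json.loads(line)["data"]["mempoolSize"]
print(f'mempool holds {size["numTxs"]} txs ({size["bytes"]} bytes)')
```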

Here is the definition of the tracers:

1 Like

OK, I see your point and case.
Thank you for responding.

If I understand this correctly, is someone sending redundant transactions to the blockchain? Is there any chance this is malicious? Almost like a DDoS?

I’m seeing this too. I’ve decided to upgrade our infrastructure to handle it for now. We alarm on transactions on our relays; it’s one of the main checks we have to make sure we are in good health. It’s rough for us to lose this visibility by turning the Mempool trace to false.


1 Like

If I understand this correctly, is someone sending redundant transactions to the blockchain? Is there any chance this is malicious? Almost like a DDoS?

It seems so. Saying it’s like a DDoS is a step I won’t take, but it can be quite expensive in terms of resources.

It’s rough for us to lose this visibility by turning Mempool trace to false.

As I said and did, you can keep the Mempool trace set to “true” on the block-producer only. This way, you can still keep track of your growing transaction count without needing to upgrade your infra because of this spam.

2 Likes

I’ve posed this question (definitely not “solved” yet) officially here:

Excellent!

I couldn’t have said it better, as I’m not fluent in English, but it describes the issue with clarity and concision.

I’m looking forward to hearing from the devs.

EDIT : I’ve unchecked my previous “solved” reply and changed the post title to reflect the current situation.

1 Like

Apparently, it’s going to be addressed:

2 Likes

Thanks for filing this, guys. I have the same issue since the hard fork.
For now I have mitigated it by assigning more resources to the relay nodes.

It’s important to start the node with the flags +RTS -N4 -RTS for 4 cores (or +RTS -N8 -RTS for 8); otherwise it will stick at utilizing no more than 2 cores in parallel.
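As a sketch (the file paths here are placeholders for your own), a start-up line with those RTS options would look like:

```shell
# +RTS -N4 -RTS asks the GHC runtime for 4 capabilities (cores).
cardano-node run \
  --config ./mainnet-config.json \
  --topology ./mainnet-topology.json \
  --database-path ./db \
  --socket-path ./node.socket \
  +RTS -N4 -RTS
```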

It would be great to get some insights into what the “TraceMempool” flag in the config actually does, and what the effect of turning it off is. The difference between on and off in CPU usage is about a factor of 8 in my case (~5% when off => ~40% when on), and it results in ~10k log messages an hour.
I cannot believe that mere logging results in this much CPU usage.

If the point of the setting is not about the logging of transactions, but about the actual settling/validating of them, there are some other questions:

  1. Why does the node attempt to validate the same transaction multiple times (logically resulting in everything after the first attempt failing)?
  2. What is the added value of a node with “TraceMempool” off, if it results in it no longer validating the transactions? Is it enough to do this only on BP nodes? What is the effect on the network if we turned this trace off on all relays?

Hello,

I think the link from my previous reply gives the answer to your first point.

For the second point, all the Traces are used to log what’s happening on your relays/block-producer. Turning them on/off only affects the information you want to display or save to a text file.
This doesn’t affect the core behaviour of the node.
I turned it ON only on the block-producer and was still able to create blocks.

Hey,

I see what you are saying, but as I mentioned in my previous message, it is hard for me to believe that just the logging of ~10k lines will cost ~35% of my CPU capacity (that would be some crappy coding if it did)… so there must be more going on! I simply want to understand what that is, and what the effect is of turning it off, not just for my BP node but for the network as a whole.

I think that this commit explains the problem: Fix performance regression in shelley-ma · input-output-hk/cardano-ledger-specs@149c8ec · GitHub

It’s just that enabling tracing of mempool data implies creating a log entry detailing the problem, listing the UTxOs involved in a very inefficient way. This was fixed for Shelley in this PR but removed when they switched to the Mary era…
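A toy Python illustration (not the real Haskell code from the linked commit) of why the rendering path matters: serialising a large structure on every rejection costs far more than logging a constant-size summary, and that per-rejection cost multiplies across thousands of duplicate transactions an hour. The sizes and shapes here are invented:

```python
import json
import timeit

# Invented stand-in for a large ledger-side structure.
big_state = {f"utxo{i}": i for i in range(50_000)}

def log_expensive():
    # Naive path: render the whole large structure on every rejection.
    return len(json.dumps(big_state))

def log_cheap():
    # Fixed path: log only the offending input, a constant-size payload.
    return len(json.dumps({"badInputs": ["c31b3d...#0"]}))

slow = timeit.timeit(log_expensive, number=20)
fast = timeit.timeit(log_cheap, number=20)
print(f"expensive: {slow:.3f}s, cheap: {fast:.3f}s")
```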

1 Like

Thanks! That explains a lot. So the issue should be gone again in the next version.

This should be fixed in 1.26.0, tagged today:

  • Fixed a performance regression in the Allegra and Mary eras. This resulted in
    node slowdown when validating failed transactions. (#2144)
7 Likes

I am having the CPU problem on the block-producer node. I need to upgrade (2 CPUs, 8 GB RAM), but is there any harm in setting it to false on the block-producer?

Hello @wones0120

There is no harm in setting it to false. It is just there to output information; it doesn’t change your block-producer’s behaviour.

The only downside is that you won’t be able to see if you’re processing transactions.