V1.25.1 relay node RAM usage very high - is that a bug?

Hello Cardanians,

I am setting up my nodes right now for the new tigr pool. I started first with a relay node on the testnet.
During this testing I saw that RAM usage is increasing steadily over time, which raises the question: is there a memory leak in V1.25.1?

After 32 mins: 2.5 GB RAM usage, 33 peers in topology.


After 14 h 40 mins: 3.83 GB RAM usage, 33 peers in topology.

I read in the forum that others see around 700 MB to 1000 MB RAM usage per relay, even on longer runtimes. 3.83 GB for serving 33 peers sounds very inefficient. Is something wrong with my node? Where does this high memory consumption come from?

Normally I deal with C# backend applications serving several hundred users with less than a GB of RAM usage. That's a huge difference here.

I’ve checked the logs:

I found:
[hn-thub1:cardano.node.DnsSubscription:Warning:84] [2021-02-06 21:40:48.82 UTC] Domain: "relay1.osterlindh.com" Failed to start all required subscriptions
[hn-thub1:cardano.node.DnsSubscription:Warning:77] [2021-02-06 21:40:48.82 UTC] Domain: "iodc.hopto.org" Failed to start all required subscriptions
… (9 more of those)

And

[hn-thub1:cardano.node.IpSubscription:Error:190] [2021-02-06 21:41:44.58 UTC] IPs: 0.0.0.0:0
[hn-thub1:cardano.node.IpSubscription:Error:715] [2021-02-06 22:15:39.99 UTC] IPs: 0.0.0.0:0 [192.168.16.1:50000,192.168.16.2:50000,192.168.16.3:50000,51.79.141.170:7900,95.216.178.106:3001,95.217.133.234:6000,95.179.169.157:6600,116.203.233.9:3002,185.173.235.164:5001,18.132.238.21:3001,3.9.209.70:6000,146.166.116.172:7172,146.166.116.170:7170,24.37.174.13:3005,157.245.131.60:6000,79.97.151.246:30000,54.241.77.32:3001,209.126.3.185:7031,104.198.217.123:3010,3.135.9.245:6001,3.14.16.248:3001,198.0.113.61:3001,40.76.58.6:6000] Connection Attempt Exception, destination 192.168.16.2:50000 exception: Network.Socket.connect: <socket: 36>: timeout (Connection timed out)
[hn-thub1:cardano.node.IpSubscription:Error:59] [2021-02-06 22:15:39.99 UTC] IPs: 0.0.0.0:0 [192.168.16.1:50000,192.168.16.2:50000,192.168.16.3:50000,51.79.141.170:7900,95.216.178.106:3001,95.217.133.234:6000,95.179.169.157:6600,116.203.233.9:3002,185.173.235.164:5001,18.132.238.21:3001,3.9.209.70:6000,146.166.116.172:7172,146.166.116.170:7170,24.37.174.13:3005,157.245.131.60:6000,79.97.151.246:30000,54.241.77.32:3001,209.126.3.185:7031,104.198.217.123:3010,3.135.9.245:6001,3.14.16.248:3001,198.0.113.61:3001,40.76.58.6:6000] Failed to start all required subscriptions

[2021-02-06 22:14:42.78 UTC] IPs: 0.0.0.0:0 [51.79.141.170:7900,95.216.178.106:3001,95.217.133.234:6000,95.179.169.157:6600,116.203.233.9:3002,185.173.235.164:5001,18.132.238.21:3001,3.9.209.70:6000,146.166.116.172:7172,146.166.116.170:7170,24.37.174.13:3005,157.245.131.60:6000,79.97.151.246:30000,54.241.77.32:3001,209.126.3.185:7031,104.198.217.123:3010,3.135.9.245:6001,3.14.16.248:3001,198.0.113.61:3001,40.76.58.6:6000] Application Exception: 104.198.217.123:3010 ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (ShelleyMAEra 'Mary StandardCrypto)) (' *))))))) (ServerAgency TokNext TokCanAwait)

→ I assume this is the culprit.

Any thoughts on that?

I am not sure I know the answer, but it seemed to me that when I added a bunch of relays, it also corresponded with more memory being consumed. So fewer relays should mean less memory. It also seems another parallel process is spun up, which may take up CPU as well. I actually ended up having to reboot when I had 70 relays in my topology.


70? The recommended number is ~20.

Cheers!


Indeed, I had not gotten around to reading that. So I learned via pain :stuck_out_tongue:


Indeed, there is a correlation: more peers equals more RAM usage.
Let's assume the node uses 500 MB in idle mode; that would leave 3330 MB for the 33 peers.
That's roughly 100 MB per connected peer, which seems way too much (?). I would say there is room to improve in the future. I might ask that question directly on the GitHub page.
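
As a rough back-of-the-envelope check (a minimal sketch in the shell; the 500 MB idle figure is just my assumption):

# (measured total MB - assumed idle MB) / number of peers ≈ MB per connected peer
echo $(( (3830 - 500) / 33 ))    # prints 100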

From the other responses I understand that it behaves the same on your side and that the solution you chose is to limit the peers. I will do further long-term tests in the next weeks, and if it stays at a certain level (does not rise over time), I will leave it as it is.
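
Probably something simple like this to log the RSS every 10 minutes over the test period (the process name cardano-node is an assumption based on a default setup):

while true; do echo "$(date -Is) $(ps -C cardano-node -o rss=)" >> node-ram.log; sleep 600; done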

Thanks for the feedback!

@tigrpool.com could you please mark one of the answers as the solution if it helped with your question?

@laplasz: Hello laplasz, the question is actually not answered. From the low response I just assume that people have the same issue and accept the high memory usage.
Hence, so far I would not like to call the topic "solved".

I will mark it as solved if someone else confirms my theory.


I also noticed more RAM usage at the turn of the last epoch with the new 1.25.1 version.
4 GB is not enough anymore. The RAM spike is not directly related to the number of peers, as I consistently have c. 14 to 15 peers on both relays and the RAM utilisation still spiked by about 20%.

I did not investigate further, but rather expect that utilisation will keep growing as more functionality gets added in the run-up to the Goguen go-live.

I am actually having the same issue. I suppose that because we are using topologyUpdater.sh, it's causing us to have more relays. Does anyone have any suggestions on how to remediate the issue?

How many peers do you have?
Also, the default setting is "TraceMempool": true.
Try "TraceMempool": false in your config file, then restart the node and keep it under monitoring.
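
For example (a minimal sketch; the config file name mainnet-config.json and the exact spelling/spacing of the setting are assumptions, adjust to your setup):

grep TraceMempool mainnet-config.json
sed -i 's/"TraceMempool": true/"TraceMempool": false/' mainnet-config.json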


That one really is good advice.

How many peers do you have?
Also, the default setting is "TraceMempool": true.
Try "TraceMempool": false in your config file, then restart the node and keep it under monitoring.

It lowered the memory consumption (not by that much, but it helped).

Thanks.

What is the command you're using to bring up those stats? I am curious to monitor my stuff myself.

Hello @Anti.biz,
I use standard Linux tools for that; they are available in all commonly known distributions.

For monitoring of CPU, RAM and processes, use htop:

apt-get -y install htop;
htop;

For monitoring of disk usage, use iotop:

apt-get -y install iotop;
iotop;

For monitoring of network bandwidth and usage, use iftop:

apt-get -y install iftop;
iftop;

To install all at once use:

apt-get -y install htop iotop iftop;
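
If you want to watch the cardano-node process memory specifically, something like this should also work (assuming your node process is named cardano-node):

watch -n 10 'ps -C cardano-node -o pid,rss,vsz,cmd'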

Good luck!


Hi @tigrpool.com, have you solved the issue with memory?
I have the same issue: memory slowly climbing to 100% every 24 hours, so I have to restart the cnode service manually. Both relays are behaving identically.
In my topologyUpdater, Max_peers is set to 14 and the nodes are running 17 out and 7 in.
When I disable TraceMempool I do not see processed transactions, and in that case it is hard to say whether the relay is processing transactions.
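
For reference, that peer cap is just a variable near the top of my copy of topologyUpdater.sh; the exact variable name may differ in your version:

MAX_PEERS=14    # assumed variable name; caps how many peers the fetched topology contains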

Have you tried to disable TraceMempool as Alex1985 suggested earlier? It did the job for my pool.

Hi @CryptoTorben, thank you for the reply. Yes I did, but I do not like the idea of not being able to see processed transactions, and I am just wondering if this will be fixed in the near future.

As long as you are able to see processed transactions on the block producer, everything should be fine. That aside, I would like to see it fixed as well.

@CryptoTorben so all of your relays are running with TraceMempool disabled?

Yea, and the BP is crunching away ;o)

You can subscribe to this issue to see how they’re doing on fixing this in Ouroboros (so the fix can then be incorporated into the next release of the node):
