Increasing number of OOM crashes on BP since 1.34.1, and now it won't come back up

Howdy SPO folks.

My BP runs on a 4-core virtual server with 16 GB RAM and 4 GB swap, which had seemed plenty until recently. A week or two ago it started periodically crashing with out-of-memory errors, and now it won't come back up: it sits in "starting" for about an hour and then crashes again.

It seems more unstable after the 1.34.1 upgrade.

Any ideas?

Thanks!

P.S. Some log data. Before crashing it was doing this:

{"host":"ubuntu-1","pid":"18129","loc":null,"at":"2022-03-31T12:04:39.00Z","ns":["cardano.node.LeadershipCheck"],"sev":"Info","env":"1.34.1:73f9a","data":{"utxoSize":6453179,"kind":"TraceStartLeadershipCheck","delegMapSize":1178218,"credentials":"Cardano","slot":57161988,"chainDensity":4.717579e-2},"msg":"","thread":"5185","app":}
{"host":"ubuntu-1","pid":"18129","loc":null,"at":"2022-03-31T12:04:39.00Z","ns":["cardano.node.Forge"],"sev":"Info","env":"1.34.1:73f9a","data":{"val":{"kind":"TraceNodeNotLeader","slot":57161988},"credentials":"Cardano"},"msg":"","thread":"5185","app":}

Then, after restarting and running for about 30 minutes, it was doing this when it crashed:

{"host":"ubuntu-1","pid":"148130","loc":null,"at":"2022-03-31T12:29:22.68Z","ns":["cardano.node.ChainDB"],"sev":"Info","env":"1.34.1:73f9a","data":{"kind":"TraceImmutableDBEvent.ValidatedChunk","chunkNo":"2237"},"msg":"","thread":"5","app":}
{"host":"ubuntu-1","pid":"148130","loc":null,"at":"2022-03-31T12:29:22.68Z","ns":["cardano.node.ChainDB"],"sev":"Info","env":"1.34.1:73f9a","data":{"finalChunk":"2644","initialChunk":"2238","kind":"TraceImmutableDBEvent.StartedValidatingChunk"},"msg":"","thread":"5","app":}

None of that looks abnormal to me, and htop was reporting reasonable memory usage (about 6 GB) shortly before the crash.

The crash is reported in dmesg as:

[Thu Mar 31 14:04:39 2022] Out of memory: Killed process 18129 (cardano-node) total-vm:1074654484kB, anon-rss:8453716kB, file-rss:1528kB, shmem-rss:0kB, UID:1000 pgtables:24928kB oom_score_adj:0
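In case anyone wants to check the same thing on their node, this is roughly how I've been confirming the OOM kills and watching the node's memory since then. It's just stock Linux tooling; the process name cardano-node and the one-minute sample interval are simply what I happened to use:

# confirm OOM kills in the kernel log (human-readable timestamps)
dmesg -T | grep -i 'out of memory'
# or, via journald:
journalctl -k | grep -iE 'out of memory|oom-killer'

# sample the node's resident memory once a minute to catch growth htop might miss
while pid=$(pidof cardano-node); do
    echo "$(date -Is) rss_kb=$(ps -o rss= -p "$pid")"
    sleep 60
done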

Mystified!

I would suggest increasing the swap size and seeing what's going on. Maybe there are other services running on the server that are using the RAM? For me, 1.34.1 is running quite well on all servers; some commands for checking this are below.
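Something like this is usually enough to see where the memory is actually going (standard Linux tools, nothing pool-specific):

free -h                      # overall RAM and swap usage
swapon --show                # active swap devices/files and their sizes
ps aux --sort=-rss | head    # top memory consumers by resident set size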


Tnx, it did indeed seem to max out swap on the latest run, so I've bumped it to 16 GB. Crossing my fingers 🙂
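For the record, this is roughly how I resized it. A swap file at /swapfile is an assumption here; adjust to however your swap is actually set up:

sudo swapoff /swapfile                 # disable the existing swap file
sudo fallocate -l 16G /swapfile        # grow it to 16 GB
sudo chmod 600 /swapfile
sudo mkswap /swapfile                  # re-initialise the swap area
sudo swapon /swapfile                  # enable it again
# check /etc/fstab still has a line like: /swapfile none swap sw 0 0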


16 GB of RAM (even without swap) should be enough if you are not running other scripts such as cncli leaderlog on the same box.


There was some hard work done by one of our SPOs on garbage-collection tuning. If you haven't read it yet, it is worth your time; a rough sketch of what that tuning looks like is below.
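In case it saves someone a search: the tuning is done by passing GHC runtime (RTS) flags to cardano-node on its command line. The flag values and file paths below are purely illustrative placeholders, not the recommendations from that write-up:

# GHC RTS flags go between +RTS and -RTS on the node's command line.
#   -N4    use 4 capabilities (match your core count)
#   -A16m  larger allocation area, fewer minor GCs
#   -H3G   suggested overall heap size
#   -I0.3  idle-GC interval in seconds
cardano-node run \
  --config mainnet-config.json \
  --topology mainnet-topology.json \
  --database-path db \
  --socket-path node.socket \
  +RTS -N4 -A16m -H3G -I0.3 -RTS

If your node is built with a recent enough GHC, the non-moving collector (--nonmoving-gc RTS flag) is another option people experiment with, trading some throughput for flatter pause behaviour.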

Tnx folks, more swap seems to have resuscitated it for now 🙂