All nodes keep restarting?

Hi all,

For some reason, two of my relay nodes keep restarting. I see them connect for a short time and then disappear again. I checked the hardware usage and everything seemed fine (each of the 2 nodes runs on 8 vCPU, 8 GB RAM, and an 80 GB disk).

When I run the journal command I get the following on both nodes:

Aug 17 00:24:09 -: /opt/cardano/cnode/scripts/cnode.sh: line 57: 1549347 Killed                  cardano-node "${CPU_RUNTIME[@]}" run --topology "${TOPOLOGY}" --config "${CONFIG}" --database-path "${DB_DIR}" --socket-path "${CARDANO_NODE_SOCKET_PATH}" --port ${CNODE_PORT} "${host_addr[@]}"
Aug 17 00:24:24 -: WARN: A prior running Cardano node was not cleanly shutdown, socket file still exists. Cleaning up.
Aug 17 00:24:26 -: Listening on http://0.0.0.0:12798
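A "Killed" line printed by bash means the process received SIGKILL, and one common sender is the kernel's out-of-memory killer. A minimal sketch for checking that, assuming a systemd host where the kernel log is readable (possibly only with sudo); the helper name oom_lines is made up for this example and is not part of cnode.sh:

```shell
#!/usr/bin/env bash
# Filter log text on stdin down to lines that look like kernel OOM-killer activity.
# Prints nothing (and still exits 0) when no such line is found.
oom_lines() {
  grep -i -E 'out of memory|oom-kill|killed process' || true
}

# Typical use on the host:
#   journalctl -k --since "2 hours ago" | oom_lines
#   dmesg -T | oom_lines
```

If this prints lines naming cardano-node around the time of the restart, memory is the likely cause. If it prints nothing, the kill probably came from somewhere else.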

I tried shutting down the node and exporting a new socket file (I checked the folder path and it is correct).

export CARDANO_NODE_SOCKET_PATH=/opt/cardano/cnode/sockets/node0.socket
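Note that the export only tells client tools where to look. The socket file itself is created by cardano-node when it starts. A small sketch for checking what actually exists at that path; socket_status is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Report what exists at a given path: "socket", "not-a-socket", or "missing".
socket_status() {
  local path="$1"
  if [ -S "$path" ]; then
    echo "socket"          # a UNIX domain socket exists at this path
  elif [ -e "$path" ]; then
    echo "not-a-socket"    # something exists, but it is not a socket
  else
    echo "missing"         # nothing at this path (node not started yet?)
  fi
}

# Example:
#   socket_status "${CARDANO_NODE_SOCKET_PATH}"
```

A result of "missing" while the node is still in its startup phase would not be surprising, so this is best checked once the node has been up for a while.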

This doesn’t seem to do the trick, however. I checked the CPU usage after a couple of minutes: both nodes are sitting at 100% in top, although the host’s dashboard says 25~30%. I also see that the disks have been reading quite a bit since the nodes went offline. However, there is no unusual incoming traffic or behavior on the server.

I also checked gLiveView.sh. The nodes remain on status “starting…”. What am I overlooking? How might I fix this issue?

Help is much appreciated!

Hi,

It is very common for a node to restart if it lacks memory. Two nodes sharing 8 GB will definitely restart.
A node process consumes around 7.5 GB - 8.5 GB of RAM. So do the maths :)
If you have multiple relays on a single VM, provision at least 8 GB per instance. And don’t forget that the OS needs to eat too.

Thanks for answering. Each node has its own VM with 8 GB. I don’t understand how both nodes could run for months on end and only now both keep restarting at the same time. The nodes are in different locations, so it can’t be a physical connection error either.

What does “free -h” say?

Both show similar numbers:

              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       287Mi       119Mi       0.0Ki       7.4Gi       7.3Gi
Swap:            0B          0B          0B
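In this output almost all memory sits in buff/cache, which the kernel can largely reclaim, so "available" is the column to watch rather than "free". A rough sketch that turns the Mem: line into a single percentage, assuming the six-column layout shown above (total, used, free, shared, buff/cache, available) and numeric input from free -m rather than free -h; mem_available_pct is a made-up name:

```shell
#!/usr/bin/env bash
# Read `free -m` output on stdin and print available memory as a whole percent of total.
# Columns on the Mem: line: total used free shared buff/cache available
mem_available_pct() {
  awk '/^Mem:/ && $2 > 0 { printf "%d\n", ($7 * 100) / $2 }'
}

# Example:
#   free -m | mem_available_pct
```

A snapshot taken a few minutes after a restart says little about peak usage, so it is worth running this repeatedly while the node loads its database. The Swap: line of all zeros also means there is no swap to absorb a spike.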

Right now it might have to do with the node socket again (I restarted them a couple of minutes ago).

It has probably just restarted and is going through the usual DB scan. Once it has run for a bit, it will average around 7.5 GB. Mine currently runs at 7.973 GB.
In any case, my bet would be on memory. Even if it ran OK before, I’d beef it up. As an experiment, double the RAM and see how it goes. You can always change it back.
If you are saying that they had been running like that before, then it could be something else…
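To compare your own node against figures like these, you can read the resident memory of the running process. A minimal sketch, assuming a Linux host with the procps version of ps and a process named cardano-node; both helper names are invented for this example:

```shell
#!/usr/bin/env bash
# Convert a size in KiB (the unit `ps -o rss=` reports) to GiB with two decimals.
kib_to_gib() {
  awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1048576 }'
}

# Print PID and resident memory for every process named cardano-node.
node_rss() {
  ps -C cardano-node -o pid=,rss= | while read -r pid rss; do
    echo "pid ${pid}: $(kib_to_gib "$rss") GiB"
  done
}

# Example:
#   node_rss
```

Running node_rss every few minutes during startup shows whether usage climbs toward the VM's total before the process is killed.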

These are the numbers when your nodes are “relaxed”, but they actually consume around 7 GB when running under stress. So you need to upgrade your hardware.

Go to the configuration file, set TraceMempool to false, save the file, and restart the node.
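As a sketch of that change, assuming the node config is a JSON file containing a "TraceMempool": true entry. The config path and the service name in the usage comment are guesses based on the /opt/cardano/cnode layout seen in the log, and disable_trace_mempool is a made-up helper:

```shell
#!/usr/bin/env bash
# Rewrite "TraceMempool": true to false in a JSON config, keeping a .bak copy.
disable_trace_mempool() {
  local config="$1"
  cp "$config" "${config}.bak"
  sed 's/"TraceMempool"[[:space:]]*:[[:space:]]*true/"TraceMempool": false/' \
    "${config}.bak" > "$config"
}

# Assumed path and service name, adjust to your install:
#   disable_trace_mempool /opt/cardano/cnode/files/config.json
#   sudo systemctl restart cnode
```

The backup copy makes it easy to revert if the change makes no difference.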

Is it starting now?

Just woke up (after a couple of hours of updating, upgrading, checking and restarting the nodes and letting them sync and take their time). Both nodes seem to be working again and are stable once more. The hardware seems to be sufficient too.

I will keep the TraceMempool change in mind in case the nodes start behaving strangely again.