What happened? Producer and both relays down this morning

This morning I noticed my producer was only connected to 1 relay coming in and none going out. When I checked Relay #1, the node service had been restarting for 2 hours… I restarted Relay #1 and it won't come back up. What are the chances both relays would have issues on the same morning?

journalctl -e -f -u cnode.service
-- Logs begin at Tue 2021-08-31 16:26:48 UTC. --
Aug 31 16:27:02 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:27:04 relay1 cnode[747]: Listening on http://127.0.0.1:12798
Aug 31 16:38:26 relay1 systemd[1]: Stopping Cardano Node...
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: State 'stop-sigterm' timed out. Killing.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 747 (bash) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 817 (cardano-node) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Main process exited, code=killed, status=9/KILL
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 817 (cardano-node) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Failed with result 'timeout'.
Aug 31 16:38:31 relay1 systemd[1]: Stopped Cardano Node.
Aug 31 16:38:38 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:38:39 relay1 cnode[10270]: Listening on http://127.0.0.1:12798
Aug 31 16:56:21 relay1 cnode[10270]: /opt/cardano/cnode/scripts/cnode.sh: line 57: 10338 Killed cardano-node "${CPU_RUNTIME[@]}" run --topology "${TOPOLOGY}" --config "${CONFIG}" --database-path "${DB_DIR}" --socket-path "${CARDANO_NODE_SOCKET_PATH}" --port ${CNODE_PORT} "${host_addr[@]}"
Aug 31 16:56:21 relay1 systemd[1]: cnode.service: Main process exited, code=exited, status=137/n/a
Aug 31 16:56:21 relay1 systemd[1]: cnode.service: Failed with result 'exit-code'.
Aug 31 16:56:26 relay1 systemd[1]: cnode.service: Service RestartSec=5s expired, scheduling restart.
Aug 31 16:56:26 relay1 systemd[1]: cnode.service: Scheduled restart job, restart counter is at 1.
Aug 31 16:56:26 relay1 systemd[1]: Stopped Cardano Node.
Aug 31 16:56:33 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:56:34 relay1 cnode[4308]: WARN: A prior running Cardano node was not cleanly shutdown, socket file still exists. Cleaning up.
Aug 31 16:56:35 relay1 cnode[4308]: Listening on http://127.0.0.1:12798

EDIT: I haven't changed anything on the servers in months.

Relay #2 seems fine but is not connecting to the producer.

Was there a major change last night? Is 1.27 no longer a working version?

./gLiveView.sh
COULD NOT CONNECT TO A RUNNING INSTANCE, 3 FAILED ATTEMPTS IN A ROW!

After restarting the producer.

sudo systemctl status cnode
● cnode.service - Cardano Node
   Loaded: loaded (/etc/systemd/system/cnode.service; enabled; vendor preset: en
   Active: active (running) since Tue 2021-08-31 17:24:33 UTC; 12min ago
 Main PID: 615 (bash)
    Tasks: 10 (limit: 4915)
   Memory: 97.3M
   CGroup: /system.slice/cnode.service
           ├─615 bash /opt/cardano/cnode/scripts/cnode.sh
           └─683 cardano-node run --topology /opt/cardano/cnode/files/topology.j

Aug 31 17:24:33 producer1 systemd[1]: Started Cardano Node.
Aug 31 17:24:34 producer1 cnode[615]: Listening on http://127.0.0.1:12798

Hi, do you have enough disk space? How much RAM do you have and what is the RAM usage?

In case RAM is running low, then according to the CoinCashew.com instructions you can change the following setting on the relay nodes.
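If I remember the guide correctly, it's the mempool tracing option in the node's config.json. Rough sketch below; the config path is an assumption based on the guild-operators layout visible in your logs, so double-check it against the guide:

grep TraceMempool /opt/cardano/cnode/files/config.json   # assumed path
#   "TraceMempool": true
# Set it to false to reduce RAM usage on relays, then restart the node:
sudo systemctl restart cnode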

free -m
              total        used        free      shared  buff/cache   available
Mem:           7980         232        7567           8         180        7521
Swap:             0           0           0

Filesystem     1K-blocks     Used Available Use% Mounted on
udev             4076732        0   4076732   0% /dev
tmpfs             817204     8524    808680   2% /run
/dev/sda1       51410408 23952852  25245804  49% /
tmpfs            4086012        0   4086012   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs            4086012        0   4086012   0% /sys/fs/cgroup
/dev/sda15        126710     6148    120562   5% /boot/efi
tmpfs             817200        0    817200   0% /run/user/1000

The 1.29 update seems to have resolved the issue for Relay #1; now doing the producer.
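Rough outline of the post-upgrade check (the download/build step depends on how the node was originally installed, so treat this as a sketch rather than the full procedure):

cardano-node version                # should now report 1.29.0
sudo systemctl restart cnode
journalctl -e -f -u cnode.service   # watch it start and re-sync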

Do you have some monitoring tool configured, like Prometheus or similar, where you can track memory and CPU consumption over time?
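Even without Prometheus set up, the node itself already exposes metrics on the port shown in your logs (12798). Assuming the default hasPrometheus config, something like this should dump the memory-related counters:

curl -s http://127.0.0.1:12798/metrics | grep -i -E 'mem|rts'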

It's a GC server; I can't get the monitoring service to install. I can see the CPU usage.

GC as in "Google Cloud"?

In case you have access to kernel logs, see if you can find info if the kernel killed the process because of low memory.
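For example, something along these lines should show whether the OOM killer fired (exact message wording varies by kernel version):

journalctl -k | grep -i -E 'out of memory|oom|killed process'
dmesg -T | grep -i 'killed process'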

Yes, GCP…

The two GCP servers are 2 cores / 8 GB, and about $55 each a month. I'm going to move them to a much less expensive host where they'll have 6 cores and 16 GB for $12 each a month. That should resolve any possible RAM issues. I checked my second relay that didn't shut down, and it's right at 8 GB usage.
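For anyone comparing numbers, one way to see how much the node process itself is holding (resident memory in kB, plus overall usage):

ps -C cardano-node -o pid,rss,%mem,cmd --sort=-rss
free -h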

Try checking the logs, but I assume the process was killed by the kernel due to low memory.