What happened? Producer and both relays down this morning

This morning I noticed my producer was only connected to 1 relay coming in and none going out. When I checked Relay #1, the node service had been restarting for 2 hours… I restarted Relay #1 and it won't come back up. What are the chances both relays would have issues on the same morning?

journalctl -e -f -u cnode.service
-- Logs begin at Tue 2021-08-31 16:26:48 UTC. --
Aug 31 16:27:02 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:27:04 relay1 cnode[747]: Listening on http://127.0.0.1:12798
Aug 31 16:38:26 relay1 systemd[1]: Stopping Cardano Node...
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: State 'stop-sigterm' timed out. Killing.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 747 (bash) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 817 (cardano-node) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Main process exited, code=killed, status=9/KILL
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Killing process 817 (cardano-node) with signal SIGKILL.
Aug 31 16:38:31 relay1 systemd[1]: cnode.service: Failed with result 'timeout'.
Aug 31 16:38:31 relay1 systemd[1]: Stopped Cardano Node.
Aug 31 16:38:38 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:38:39 relay1 cnode[10270]: Listening on http://127.0.0.1:12798
Aug 31 16:56:21 relay1 cnode[10270]: /opt/cardano/cnode/scripts/cnode.sh: line 57: 10338 Killed cardano-node "${CPU_RUNTIME[@]}" run --topology "${TOPOLOGY}" --config "${CONFIG}" --database-path "${DB_DIR}" --socket-path "${CARDANO_NODE_SOCKET_PATH}" --port ${CNODE_PORT} "${host_addr[@]}"
Aug 31 16:56:21 relay1 systemd[1]: cnode.service: Main process exited, code=exited, status=137/n/a
Aug 31 16:56:21 relay1 systemd[1]: cnode.service: Failed with result 'exit-code'.
Aug 31 16:56:26 relay1 systemd[1]: cnode.service: Service RestartSec=5s expired, scheduling restart.
Aug 31 16:56:26 relay1 systemd[1]: cnode.service: Scheduled restart job, restart counter is at 1.
Aug 31 16:56:26 relay1 systemd[1]: Stopped Cardano Node.
Aug 31 16:56:33 relay1 systemd[1]: Started Cardano Node.
Aug 31 16:56:34 relay1 cnode[4308]: WARN: A prior running Cardano node was not cleanly shutdown, socket file still exists. Cleaning up.
Aug 31 16:56:35 relay1 cnode[4308]: Listening on http://127.0.0.1:12798

EDIT: I haven't changed anything on the servers in months.

Relay #2 seems fine but is not connecting to the producer.

Was there a major change last night? Is 1.27 no longer a working version?

./gLiveView.sh
COULD NOT CONNECT TO A RUNNING INSTANCE, 3 FAILED ATTEMPTS IN A ROW!

After restarting the producer.

sudo systemctl status cnode
● cnode.service - Cardano Node
   Loaded: loaded (/etc/systemd/system/cnode.service; enabled; vendor preset: en
   Active: active (running) since Tue 2021-08-31 17:24:33 UTC; 12min ago
 Main PID: 615 (bash)
    Tasks: 10 (limit: 4915)
   Memory: 97.3M
   CGroup: /system.slice/cnode.service
           ├─615 bash /opt/cardano/cnode/scripts/cnode.sh
           └─683 cardano-node run --topology /opt/cardano/cnode/files/topology.j

Aug 31 17:24:33 producer1 systemd[1]: Started Cardano Node.
Aug 31 17:24:34 producer1 cnode[615]: Listening on http://127.0.0.1:12798

Hi, do you have enough disk space? How much RAM do you have and what is the RAM usage?

In case RAM is running low, then according to the CoinCashew.com instructions you can change the following setting on the relay nodes.
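If I remember the guide correctly, it's the mempool tracing option in the node's config.json. Rough sketch below; the config path is an assumption based on the guild-operators layout visible in your logs, so double-check it against the guide:

grep TraceMempool /opt/cardano/cnode/files/config.json   # assumed path
#   "TraceMempool": true
# Set it to false to reduce RAM usage on relays, then restart the node:
sudo systemctl restart cnode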

free -m
              total        used        free      shared  buff/cache   available
Mem:           7980         232        7567           8         180        7521
Swap:             0           0           0

Filesystem     1K-blocks     Used Available Use% Mounted on
udev             4076732        0   4076732   0% /dev
tmpfs             817204     8524    808680   2% /run
/dev/sda1       51410408 23952852  25245804  49% /
tmpfs            4086012        0   4086012   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs            4086012        0   4086012   0% /sys/fs/cgroup
/dev/sda15        126710     6148    120562   5% /boot/efi
tmpfs             817200        0    817200   0% /run/user/1000

The 1.29 update seems to have resolved the issue for Relay #1; now doing the producer.
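Rough outline of the post-upgrade check (the download/build step depends on how the node was originally installed, so treat this as a sketch rather than the full procedure):

cardano-node version                # should now report 1.29.0
sudo systemctl restart cnode
journalctl -e -f -u cnode.service   # watch it start and re-sync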

Do you have some monitoring tool configured, like Prometheus or similar, where you can track memory and CPU consumption over time?
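Even without Prometheus set up, the node itself already exposes metrics on the port shown in your logs (12798). Assuming the default hasPrometheus config, something like this should dump the memory-related counters:

curl -s http://127.0.0.1:12798/metrics | grep -i -E 'mem|rts'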

It's a GC server; I can't get the monitoring service to install. I can see the CPU usage.

GC as in "Google Cloud"?

In case you have access to kernel logs, see if you can find info if the kernel killed the process because of low memory.
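For example, something along these lines should show whether the OOM killer fired (exact message wording varies by kernel version):

journalctl -k | grep -i -E 'out of memory|oom|killed process'
dmesg -T | grep -i 'killed process'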

Yes, GCP…

The two GCP servers are 2 cores / 8 GB, and about $55 each a month. I'm going to move them to a much less expensive host where they'll have 6 cores and 16 GB for $12 each a month. That should resolve any possible RAM issues. I checked my second relay that didn't shut down, and it's right at 8 GB usage.
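For anyone comparing numbers, one way to see how much the node process itself is holding (resident memory in kB, plus overall usage):

ps -C cardano-node -o pid,rss,%mem,cmd --sort=-rss
free -h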

Try checking the logs, but I assume the process was killed by the kernel due to low memory.