Producer node loses connection to relay

  1. producer node loses connection and throws errors (“ErrorPolicySuspendConsumer”, “closed socket to”, “desired valency 1 current valency 0”, etc; see log below)
  2. when i restart the node it reconnects successfully and works as expected for a while
  3. then (hours later) it loses connection again

what could it be? how to configure nodes so they keep reconnecting automatically?

both nodes are running cardano-node version 1.26.2

producer node logs:

May 13 01:21:37 XXX cardano: \^[[35m[XXX:cardano.node.ChainDB:Notice:34]\^[[0m [2021-05-13 01:21:37.18 UTC] Chain extended, new tip: 089267708a29fc4e4259c71317e8957fe556678602ebad2b143b38fb47c6f06a at slot 29302606
May 13 01:22:44 XXX cardano: \^[[31m[XXX:cardano.node.IpSubscription:Error:57]\^[[0m [2021-05-13 01:22:44.62 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Application Exception: XXX.XXX.XXX.XXX:XXXXX ExceededTimeLimit (KeepAlive) (ServerAgency TokServer)
May 13 01:22:44 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:57]\^[[0m [2021-05-13 01:22:44.62 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Closed socket to XXX.XXX.XXX.XXX:XXXXX
May 13 01:22:44 XXX cardano: \^[[35m[XXX:cardano.node.ErrorPolicy:Notice:50]\^[[0m [2021-05-13 01:22:44.63 UTC] IP XXX.XXX.XXX.XXX:XXXXX ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (KeepAlive) (ServerAgency TokServer))) 20s
May 13 01:22:45 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:52]\^[[0m [2021-05-13 01:22:45.64 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Restarting Subscription after 41091.916323806s desired valency 1 current valency 0
May 13 01:22:45 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:52]\^[[0m [2021-05-13 01:22:45.64 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Trying to connect to XXX.XXX.XXX.XXX:XXXXX
May 13 01:22:45 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:78]\^[[0m [2021-05-13 01:22:45.64 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Connection Attempt Start, destination XXX.XXX.XXX.XXX:XXXXX
May 13 01:22:45 XXX cardano: \^[[35m[XXX:cardano.node.IpSubscription:Notice:52]\^[[0m [2021-05-13 01:22:45.64 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Waiting 0.025s before attempting a new connection
May 13 01:24:00 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:78]\^[[0m [2021-05-13 01:24:00.64 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Closed socket to XXX.XXX.XXX.XXX:XXXXX

mainnet-topology.json on producer:

{
  "Producers": [
    {
      "addr": "XXX.XXX.XXX.XXX",
      "port": XXXXX,
      "valency": 1
    }
  ]
}

mainnet-topology.json on relay:

{
  "Producers": [
    {
      "addr": "YYY.YYY.YYY.YYY",
      "port": YYYYY,
      "valency": 1
    },
    {
      "addr": "relays-new.cardano-mainnet.iohk.io",
      "port": 3001,
      "valency": 2
    }
  ]
}

What is the hw configuration of the servers?
u must use topologyupdater for the relay (out of topic)

@Alexd1985, thanks for prompt response!

both nodes are the same vms: 4 core cpu with 8192 mb ram

(noted about topologyupdater. thanks)

try journal -e -f -u cardano-node.service

oops, forgot to mention it’s bsd. no systemd. sorry.

what should i be looking for in the logs?

strange, try when it happens if u can telnet from the producer to the relay

from producer:

telnet Relay_IP Relay_cnode_port

try to run the topology updater on Relay and check after… if this issue still exist

connection looks good:

# telnet XXX.XXX.XXX.XXX XXXXX
Trying XXX.XXX.XXX.XXX...
Connected to XXX.XXX.XXX.XXX.
Escape character is '^]'.
Connection closed by foreign host.
#

producer node log after restart looks good as well (“ConnectSuccessLast”):

May 13 04:11:16 XXX cardano: \^[[34m[XXX:cardano.node.ChainDB:Info:5]\^[[0m [2021-05-13 04:11:16.60 UTC] Opened lgr db
May 13 04:11:33 XXX cardano: \^[[34m[XXX:cardano.node.ChainDB:Info:5]\^[[0m [2021-05-13 04:11:33.17 UTC] Opened db with immutable tip at dac67fe42995dd85f10b660631f4b3d0fc145ba6ae9d4597b9bab789113f2ee4 at slot 29258724 and tip 089267708a29fc4e4259c71317e8957fe556678602ebad2b143b38fb47c6f06a at slot 29302606
May 13 04:11:33 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:52]\^[[0m [2021-05-13 04:11:33.17 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Trying to connect to XXX.XXX.XXX.XXX:XXXXX
May 13 04:11:33 XXX cardano: \^[[34m[XXX:cardano.node.IpSubscription:Info:55]\^[[0m [2021-05-13 04:11:33.17 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Connection Attempt Start, destination XXX.XXX.XXX.XXX:XXXXX
May 13 04:11:33 XXX cardano: \^[[35m[XXX:cardano.node.IpSubscription:Notice:52]\^[[0m [2021-05-13 04:11:33.17 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Waiting 0.025s before attempting a new connection
May 13 04:11:33 XXX cardano: \^[[35m[XXX:cardano.node.IpSubscription:Notice:55]\^[[0m [2021-05-13 04:11:33.18 UTC] IPs: 0.0.0.0:0 [XXX.XXX.XXX.XXX:XXXXX] Connection Attempt End, destination XXX.XXX.XXX.XXX:XXXXX outcome: ConnectSuccessLast

will try topology updater on relay a bit later. anything else i can check before that?

@Alexd1985, thanks again for such quick replies! appreciate your help.

this was tested when the connection dropped?

no, i can’t catch that moment when connection is lost. that was tested now :slight_smile:

but u said the connection drop and u need to restart the producer in order to reconnect right?
before to restart next time, check the connection

I don’t have any ideea why the producer loss the connectivity… try to add the IOHK relays also in topology files and check if u are losing the connectivity only with your Relay

and one question: are u losing the IN or OUT connection?

correct. i need to restart producer in order to reconnect. before i restart i did check with telnet and it was connecting okay.

i guess that network goes down and then up again somewhere between producer and relay, but producer can’t reconnect automatically.

let me check relay logs as well and see. thanks!