Lost block in Height Battle propagation problem?

anijland · 11 November 2022 13:05

Last night at slotnumber 76559432 at 2022-11-11 00:15:23 UTC we successfully forged a block. In the end, the block was lost in a height battle. I’m just trying to understand, or there is something that I can do about it / to improve.

Nov 11 01:15:23 **01 cardano-node[889]: [**:cardano.node.LeadershipCheck:Info:329] [2022-11-11 00:15:23.00 UTC] {"chainDensity":4.82038e-2,"credentials":"Cardano","delegMapSize":1226925,"
kind":"TraceStartLeadershipCheck","slot":76559432,"utxoSize":8906957}
Nov 11 01:15:23 **01 cardano-node[889]: [**:cardano.node.Forge:Info:329] [2022-11-11 00:15:23.04 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("kind",String "T
raceNodeIsLeader"),("slot",Number 7.6559432e7)]))]
Nov 11 01:15:23 **01 cardano-node[889]: [**:cardano.node.Forge:Info:329] [2022-11-11 00:15:23.04 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("block",String "
bf3ed56d22f5a09efc6515cf6599dbea79ccf29f329ac5bb3b26d08c0cda9429"),("blockNo",Number 7998321.0),("blockPrev",String "1d740ced4a2e44d7e97e7f8e57f83b97600a50b60ad70d73f2a10d0c072664ac"),("kind",String "Trac
eForgedBlock"),("slot",Number 7.6559432e7)]))]
Nov 11 01:15:23 **01 cardano-node[889]: [**:cardano.node.ChainDB:Notice:319] [2022-11-11 00:15:23.04 UTC] Chain extended, new tip: bf3ed56d22f5a09efc6515cf6599dbea79ccf29f329ac5bb3b26d08c
0cda9429 at slot 76559432
Nov 11 01:15:23 **01 cardano-node[889]: [**:cardano.node.Forge:Info:329] [2022-11-11 00:15:23.04 UTC] fromList [("credentials",String "Cardano"),("val",Object (fromList [("blockHash",Stri
ng "bf3ed56d22f5a09efc6515cf6599dbea79ccf29f329ac5bb3b26d08c0cda9429"),("blockSize",Number 871.0),("kind",String "TraceAdoptedBlock"),("slot",Number 7.6559432e7)]))]

I also looked in the logs of my relays, I couldn’t find a trace so far of my BP producing the block and that it was fetched by the relay. Is that supposed to be logged?

The strange thing is that there has only been 1 node reporting my block. Is this normal? I do have incoming connections on my relays (around 30) and also 14.787 served blocks in the last 18 hours. My block producer node also had 2 incoming connections from my relays at the time of producing the block.

The block that in the end has won the height battle has been produced at 00:15:42 which is much later. Block 7998321 - Cardanoscan

What could be the reason that “Nodes reporting” is equal to 1 for the block that I produced?

I have in my Grafana dashboard a chart with connected peers, but no incoming connections chart. Have been looking for it in the available statistics to display it, but couldn’t find it. Anyone knows or it is possible to display this?

Hopefully somebody will be able to help me to get this cleared up, thanks!

Terminada · 11 November 2022 14:02

Here is what I can tell:

Your logs show that your block producer did produce your block for slot 76559432 with block hash bf3ed56d which was built upon previous block hash 1d740ced. Your block producer accepted it as a valid block and extended its tip with it.

On my own relay, I see no record of receiving your block. My relay logs show the following:

[relay1:cardano.node.ChainDB:Notice:207] [2022-11-11 00:14:49.62 UTC] Chain extended, new tip: 1d740ced4a2e44d7e97e7f8e57f83b97600a50b60ad70d73f2a10d0c072664ac at slot 76559398
[relay1:cardano.node.BlockFetchDecision:Info:217] [2022-11-11 00:15:42.52 UTC] [TraceLabelPeer (ConnectionId {localAddress = 192.168.27.7:2700, remoteAddress = 54.238.92.29:3001}) (Right [At (Block {blockPoin>
[relay1:cardano.node.BlockFetchDecision:Info:217] [2022-11-11 00:15:42.97 UTC] before next, messages elided = 8
[relay1:cardano.node.BlockFetchDecision:Info:217] [2022-11-11 00:15:42.97 UTC] [TraceLabelPeer (ConnectionId {localAddress = 192.168.27.7:2700, remoteAddress = 5.161.70.137:5001}) (Left (FetchDeclineConcurren>
[relay1:cardano.node.ChainDB:Info:207] [2022-11-11 00:15:43.01 UTC] Valid candidate 06c5f327bb372c7cbdca37041fa0e1f66fa2beb97b02b2d33fc31d0f175e384e at slot 76559451
[relay1:cardano.node.ChainDB:Notice:207] [2022-11-11 00:15:43.01 UTC] Chain extended, new tip: 06c5f327bb372c7cbdca37041fa0e1f66fa2beb97b02b2d33fc31d0f175e384e at slot 76559451

Note that my relay received block for slot 76559398 and then the next block was for slot 76559451.

My cncli database shows the following blocks between slots 76559398 and 76559506:

sqlite3 /var/lib/cardano/cncli.db "select id, block_number, slot_number, pool_id, substr(hash,1,8), substr(prev_hash,1,8), substr(block_vrf_0,1,8), orphaned from chain where slot_number between 76559398 and 76559506;"

ID | blockNo | slotNo, poolId | blockHash | prevHash | blockVrf | orphaned
3518463|7998320|76559398|7d7ac07a2f2a25b7a4db868a40720621c4939cf6aefbb9a11464f1a6|1d740ced|12af1641|d8da1377|0
3518464|7998321|76559451|f6541f3ca941d96d3c243eae01fea874d6be403b2dd2c6575245de6e|06c5f327|1d740ced|ba0b451e|0
3518465|7998322|76559506|7e00129d906afbba06a6b279fcf6e4295986a1104b7f5c4046c653bc|e11a0192|06c5f327|bb739482|0

Again this shows block for slot 76559398 and then slot 76559451.

Note that block 06c5f327 (slot 76559451) was built upon previous hash 1d740ced which was the same previous hash that your block built upon. Assuming that this block producer was running an unmodified cardano-node then it clearly didn’t receive your block for slot 76559432 before it built its own block.

Your pooltool “Height battle” picture indicates that only one node reported your block to pooltool. Presumably this was your own block producer.

In other words, it seems that the rest of the Cardano network of relays did not receive your block for slot 76559432. Therefore you likely have a connection problem somewhere in the chain between your block producer to your relays to the rest of the Cardano network.

Look in the logs of your relay (like I did on mine) to see if your own relay received your block from your block producer.

Then also check your relay incoming connections using the ss tool. Eg:

ss -4Htnp state established | grep -Po '[0-9\.]+:3001\s+[0-9\.:]+'

This will give you all the TCP connections that are established to your cardano-node process (on port 3001 or change to your value).

If you are running in P2P mode you can also see how many incoming connections you have with:

curl -s -H 'Accept: application/json' http:/localhost:12788 | jq '.cardano.node.metrics.connectionManager'

and get output like:

{
  "duplexConns": {
    "type": "g",
    "val": 3
  },
  "incomingConns": {
    "type": "g",
    "val": 15
  },
  "outgoingConns": {
    "type": "g",
    "val": 50
  },
  "prunableConns": {
    "type": "g",
    "val": 1
  },
  "unidirectionalConns": {
    "type": "g",
    "val": 61
  }
}

Hope that helps with your troubleshooting. Good luck.

Terminada · 11 November 2022 14:10

Also, here is a post I wrote recently about troubleshooting a slot or height battle: Troubleshooting why my block got orphaned

georgem1976 · 13 November 2022 06:59

If you are sure that your BP has In connections from your relays, and your relays have In connections from other relays in the network, then most likely the time is not in sync on your nodes. Are you running chrony (or another ntp daemon, but chrony seems to be the best) on ALL your nodes?
If not, do it now. If yes, check if it is working correctly (run chronyc tracking on all nodes and try to understand what the result means).

anijland · 14 November 2022 05:05

Thanks for full en detailed response!

I’m monitoring my setup with Grafana and Prometheus, for the peers I’m using the statistic ‘cardano_node_metrics_connectedPeers_int’. During the scheduled block according to the statistic exported from Cardano all should be connected. See below: (times are in UTC)

Strange thing is though that I do not see any record of the block arriving at one of my relays, so there must be something wrong.

Also running Chrony and monitoring the time offsets with Grafana. During the specified time also no strange offsets detected.

Screenshot 2022-11-13 at 19.15.312730×696 147 KB

Hmm in the meantime with some more research I noticed that unfortunately my block producer ip disappeared from the topology files on my relays. A mistake I made while refactoring the relay topology pull script Have added it back now so all should be working again!

Anyone knows how to monitor connections out with Grafana and Prometheus? As the peers are not displaying the connections out and the protocol works on a pull basis, it is also really important to know.

Again thanks for the help!

Terminada · 14 November 2022 12:35

You can possibly create some eye-candy using Grafana and Prometheus but the underlying data will likely need to come from using the ss tool. If you look at the code in gLiveView it also uses the ss tool. You don’t need the eye-candy and using it is arguably detrimental to your understanding in the long run.

I recommend just using the ss tool directly yourself for a better understanding.

Start with this:

Here I have assumed that your node is listening on port 3001. Any incoming connections from other relays will connect to your node port (3001).

Here is the output of the above command (using port 2700) on my block producer since it is listening on port 2700 instead of 3001.

192.168.27.9:2700  144.126.157.46:41729
192.168.27.9:2700  144.126.157.46:2700
192.168.27.9:2700    192.168.28.1:2700
192.168.27.9:2700    192.168.28.1:43161
127.0.0.1:2700       127.0.0.1:46936
192.168.27.9:2700    192.168.27.7:2700
192.168.27.9:2700    192.168.27.8:2700

These are all the established TCP connections to my cardano-node on port 2700. The connection to 127.0.0.1:2700 is cncli running on the same machine. All the other connections are to/from my relays.

Note that my block producer is running in P2P mode.

If we just consider the 2 connections to/from IP address 144.126.157.46. The connection on port 41729 is an incoming connection which was set up from 144.126.157.46 (relay3.terminada.io) in to my block producer. Whereas the connection on port 2700 is an outgoing connection set up by my block producer out to relay3.

How do I know this? I know this because my relay3 is running in normal mode so it sets up connections from random high ports. Whereas my block producer is running in P2P mode so all its connections are set up using port 2700.

If you are running normal mode then it will be easier for you to figure out which way the connections were set up. This is because all nodes running in normal mode set up their connections using random high ports. Therefore running this ss command on your block producer (assuming it listens on port 3001):

ss -4Htnp state established | grep -Po '[0-9\.]+:3001\s+[0-9\.:]+'

Will give you all the incoming connections set up by your relays into your block producer. You should see output like this:

bpIP:3001  relay1IP:highport
bpIP:3001  relay2IP:highport

To see the TCP connections that were set up by your block producer out to your relays then you need to look for the TCP connections that go to your relay listening ports. Again assuming all your relays are listening on port 3001 then you could use a command like this:

ss -4Htnp state established | grep -Po '[0-9\.]+:\d+\s+[0-9\.]+:3001'

Or you can just run the ss command without any grep to filter:

ss -4Htnp state established

Which on my block producer results in:

0         0            192.168.27.9:2700         144.126.157.46:41729     users:(("cardano-node",pid=683451,fd=42))                 
0         0            192.168.27.9:2700         144.126.157.46:2700      users:(("cardano-node",pid=683451,fd=45))                 
0         0            192.168.27.9:2700           192.168.28.1:2700      users:(("cardano-node",pid=683451,fd=63))                 
0         0            192.168.27.9:22              172.16.0.45:34088     users:(("sshd",pid=724219,fd=4))                          
0         0            192.168.27.9:2700           192.168.28.1:43161     users:(("cardano-node",pid=683451,fd=38))                 
0         0               127.0.0.1:46936             127.0.0.1:2700      users:(("cncli",pid=474,fd=10),("cncli",pid=474,fd=9))    
0         0               127.0.0.1:2700              127.0.0.1:46936     users:(("cardano-node",pid=683451,fd=41))                 
0         0            192.168.27.9:12798           172.16.0.45:42146     users:(("cardano-node",pid=683451,fd=28))                 
0         0            192.168.27.9:2700           192.168.27.7:2700      users:(("cardano-node",pid=683451,fd=88))                 
0         0            192.168.27.9:2700           192.168.27.8:2700      users:(("cardano-node",pid=683451,fd=52))                 
0         0            192.168.27.9:22              172.16.0.45:57854     users:(("sshd",pid=494,fd=4))

Which then also gives you details about the processes.

I recommend experimenting with the ss tool and taking a look at its man page. You will be glad you did because then you will have a much better understanding. Note that the ss tool doesn’t tell you which way the TCP connection was set up. This is why you need to apply reasoning by looking at the ports involved to figure out how the connection must have been set up.

The eye-candy tools like gLiveView are just using this same ss tool.

Now that you understand better how things work, you can create your own script that runs every minute via a cron job. This script can grab data using the ss tool and export this information using something like prometheus node exporter so that you can have your eye-candy in Grafana.

What I do personally is run a command like this:

printf "%0.s-" {1..70}; \
echo; startsec=$(curl -s -H 'Accept: application/json' http://localhost:12788 | jq '.cardano.node.metrics.nodeStartTime.int.val'); \
startdate="$(date -d @${startsec})"; nowsec=$(date '+%s'); nowdate="$(date)"; runhrs=$(( (nowsec - startsec) / 3600 )); runmins=$(( (nowsec - startsec) % 3600 / 60 )); \
rtsconf="$(ps aux | grep -Po "cardano-node\s.*\+RTS\s.*\-RTS")"; \
missedslots=$(curl -s -H 'Accept: application/json' http:/localhost:12788 | jq '.cardano.node.metrics.slotsMissedNum.int.val'); \
echo "Node Started: ${startdate} (Running: ${runhrs} hrs ${runmins} mins)"; \
echo "RTS settings: ${rtsconf}"; \
echo "Missed slots: ${missedslots}"; \
echo "Memory use:"; free; \
echo; ss -4Htnp state established | grep -Po '192\.168\.27\.9:2700\s+(192\.168\.\d+\.\d+|144\.126\.157\.46):\d+'

The user bash history then has this command saved and I just hit the up arrow to scroll back through my history to repeat this same command as needed. Or you can just save it as a document note somewhere for future copy - paste ability. If you read through the above mini script you can see what it is doing. Then modify it for your needs. But note that my bp is running in P2P mode which provides more useful data via cardano.node.metrics. Experiment with the curl command then pull out the data you want using jq.

The real take-home message, for anyone reading, is to find your own way. Tear apart other people’s scripts and steal ideas. Ensure you understand how things are working. Then build your own scripts to produce the data you want. Don’t trust the scripts created by others until you understand what they do.

Beware that some eye-candy tools can inadvertently open up security holes in your system if you are not careful.

Topic		Replies	Views
How to figure out when pool wins slot battles or causes height battles? Operate a Stake Pool	16	2351	7 January 2022
Understand why i lose a block Operate a Stake Pool	1	272	1 February 2024
Slot battles: what do they look like & what can be done about them? Operate a Stake Pool ouroboros , stake-pools	7	3565	10 January 2021
Adopted block is not shown on the chain - why? Operate a Stake Pool	1	404	1 August 2022
cardano.node.Forge:Error Operate a Stake Pool	11	2206	9 November 2022

Lost block in Height Battle propagation problem?

Also running Chrony and monitoring the time offsets with Grafana. During the specified time also no strange offsets detected. Screenshot 2022-11-13 at 19.15.312730×696 147 KB

Related topics

Also running Chrony and monitoring the time offsets with Grafana. During the specified time also no strange offsets detected.

Screenshot 2022-11-13 at 19.15.312730×696 147 KB