I missed a block today. Please help me understand the logs.

Sorry, but I restarted it before this post.

This is probably the last option.
The biggest problem is not that I lost the block, but that I do not know why :cry:

Do you have Grafana? If you do, you can check those metrics there.

Thank you for the quick replies. :slight_smile:

If the producer was not aware of that block, it can be an error generated by cncli (this happens when you are using the wrong VRF files, or perhaps the cncli db is corrupted). Can you run the search again with cncli and check the status for that slot?
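In case it helps, roughly what re-checking with cncli looks like (a sketch, not from the original post: exact flag names and required options vary by cncli version, so check cncli leaderlog --help; the paths, pool id and genesis files are placeholders, the output field names may also differ, and the slot number is only an example):

cncli sync --host 127.0.0.1 --port 3001 --db ./cncli.db --no-service
cncli leaderlog \
  --db ./cncli.db \
  --byron-genesis byron-genesis.json \
  --shelley-genesis shelley-genesis.json \
  --pool-id <POOL_ID> \
  --pool-vrf-skey vrf.skey \
  --ledger-set current > leaderlog.json
# Look for the slot you expected to mint (49778704 is only an example value)
jq '.assignedSlots[] | select(.slot == 49778704)' leaderlog.json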

I made a VM clone to continue the research. I will post the findings (if any).

Also, the rotating KES keys should not be expired; I am not sure this was covered in the replies.

I don’t think the problem has anything to do with cncli. The problem is most likely solely related to cardano-node and something to do with the block producer keys.

The important part in the logs is:

If you are the slot leader, the next log line should be something like:

Jan 05 04:04:00 node1 cardano-node[151911]: [node1:cardano.node.Forge:Info:174] [2022-01-05 04:04:00.00 UTC] fromList [("val",Object (fromList [("kind",String "TraceNodeIsLeader"),("slot",Number 4.9778704e7)])),("credentials",String "Cardano")]

The important part is: “TraceNodeIsLeader”

If your logs don’t say that, then the node didn’t determine that it was the leader. This indicates some problem related to the leadership calculation. As I understand it, the leadership calculation depends on the keys, the stake, and the nonce values. The stake and nonce values are on-chain, whereas the keys are only on your system. Therefore, the problem most likely relates to your keys and whether your cardano-node loaded them.
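As a rough sanity check (my own back-of-the-envelope sketch, not something the node prints): the chance that a pool is elected leader for any given slot is p = 1 - (1 - f)^sigma, where f is the active slot coefficient (0.05 on mainnet) and sigma is the pool's share of the active stake. For example:

# Illustrative values only, not taken from this thread
f=0.05
sigma=0.00002   # the pool's fraction of active stake
awk -v f="$f" -v s="$sigma" 'BEGIN { printf "p per slot = %.8f\n", 1 - (1 - f)^s }'

Small pools therefore get very few scheduled slots, which makes it all the more important that the keys are loaded correctly for the ones they do get.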

Do the following:

  1. Determine the start time of your cardano-node with:
startsec=$(curl -s -H 'Accept: application/json' http://localhost:12788 | jq '.cardano.node.metrics.nodeStartTime.int.val'); startdate="$(date -d @${startsec})"; echo $startdate
  2. Subtract 1-2 seconds from the printed startdate value and use this as the starting point for checking your logs, to see exactly what configuration the cardano-node process logged at startup. E.g.:
journalctl --since "2022-01-07 08:48:24" | head -n 10
  3. Review the details of the configuration log, which will start with “Node configuration:” and look something like:
Jan 07 08:48:24 node1 cardano-node[166454]: Node configuration: NodeConfiguration {ncNodeIPv4Addr = Just 0.0.0.0, ncNodeIPv6Addr = Nothing, ncNodePortNumber = Just 3000, ncConfigFile = "/opt/cardano/etc/mainnet-config.json", ncTopologyFile = "/opt/cardano/etc/mainnet-topology.json", ncDatabaseFile = "/opt/cardano/db", ncProtocolFiles = ProtocolFilepaths {byronCertFile = Nothing, byronKeyFile = Nothing, shelleyKESFile = Just "/opt/cardano/priv/kes.skey", shelleyVRFFile = Just "/opt/cardano/priv/vrf.skey", shelleyCertFile = Just "/opt/cardano/priv/node.cert", shelleyBulkCredsFile = Nothing}, ncValidateDB = False, ncShutdownIPC = Nothing, ncShutdownOnSlotSynced = NoMaxSlotNo, ... snip ...

The important bit related to your keys is:

shelleyKESFile = Just "/opt/cardano/priv/kes.skey", shelleyVRFFile = Just "/opt/cardano/priv/vrf.skey", shelleyCertFile = Just "/opt/cardano/priv/node.cert"

Note that these details are for my block producer’s keys, which I have stored in “/opt/cardano/priv/”.

  4. Check that your keys are stored where your node is looking for them and that they are the correct keys (a combined sketch of these steps follows below).
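Putting the steps together, a rough sketch (the paths and the EKG port are the ones from my example above, so adjust them to your setup):

# Step 1: node start time from EKG, minus a couple of seconds
startsec=$(curl -s -H 'Accept: application/json' http://localhost:12788 | jq '.cardano.node.metrics.nodeStartTime.int.val')
startdate=$(date -d @"$((startsec - 2))" '+%Y-%m-%d %H:%M:%S')

# Steps 2-3: show the configuration the node logged at startup
journalctl --since "$startdate" | grep -m 1 "Node configuration:"

# Step 4: confirm the key files the node points at actually exist
ls -l /opt/cardano/priv/kes.skey /opt/cardano/priv/vrf.skey /opt/cardano/priv/node.cert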

Good luck.


Hello, I run a small pool (600k ADA) and had a block in the leaderlog which I didn’t mint. According to my research, it was my first orphaned block :neutral_face: which is something I cannot avoid as the SPO, but I’m not entirely sure, so I’m asking some of you experts here to help me understand it.
I had in my leaderlog:
Slot: 74388345, UTC Time: 2022-10-16 21:10:36
GLiveView showed Leader: 1 and Adopted: 1 shortly after this time, so I thought I had minted this block. I had zero missed slots for hours before and after this slot number, so it couldn’t be due to missed slots.

Then I ran the VRF hash check as suggested in this thread and made sure it matched the “Vrf Hash” displayed on cardanoscan.io for my pool, and that the vrf.skey file was also the file I used for the node startup, which was all good:
cardano-cli key verification-key --signing-key-file vrf.skey --verification-key-file /tmp/vrf.vkey
cardano-cli node key-hash-VRF --verification-key-file /tmp/vrf.vkey
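(If you prefer to cross-check against the chain directly instead of cardanoscan.io, something along these lines should also work; the pool id is a placeholder and the exact field names in the output may differ between cardano-cli versions:

cardano-cli query pool-params --mainnet --stake-pool-id <POOL_ID> | grep -i vrf

The hash it reports should match the key-hash-VRF output above.)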

Checking the logs for this slot (cat /var/log/syslog | grep 74388345), I found a “Switched to a fork” at that slot:
Oct 16 23:10:36 SERVERNAME cardano-node[13707]: #033[35m[SERVERNAME:cardano.node.ChainDB:Notice:226]#033[0m [2022-10-16 21:10:36.11 UTC] Chain extended, new tip: 6c5b8d901cf9263981537d9b1396e8684b929a908d6909a2007b9975f1b685b8 at slot 74388345
Oct 16 23:10:36 SERVERNAME cardano-node[13707]: #033[35m[SERVERNAME:cardano.node.ChainDB:Notice:226]#033[0m [2022-10-16 21:10:36.57 UTC] Switched to a fork, new tip: 17dadc10f9c3dc51622352eb45e17d1206a81f10045c54bb6dd498fb28a6878a at slot 74388345

Is this already the explanation that this block was “orphaned”? Since I see “Switched to a fork” for about 1.4% of the slots, is there roughly a 1.4% chance of a block being missed because another pool had the same slot allocated? So with my 6 blocks in total, I already hit that 1.4%?
cat /var/log/syslog | grep "Switched to a fork" | wc -l # returns 57
cat /var/log/syslog | grep "Chain extended" | wc -l # returns 4049 → 57/4049 ≈ 1.4%
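The same ratio as a one-liner, for reference (same syslog path and grep patterns as above):

forks=$(grep -c "Switched to a fork" /var/log/syslog)
extended=$(grep -c "Chain extended" /var/log/syslog)
awk -v f="$forks" -v e="$extended" 'BEGIN { printf "%d/%d = %.1f%%\n", f, e, 100*f/e }'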
(Unfortunately I don’t have anything else relevant in the logs for this block; maybe someone can suggest good logging options for the block producer?)

The next thing I thought of was to check whether this correlates with a slot battle. What is the correct way to do that? Go to cexplorer.io, enter the slot ID, find the “epoch slot” id, then enter this on pooltool.io to find the slot battle image?

Thanks a lot for help/hints!

It was a slot battle: two pools were slot leaders for the same slot, and unfortunately your pool lost. In the past your pool would usually have won such a slot battle, because it is a very small pool and its leader VRF value would generally be lower than the other pool’s; slot battles used to be decided by the lower leader VRF value. However, the code was changed recently to make slot battles effectively random: they are now decided by the entire block VRF value.

See here for more explanation: Leader VRF value no longer settling ties · Issue #4051 · input-output-hk/ouroboros-network · GitHub


Interesting to see the different approaches to debugging a missed block. I am attempting to document such things in one thread.

If your VRF is correct, then it’s probable that the slot calculation was incorrect. I had this happen once before.

The solution was to fix a corrupted cncli db.
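In case it helps, a minimal way to check for that kind of corruption (the db path is an assumption; cncli keeps its chain data in a sqlite file):

sqlite3 /opt/cardano/cncli/cncli.db 'PRAGMA integrity_check;'

If it reports anything other than “ok”, stopping cncli, deleting the db, and re-running cncli sync rebuilds it from the chain.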

I don’t think there is much to debug. The logs tell the story. Both blocks were for the same slot 74388345. The second block arrived 460ms after the first at 21:10:36.570 UTC and the node immediately preferred this second block since it switched its tip to this block. So this is not a height battle resolved according to longest chain but rather a simple slot battle. As far as I am aware, the only reason your own node would do this (to immediately prefer the block that it had not produced for the same slot) is if the second block had a lower “block” VRF score.


Thanks for the answers!

What are the recommended logging options for the block producer to debug produced blocks without harming performance? What are you using as minSeverity?

Note: Lowering the minimal severity harms efficiency, so while using a low severity can be very helpful for debugging a problem, it should be used with care during normal operation.

(https://github.com/input-output-hk/cardano-node/blob/master/doc/logging-monitoring/logging.md)

I just have the default settings in my mainnet-config.json:

  "TracingVerbosity": "NormalVerbosity",
  "minSeverity": "Info",

What you had in your journal logs was already enough to see why your block got orphaned.

It is unfortunate. Your pool would most likely have won that slot battle prior to the 1.35.x software release, because SECUR is a much bigger, multi-pool operator with 5 separate pools. I really miss the times when small pools were almost guaranteed to win such battles.
