Missed block reported but no "TraceNodeIsLeader" in log

Hey fellow Cardano enthusiasts,

Something happened in my stake pool this past epoch 307 that I don’t understand, and I’d be grateful if someone could help me understand and/or diagnose it.

According to gLiveView I had a leader slot assigned. Both my relays and my producer were available, not only at the slot time but during the entire epoch, and still, around a minute after the slot, to my surprise, gLiveView increased the missed counter.

I grepped the log file for “IsLeader” and no match was found. I then manually searched for the slot number in the log file, and there were two entries for it: one with “TraceStartLeadershipCheck” and another with “TraceNodeNotLeader”.
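For reference, the search can be reproduced with something like the following. The log path and exact JSON shape vary per setup, so this runs against a tiny hand-made sample log (the slot number is made up) just to show the two checks:

```shell
# Stand-in for the node's JSON log (real path differs per setup,
# e.g. somewhere under /opt/cardano/cnode/logs/).
cat > /tmp/sample-node.log <<'EOF'
{"data":{"kind":"TraceStartLeadershipCheck","slot":46821345}}
{"data":{"kind":"TraceNodeNotLeader","slot":46821345}}
EOF

# 1) Count leadership claims; 0 means the node never believed it was leader.
grep -c "TraceNodeIsLeader" /tmp/sample-node.log || true

# 2) Show every trace entry for the slot in question.
grep "46821345" /tmp/sample-node.log
```

On a healthy leader slot the first command should report at least 1 and the second should include a “TraceNodeIsLeader” line.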

I then checked with cncli sync + leaderlog (both with -db /opt/cardano/cnode/guild-db/cncli/cncli.db), and it did indeed return that I was the leader for that same slot.

Then I tried the exact same check with cncli, but without specifying a db in either command (the only difference), so a new db was built in the current directory. In this case, the output stated that I wasn’t the leader for any slot during this same epoch 307.

Is there any explanation for this? Can anyone shed some light?

My setup:
VM with 16 GB RAM and 6 cores
cardano-node 1.31.0 - linux-x86_64 - ghc-8.10 git rev 2cbe363874d0261bc62f52185cf23ed492cf4859
cncli 4.0.1
gLiveView 1.24.0

Thanks in advance!

go to cntools → blocks → summary → press enter … what is the status in the log? (ghosted, stolen, etc.)


It is counted as “Missed” in cntools / gLiveView

And did u check the logs, did u find anything inside?

Yes, there isn’t any obvious error in the logs. But, as I wrote in my original post, after locating the slot in the log I see a “TraceNodeNotLeader” entry for it, whereas, as I understand it, when you are the leader for a slot the log should read “TraceNodeIsLeader”, followed by other messages related to the block processing.

As it seems that this log is written directly by the node, I can only conclude that the node did not consider itself leader for that slot. Then the question is: why did cncli and gLiveView (which I guess uses the same source) calculate that this same producer was going to be the leader for the slot? How reliable is this calculation?

The next thing I don’t understand is how rebuilding the database in another location, with the same parameters, produces a different output (without the leader assignment for that same slot). Is the original database, the one synced in the background, maybe corrupted?

And how can cntools consider a slot missed without a “TraceNodeIsLeader” entry? I’m not sure if this situation represents a valid use case or if cntools is just lacking some validation.

Thanks for the interest,
M.

could be an issue if u are using a wrong VRF file… do u remember if u changed anything recently on the server?

I added a relay in the topology file of the BP 8 hours before the slot, but as I understand it, this is completely unrelated.

M.

then the only explanation is that cncli provided wrong information about the slot leader

ok, and how is this possible?

I see the output of cncli leaderlog depends on:

  • pool id string
  • vrf-skey file
  • db file (from cncli sync)
  • byron genesis file
  • shelley genesis file
  • pool stake (from stake snapshot using cardano-cli)
  • active stake (from stake snapshot using cardano-cli)
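Putting those inputs together, the check I ran looks roughly like this. All paths and values are placeholders for my setup, and the stake figures come from cardano-cli’s stake snapshot query:

```shell
# Placeholder path; substitute your own locations and values.
DB=/opt/cardano/cnode/guild-db/cncli/cncli.db

# Keep the chain db up to date, pointing at the local block producer.
cncli sync --host 127.0.0.1 --port 6000 --db "$DB" --no-service

# Compute the leader schedule for the current epoch.
cncli leaderlog \
  --db "$DB" \
  --pool-id <your-pool-id-hex> \
  --pool-vrf-skey /path/to/vrf.skey \
  --byron-genesis /path/to/byron-genesis.json \
  --shelley-genesis /path/to/shelley-genesis.json \
  --pool-stake <pool-stake-from-snapshot> \
  --active-stake <active-stake-from-snapshot> \
  --ledger-set current
```

Running the same leaderlog command twice, changing only --db, is the comparison described above.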

If the result is different when only the db changes (already existent vs freshly synced), that means those dbs are different. If I understand correctly, they should be identical when equally up to date. I’ll try to dump and diff…
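A sketch of the dump-and-diff idea: cncli.db is a SQLite database, so two copies can be compared by diffing their SQL dumps. Illustrated here with two throwaway databases so the commands are self-contained; in practice you would point sqlite3 at the two cncli.db files:

```shell
# Create two small stand-in databases that differ in one row.
rm -f /tmp/dba.db /tmp/dbb.db
sqlite3 /tmp/dba.db "CREATE TABLE chain(slot INT, hash TEXT); INSERT INTO chain VALUES (1,'aaaa');"
sqlite3 /tmp/dbb.db "CREATE TABLE chain(slot INT, hash TEXT); INSERT INTO chain VALUES (1,'bbbb');"

# An empty diff means the databases hold identical data;
# any output pinpoints where they diverge.
diff <(sqlite3 /tmp/dba.db .dump) <(sqlite3 /tmp/dbb.db .dump) || true
```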

In any case, is it safe to delete /opt/cardano/cnode/guild-db/cncli/cncli.db? I guess it’ll be recreated by the syncing process, right?

I haven’t rotated the vrf key as it doesn’t expire until January. Do you think I should do it anyway, just in case it has something to do with this?

Cheers!
M.

Yes, but u can rename the file just in case… use mv cncli.db cncli.db_bkp

I found another similar thread (you were there too, nice!):

I will try renaming the db file (I’ll keep the old one for further analysis).

In any case, nonces are different when querying both databases:

  • the already existing one: “epoch”: 307, “epochNonce”: “c40ada27c0b0e4fb706bac418ba02828813d5e415de809a277c06a0a905e7ac1”
  • the newly synced one: “epoch”: 307, “epochNonce”: “c9df45e52ba90cd0fd84a2150ab3764e36de44193f98fafda38c79f90e825a7b”

I think they are calculated by hashing some blocks from the previous epoch, so this would mean the data in one of the databases is definitely corrupted, probably the existing one. I’m wondering how cncli can produce an apparently valid result from a corrupted or incomplete database.

Could someone confirm the correct nonce for epoch 307?

Cheers!

M.


But run the script ./cncli.sh from the scripts folder

One common cause for missing blocks is using a wrong VRF key.

You wrote “I haven’t rotated the vrf key as it doesn’t expire until January”. This is inaccurate: the VRF key never expires. Did you actually mean “KES” instead of “VRF”?

You should check whether the VRF key you are currently using is the correct one (the one you registered the stake pool with). Compare the hash of the VRF verification key derived from the VRF signing key set in your Block Producer’s startup file with the hash registered on a blockchain explorer.
To generate the hash locally:
cardano-cli node key-hash-VRF --verification-key-file vrf.vkey --out-file vrf.hash
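If only the signing key is at hand (it is the one referenced by the Block Producer’s startup file), the verification key can first be derived from it and then hashed. A sketch, with example file names (vrf.skey, vrf-check.vkey, vrf-check.hash are just illustrative):

```shell
# Derive the verification key from the signing key the node actually loads.
cardano-cli key verification-key \
  --signing-key-file vrf.skey \
  --verification-key-file vrf-check.vkey

# Hash it as above.
cardano-cli node key-hash-VRF \
  --verification-key-file vrf-check.vkey \
  --out-file vrf-check.hash

# Compare vrf-check.hash with the pool's registered VRF key hash on an explorer.
```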

I have seen a few cases of SPOs changing the VRF keys when rotating the KES keys and operational certificate, and this causes exactly the problem you are experiencing.

If this is the case (wrong VRF keys), search for a backup of the VRF key you used when you registered the stake pool, and compare the hashes to be 100% sure. If you don’t find it in a backup, then you should register the stake pool again with the VRF key you currently have (and back it up this time), but there will be a delay of 2 epochs (I am not 100% sure it is exactly 2) until the new one becomes active.

I hope you will be able to resolve the problem.

Thanks for the clarification, you’re completely right.

In this case the error was not the VRF key, because the same one was used to successfully produce a block before. Also, there was evident corruption of the cncli database, because the cncli nonce command was returning different values for the db that was synced in the background and for a newly synced one in another directory.

I deleted the db at /opt/cardano/cnode/guild-db/cncli/cncli.db and it got recreated by the background sync process. Now both dbs return the same nonce, so hopefully the problem is solved; we’ll see in the next epochs.

Thanks to all who helped!
M.