This is not a tutorial, but I hope it will be a good exercise for those who’ve analysed these things before (with my gratitude for that help), and a learning experience for SPOs who worry about the same thing.
For all the documentation effort, this may simply be a run-of-the-mill illustration of the “slot battle” I’ve heard of but never knowingly experienced (we might have been on the winning side before without realising it).
We’ve just “lost” our first block and I’m trying to figure out what happened. The long block hex IDs may make this hard to read, but I wanted them to be findable in blockchain explorers. (This post is not all log messages.)
First, our core node shows we’ve produced a block (89*) in exactly the slot expected in our leadership schedule, and that it extends our copy of the chain with the new block:
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.00 UTC] Leading slot 11171885
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Forged block in slot 11171885
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("kind",String "OutcomeTraceForgeEvent"),("difftime",Number 5.74117e8),("slot",Number 1.1171885e7),("mempoolbytes",Number 0.0),("mempoolnumtx",Number 0.0)]
[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:02:56.01 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[core-nyc:cardano.node:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("time(ps)",Number 1.788674423921033e18),("kind",String "MeasureTxsTimeStop"),("slot",Number 1.1171885e7)]
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Adopted block forged in slot 11171885: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4, TxIds: []
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
… but then when the next block/slot comes along (block 56*) about 20 seconds later, it switches to another fork at the moment it takes on that block… in light of the messages seen below on our relay, rejecting our own block in favour of a block we didn’t produce (9d*) when taking that fork:
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.54 UTC] Valid candidate 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:03:14.54 UTC] Switched to a fork, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
On our local relay, with a gigabit connection to the block producer, we can see that it drops our block (89*) in favour of the “other guy’s” block (9d*) about 100 milliseconds after it receives the block from our own core:
[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.02 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.13 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.14 UTC] Valid candidate 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.14 UTC] Switched to a fork, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
At this point I’d guess our relay, and perhaps many other nodes out there, have seen two blocks produced in the same slot within a fraction of a second of each other.
- What I don’t understand: how our own relay, having already confirmed that our own block is genuine, can immediately switch forks in favour of the “other guy’s” block.
Meanwhile, about 100 milliseconds later (apparently: see the time sync question below), on our “remote” relay (on the other side of the world from the core node, but with great network connectivity):
[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:02:56.27 UTC] Chain extended, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-sg:cardano.node.ChainDB:Info:40] [2020-10-15 05:02:56.50 UTC] Block fits onto some fork: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:03:14.27 UTC] Chain extended, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
In this case it receives the “other guy’s” block (9d*) about 200 milliseconds before the one we produced (89*), and snubs our block in favour of theirs, continuing on the same fork that our core node also selected in the next slot (tip 56*).
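(In case anyone wants to sanity-check the millisecond figures I’m quoting, here’s roughly how I’m reading them: just parse the bracketed UTC timestamps out of the log lines. This is only a sketch, with two of the relay-ny lines pasted in by hand and the hashes and logger names truncated.)

```python
# Measure the gap between two cardano-node log lines by parsing the bracketed
# UTC timestamps. Only a sketch: lines are pasted in by hand, hashes truncated.
import re
from datetime import datetime

TS_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC\]")

def parse_ts(line):
    """Return the timestamp embedded in a node log line, or None if absent."""
    m = TS_RE.search(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None

def delta_ms(line_a, line_b):
    """Milliseconds between two log lines (b minus a)."""
    return (parse_ts(line_b) - parse_ts(line_a)).total_seconds() * 1000.0

a = "[relay-ny:...:43] [2020-10-15 05:02:56.02 UTC] Chain extended, new tip: 8933... at slot 11171885"
b = "[relay-ny:...:43] [2020-10-15 05:02:56.14 UTC] Switched to a fork, new tip: 9d2e... at slot 11171885"
print(f"{delta_ms(a, b):.0f} ms")  # 120 ms between adopting 89* and switching to 9d*
```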
FYI, our core and relay nodes are all running well under capacity, with great connectivity to each other and to the outside world; this is also confirmed by the absence of dropped or delayed connections anywhere around the time between the two slots above.
My biggest question: when we produced this block on time and, according to our VRF calculation, were the predetermined slot leader, how exactly did (or can, or might) another block just push ours out of the way? Not just in this specific case, but in general? I’ve written my current mental model out as a small sketch after the notes below; corrections very welcome.
- I’ve tried to read the Ouroboros paper, with maybe only a layman’s understanding of it (I’m an engineer, not a scientist); my main goal is to find any weakness in our architecture that may have made us vulnerable to this.
- I remember reading in some cardano-node issue that something like “VRF alternates” was possible on mainnet, and someone mentioned that on the testnet people were setting back their system time to try to “game the system” of selecting between alternates, but I can’t find that reference now.
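Here is that mental model written out as a small Python sketch. The longest-chain part is what I take from the papers; the tie-break for equal-length chains is only my assumption about how a deterministic rule might resolve a slot battle (I’ve seen the leader VRF value mentioned in that role, but I haven’t confirmed it against the node source), and the numbers below are made up:

```python
# My mental model of the chain-selection decision each node makes.
# The length comparison is the longest-chain rule; the tie-break for
# equal-length chains is my ASSUMPTION, not confirmed against cardano-node.
from dataclasses import dataclass

@dataclass
class Chain:
    length: int          # number of blocks on this chain
    tip_leader_vrf: int  # leader VRF output of the tip block, as an integer

def prefer_candidate(current: Chain, candidate: Chain) -> bool:
    """Should a node abandon `current` and switch to `candidate`?"""
    if candidate.length != current.length:
        # Strictly longer chains always win: the part I'm sure about.
        return candidate.length > current.length
    # Equal length: a slot/height battle. Assumed deterministic tie-break,
    # e.g. the block with the lower leader VRF output wins, so every honest
    # node converges on the same winner regardless of arrival order.
    return candidate.tip_leader_vrf < current.tip_leader_vrf

# Our relay had adopted 89* and then saw the fork ending in 9d* (made-up
# lengths and VRF values, just to show the shape of the decision):
ours   = Chain(length=100, tip_leader_vrf=0x8933)
theirs = Chain(length=100, tip_leader_vrf=0x1d2e)
print(prefer_candidate(ours, theirs))  # True -> switch, which is what we saw
```

If something like this is right, then “arriving first” doesn’t protect a block at all: whichever block loses the tie-break gets orphaned no matter how quickly it propagated, which would explain our relay switching away from a block it had already validated.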
Other questions:
systemd-timesyncd
- We’re using this (instead of ntpd), which is the default on Ubuntu 20.04 servers and, as far as I’ve investigated, considered good enough for timing-sensitive server applications as long as the machines themselves don’t have to be time servers (both our core and relays are only time-sync clients). Could a time discrepancy be an issue here? (A quick offset check is sketched after this list.)
- There’s a Forge log message above (OutcomeTraceForgeEvent), and another message that appears to be a measurement of transaction time, but I can’t tell whether these might indicate a time discrepancy.
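Here is the offset check mentioned above: independent of systemd-timesyncd, it just asks a public NTP server how far off the local clock is. It assumes the third-party ntplib package (pip install ntplib) and a reachable pool server, so treat it as a sketch rather than anything authoritative; timedatectl timesync-status on the hosts should tell a similar story from systemd-timesyncd’s own point of view.

```python
# Cross-check the local clock against a public NTP server, independently of
# systemd-timesyncd. Assumes the third-party `ntplib` package is installed.
import ntplib

def clock_offset_ms(server: str = "pool.ntp.org") -> float:
    """Offset of the local clock relative to `server`, in milliseconds."""
    response = ntplib.NTPClient().request(server, version=3, timeout=5)
    return response.offset * 1000.0

if __name__ == "__main__":
    # Tens of milliseconds either way already make cross-node log timestamp
    # comparisons (like the ~100/200 ms deltas above) ambiguous; anything
    # approaching a second would worry me on a block producer.
    print(f"local clock offset: {clock_offset_ms():+.1f} ms")
```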
pooltool.io > leaderLogs.py
- Coincidentally, we ran this for the first time last night to determine our leadership schedule. (This is a brilliant tool and I’m very happy to have found it, with much appreciation for the people who dug that algorithm out of the code; I’ve sketched the core of the leader check, as I understand it, at the end of this post.) Our leadership schedule this epoch was completely consistent with our actual block production, right up until it came time to produce the very next block on that schedule, which was missed.
- In case this matters: we’re running the script on our core node, to avoid duplicating our VRF key file.
- I’m not being paranoid about this, and I’m fairly confident the code only uses library routines to reproduce what Ouroboros is doing, but if anyone has any idea how this could have influenced the situation, please let us know. (No, I didn’t tweet when our next block was coming out, and we show no signs of DoS in any case.)
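For completeness, here is the core of the leader check as I understand it from the Praos design (and what I believe tools like leaderLogs.py are reproducing): each pool independently leads a slot when its slot-specific VRF output, scaled into [0, 1), falls below phi_f(sigma) = 1 - (1 - f)^sigma. The VRF evaluation itself is omitted below and stood in for by a plain argument; the point I take from it is that the check is per-pool and independent, so two pools can both pass it for the same slot, which as far as I can tell is exactly how these battles arise.

```python
# Bare-bones Praos leader check (my reading of it). The actual VRF evaluation,
# keyed by the pool's VRF key, the epoch nonce and the slot, is omitted;
# `vrf_fraction` stands in for that output scaled into [0, 1).
ACTIVE_SLOT_COEFF = 0.05  # mainnet f

def leader_threshold(relative_stake: float, f: float = ACTIVE_SLOT_COEFF) -> float:
    """phi_f(sigma): chance that a pool with this share of stake leads a slot."""
    return 1.0 - (1.0 - f) ** relative_stake

def is_slot_leader(vrf_fraction: float, relative_stake: float) -> bool:
    """Leader for the slot if the scaled VRF output falls below the threshold."""
    return vrf_fraction < leader_threshold(relative_stake)

# e.g. a pool with 0.01% of the active stake:
print(leader_threshold(0.0001))  # ~5.1e-06, roughly 2 slots led per 432,000-slot epoch
```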