Slot battles: what do they look like & what can be done about them?

This is not a tutorial, but I hope it will be a good exercise for those who’ve analysed these things before (with my gratitude for their help), and a learning experience for SPOs who worry about the same thing.

For all the documentation effort, this may simply be a run-of-the-mill illustration of the “slot battle” I’ve heard of but never knowingly experienced (we might have been on the winning side before without realising it).

We’ve just “lost” our first block and I’m trying to figure out what happened. The long block hex IDs may make this difficult to read, but I wanted them to be readable in blockchain explorers. (This post is not all log messages.)

First, our core node shows we’ve produced a block (89*) in exactly the slot expected in our leadership schedule, and that it extends our copy of the chain with the new block:

[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.00 UTC] Leading slot 11171885

[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []

[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Forged block in slot 11171885

[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []

[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("kind",String "OutcomeTraceForgeEvent"),("difftime",Number 5.74117e8),("slot",Number 1.1171885e7),("mempoolbytes",Number 0.0),("mempoolnumtx",Number 0.0)]

[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:02:56.01 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885

[core-nyc:cardano.node:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("time(ps)",Number 1.788674423921033e18),("kind",String "MeasureTxsTimeStop"),("slot",Number 1.1171885e7)]

[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Adopted block forged in slot 11171885: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4, TxIds: []

[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []

… but then when the next block/slot comes along (block 56*) about 20 seconds later, our core node switches to another fork as it adopts that block… in light of the messages seen below on our relay, this means rejecting our own block in favour of a block we didn’t produce (9d*):

[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885

[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903

[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.54 UTC] Valid candidate 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903

[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:03:14.54 UTC] Switched to a fork, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903

On our local relay, with a gigabit connection to the block producer, we can see that it drops our block (89*) in favour of the “other guy’s” block (9d*) about 100 milliseconds after it receives the block from our own core:

[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.02 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885

[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.13 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885

[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.14 UTC] Valid candidate 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885

[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.14 UTC] Switched to a fork, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885

At this point I guess our relay, and maybe lots of other nodes out there, had two blocks produced in the same slot within a fraction of a second of each other.

  • What I don’t understand: how can our own relay, already having confirmed that our own block is genuine, immediately fork in favour of the “other guy’s” block?

Meanwhile, about 100 milliseconds later (apparently: see the time sync question below), on our “remote” relay (on the other side of the world from the core node, but with great network connectivity):

[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:02:56.27 UTC] Chain extended, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885

[relay-sg:cardano.node.ChainDB:Info:40] [2020-10-15 05:02:56.50 UTC] Block fits onto some fork: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885

[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:03:14.27 UTC] Chain extended, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903

In this case it receives the “other guy’s” block (9d*) about 200 milliseconds before the one that we produced (89*), and snubs our block in favour of theirs: continuing on the same fork that our core node also selected in the next slot (tip 56*).

FYI our core and relay nodes are all lightly loaded, with great connectivity to each other and the outside world: also confirmed by no dropped or delayed connections anywhere around the time between the two slots above.

My biggest, main question: when we produced this block on time and, according to our VRF calculation, were the predetermined slot leader, how exactly can another block just push ours out of the way? Not just in this specific case, but in general?

  • I’ve tried to read the Ouroboros paper, with maybe only a layman’s understanding of it (I’m an engineer, not a scientist) and my main goal is to find any weakness in our architecture that may have made us vulnerable to this.
  • I remember reading in some cardano-node issue that something like “VRF alternates” were possible in mainnet, and someone mentioned on the testnet people were setting back their system time to try to “game the system” of selecting between alternates, but I can’t find that reference now.

Other questions:

systemd-timesyncd - We’re using this (instead of ntpd), which is the default on Ubuntu 20.04 servers and, as far as I’ve investigated, considered good enough for timing-sensitive server applications, as long as the machines themselves don’t have to be time servers (both our core and relays are only time sync clients). Could a time discrepancy be an issue here?

  • There’s a Forge log message above (OutcomeTraceForgeEvent), with another message that appears to be a measurement of transaction time, but I can’t tell whether these might indicate a time discrepancy.
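For anyone wanting to rule out clock drift on a systemd-timesyncd machine, here’s a quick sanity check (standard `timedatectl` verbs on Ubuntu 20.04; the exact output layout varies by systemd version). The offset reported should be orders of magnitude below the one-second slot granularity:

```shell
# Show the current sync source, poll interval, and last measured offset
timedatectl timesync-status

# Expect "NTPSynchronized=yes" if the clock is being disciplined
timedatectl show --property=NTPSynchronized
```

If `timesync-status` reports an offset in the low-millisecond range, time sync is very unlikely to explain a lost slot battle.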

pooltool.io > leaderLogs.py - Coincidentally, we ran this for the first time last night to determine our leadership schedule. (This is a brilliant tool and I’m very happy to have found it, with much appreciation for the people who dug that algorithm out of the code.) Our leadership schedule this epoch was completely consistent with our actual block production, up until the time came to produce the very next block on that schedule, which was missed.

  • In case this is an issue we’re running the script on our core node, to avoid duplicating our VRF key file.
  • I’m not being paranoid, and I feel safe that the code only uses library routines to reproduce what Ouroboros is doing, but if anyone has any idea how this could have influenced the situation then please let us know. (No, I didn’t tweet when our next block was coming out :crazy_face: and we show no signs of DoS in any case. :sunglasses:)

The values of the VRF outputs, normally used to check whether you have the right to sign a block, are also used to select the winning block in case of a battle between several nodes (two or even more). It means that slot battles are won or lost based on pure luck, in a deterministic fashion: the lowest VRF output gets its block on chain.

So there is nothing to brag about when winning, or be desperate about when losing.
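A minimal sketch of that tie-break rule, with hypothetical field names and made-up VRF values (the real node compares the leader VRF output carried in each block header, not a dictionary like this):

```python
# Hypothetical illustration of the slot-battle tie-break described above:
# when two valid blocks occupy the same slot, chain selection prefers the
# one whose leader-election VRF output is numerically lower.
# All hashes and VRF values below are invented for illustration.

def pick_slot_battle_winner(block_a: dict, block_b: dict) -> dict:
    """Return the block with the lower leader VRF output, interpreted
    as a big-endian unsigned integer."""
    assert block_a["slot"] == block_b["slot"], "not a slot battle"
    vrf_a = int.from_bytes(bytes.fromhex(block_a["leader_vrf_output"]), "big")
    vrf_b = int.from_bytes(bytes.fromhex(block_b["leader_vrf_output"]), "big")
    return block_a if vrf_a < vrf_b else block_b

ours = {"slot": 11171885, "hash": "8933a819...", "leader_vrf_output": "7c01"}
theirs = {"slot": 11171885, "hash": "9d2e8333...", "leader_vrf_output": "1a9f"}

winner = pick_slot_battle_winner(ours, theirs)
print(winner["hash"])  # prints 9d2e8333... (their VRF output is lower)
```

The comparison is deterministic: every honest node that sees both blocks reaches the same verdict, which is why your own relay can drop your own block without hesitation.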


I’m not sure if this is the issue I originally found… but a number of related issues point here as the solution, after the problem was originally pointed out earlier in April:

So then all we are still looking for is a subjective / practical understanding of how, when, why, and maybe how often the VRF will allow multiple block producers to believe they are the leader of the same slot. I missed this in what might have been my last attempt to read the academic papers, so a specific reference would be nice. :innocent:

Obviously we have to trust that the system will allow us to win a slot battle as often as losing one, in whatever ratio assures every pool’s probability of producing a block remains proportional to its portion of the active stake. Now that we know what a slot battle looks like, we mainly need to anticipate roughly how often these will occur.
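For a rough sense of frequency, here’s a back-of-the-envelope estimate (my own assumption, not from this thread: stake spread across many small independent pools, so the number of elected leaders per slot is approximately Poisson). With Cardano’s active slot coefficient f = 0.05:

```python
import math

# Approximate slot-battle frequency under a Poisson model of leader
# election. With active slot coefficient f, each unit of stake wins a
# slot with probability phi(s) = 1 - (1 - f)^s, so with many small
# pools the leader count per slot is ~ Poisson(-ln(1 - f)).
f = 0.05
lam = -math.log(1 - f)

p_at_least_one = 1 - math.exp(-lam)              # equals f exactly: ~5% of slots make a block
p_at_least_two = 1 - math.exp(-lam) * (1 + lam)  # two or more elected leaders in one slot

# Fraction of block-producing slots that turn into a battle:
battle_rate = p_at_least_two / p_at_least_one
print(f"~{battle_rate:.1%} of blocks contested")  # prints ~2.5% of blocks contested
```

So under these assumptions a pool should expect to fight a slot battle for very roughly one in forty of its scheduled blocks, winning about half of them on average.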

1 Like

It does not matter: slot battles are resolved by the protocol by selecting the smaller VRF value when two blocks for the same slot are distributed to the network from different pools (different VRFs), as @Psychomb has already mentioned above.


And to be precise, it’s not based on the pool’s VRF vkey (otherwise everybody would brute-force a smaller VRF key), but on the calculated leader election value in the block header.
