This is not a tutorial, but I hope it will be a good exercise for those who’ve analysed these things before (with my gratitude for that help), and a learning experience for SPOs who worry about the same thing.
For all the documentation effort, this may simply be a run-of-the-mill illustration of the “slot battle” I’ve heard of but never knowingly experienced (we might have been on the winning side before without realising it).
We’ve just “lost” our first block and I’m trying to figure out what happened. The long block hex IDs may make this hard to read, but I wanted them to be findable in blockchain explorers. (This post is not all log messages.)
First, our core node shows we’ve produced a block (89*) in exactly the slot expected in our leadership schedule, and that it extends our copy of the chain with the new block:
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.00 UTC] Leading slot 11171885
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Forged block in slot 11171885
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("kind",String "OutcomeTraceForgeEvent"),("difftime",Number 5.74117e8),("slot",Number 1.1171885e7),("mempoolbytes",Number 0.0),("mempoolnumtx",Number 0.0)]
[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:02:56.01 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[core-nyc:cardano.node:Info:45] [2020-10-15 05:02:56.01 UTC] fromList [("time(ps)",Number 1.788674423921033e18),("kind",String "MeasureTxsTimeStop"),("slot",Number 1.1171885e7)]
[core-nyc:cardano.node.Forge:Info:45] [2020-10-15 05:02:56.01 UTC] Adopted block forged in slot 11171885: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4, TxIds: []
[core-nyc:cardano.node.ForgeTime:Info:45] [2020-10-15 05:02:56.01 UTC] fromList []
… but then when the next block/slot comes along (block 56*) about 20 seconds later, it switches to another fork at the moment it takes on that block… in light of the messages seen below on our relay, rejecting our own block in favour of a block we didn’t produce (9d*) when taking that fork:
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.53 UTC] Block fits onto some fork: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
[core-nyc:cardano.node.ChainDB:Info:35] [2020-10-15 05:03:14.54 UTC] Valid candidate 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
[core-nyc:cardano.node.ChainDB:Notice:35] [2020-10-15 05:03:14.54 UTC] Switched to a fork, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
On our local relay, with a gigabit connection to the block producer, we can see that it drops our block (89*) in favour of the “other guy’s” block (9d*) about 100 milliseconds after it receives the block from our own core:
[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.02 UTC] Chain extended, new tip: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.13 UTC] Block fits onto some fork: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-ny:cardano.node.ChainDB:Info:43] [2020-10-15 05:02:56.14 UTC] Valid candidate 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-ny:cardano.node.ChainDB:Notice:43] [2020-10-15 05:02:56.14 UTC] Switched to a fork, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
At this point I’d guess our relay, and perhaps many other nodes out there, have seen two blocks produced in the same slot within a fraction of a second of each other.
- What I don’t understand: how our own relay, having already confirmed that our own block is genuine, can immediately switch forks in favour of the “other guy’s” block.
Meanwhile, about 100 milliseconds later (apparently: see the time sync question below), on our “remote” relay (on the other side of the world from the core node, but with great network connectivity):
[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:02:56.27 UTC] Chain extended, new tip: 9d2e8333fbba3684f2e4d68350693287a6aff65fbc8be1998cd5f9b55031464f at slot 11171885
[relay-sg:cardano.node.ChainDB:Info:40] [2020-10-15 05:02:56.50 UTC] Block fits onto some fork: 8933a8197930da4a54efdbdb702519e58c76868ce70c6b262974e6bdbca430d4 at slot 11171885
[relay-sg:cardano.node.ChainDB:Notice:40] [2020-10-15 05:03:14.27 UTC] Chain extended, new tip: 56dbb84bbbc5dfe0caf6b2b304aba7cc8e69c0dc0ec0872c51aeec9cf23c6894 at slot 11171903
In this case it receives the “other guy’s” block (9d*) about 200 milliseconds before the one we produced (89*), and snubs our block in favour of theirs, continuing on the same fork that our core node also selected in the next slot (tip 56*).
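(In case anyone wants to sanity-check the millisecond figures I’m quoting, here’s roughly how I’m reading them: just parse the bracketed UTC timestamps out of the log lines. This is only a sketch, with two of the relay-ny lines pasted in by hand and the hashes and logger names truncated.)

```python
# Measure the gap between two cardano-node log lines by parsing the bracketed
# UTC timestamps. Only a sketch: lines are pasted in by hand, hashes truncated.
import re
from datetime import datetime

TS_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC\]")

def parse_ts(line):
    """Return the timestamp embedded in a node log line, or None if absent."""
    m = TS_RE.search(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None

def delta_ms(line_a, line_b):
    """Milliseconds between two log lines (b minus a)."""
    return (parse_ts(line_b) - parse_ts(line_a)).total_seconds() * 1000.0

a = "[relay-ny:...:43] [2020-10-15 05:02:56.02 UTC] Chain extended, new tip: 8933... at slot 11171885"
b = "[relay-ny:...:43] [2020-10-15 05:02:56.14 UTC] Switched to a fork, new tip: 9d2e... at slot 11171885"
print(f"{delta_ms(a, b):.0f} ms")  # 120 ms between adopting 89* and switching to 9d*
```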
FYI, our core and relay nodes are all running well under capacity, with great connectivity to each other and to the outside world; this is also confirmed by the absence of dropped or delayed connections anywhere around the time between the two slots above.
My biggest question: when we produced this block on time and, according to our VRF calculation, were the predetermined slot leader, how exactly did (or can, or might) another block just push ours out of the way? Not just in this specific case, but in general? I’ve written my current mental model out as a small sketch after the notes below; corrections very welcome.
- I’ve tried to read the Ouroboros paper, with maybe only a layman’s understanding of it (I’m an engineer, not a scientist); my main goal is to find any weakness in our architecture that may have made us vulnerable to this.
- I remember reading in some cardano-node issue that something like “VRF alternates” was possible on mainnet, and someone mentioned that on the testnet people were setting back their system time to try to “game the system” of selecting between alternates, but I can’t find that reference now.
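Here is that mental model written out as a small Python sketch. The longest-chain part is what I take from the papers; the tie-break for equal-length chains is only my assumption about how a deterministic rule might resolve a slot battle (I’ve seen the leader VRF value mentioned in that role, but I haven’t confirmed it against the node source), and the numbers below are made up:

```python
# My mental model of the chain-selection decision each node makes.
# The length comparison is the longest-chain rule; the tie-break for
# equal-length chains is my ASSUMPTION, not confirmed against cardano-node.
from dataclasses import dataclass

@dataclass
class Chain:
    length: int          # number of blocks on this chain
    tip_leader_vrf: int  # leader VRF output of the tip block, as an integer

def prefer_candidate(current: Chain, candidate: Chain) -> bool:
    """Should a node abandon `current` and switch to `candidate`?"""
    if candidate.length != current.length:
        # Strictly longer chains always win: the part I'm sure about.
        return candidate.length > current.length
    # Equal length: a slot/height battle. Assumed deterministic tie-break,
    # e.g. the block with the lower leader VRF output wins, so every honest
    # node converges on the same winner regardless of arrival order.
    return candidate.tip_leader_vrf < current.tip_leader_vrf

# Our relay had adopted 89* and then saw the fork ending in 9d* (made-up
# lengths and VRF values, just to show the shape of the decision):
ours   = Chain(length=100, tip_leader_vrf=0x8933)
theirs = Chain(length=100, tip_leader_vrf=0x1d2e)
print(prefer_candidate(ours, theirs))  # True -> switch, which is what we saw
```

If something like this is right, then “arriving first” doesn’t protect a block at all: whichever block loses the tie-break gets orphaned no matter how quickly it propagated, which would explain our relay switching away from a block it had already validated.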
Other questions:
systemd-timesyncd
- We’re using this (instead of ntpd), which is the default on Ubuntu 20.04 servers and, as far as I’ve investigated, considered good enough for timing-sensitive server applications as long as the machines themselves don’t have to be time servers (both our core and relays are only time-sync clients). Could a time discrepancy be an issue here? (A quick offset check is sketched after this list.)
- There’s a Forge log message above (OutcomeTraceForgeEvent), and another message that appears to be a measurement of transaction time, but I can’t tell whether these might indicate a time discrepancy.
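Here is the offset check mentioned above: independent of systemd-timesyncd, it just asks a public NTP server how far off the local clock is. It assumes the third-party ntplib package (pip install ntplib) and a reachable pool server, so treat it as a sketch rather than anything authoritative; timedatectl timesync-status on the hosts should tell a similar story from systemd-timesyncd’s own point of view.

```python
# Cross-check the local clock against a public NTP server, independently of
# systemd-timesyncd. Assumes the third-party `ntplib` package is installed.
import ntplib

def clock_offset_ms(server: str = "pool.ntp.org") -> float:
    """Offset of the local clock relative to `server`, in milliseconds."""
    response = ntplib.NTPClient().request(server, version=3, timeout=5)
    return response.offset * 1000.0

if __name__ == "__main__":
    # Tens of milliseconds either way already make cross-node log timestamp
    # comparisons (like the ~100/200 ms deltas above) ambiguous; anything
    # approaching a second would worry me on a block producer.
    print(f"local clock offset: {clock_offset_ms():+.1f} ms")
```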
pooltool.io > leaderLogs.py
- Coincidentally, we ran this for the first time last night to determine our leadership schedule. (This is a brilliant tool and I’m very happy to have found it, with much appreciation for the people who dug that algorithm out of the code; I’ve sketched the core of the leader check, as I understand it, at the end of this post.) Our leadership schedule this epoch was completely consistent with our actual block production, right up until it came time to produce the very next block on that schedule, which was missed.
- In case this matters: we’re running the script on our core node, to avoid duplicating our VRF key file.
- I’m not being paranoid about this, and I’m fairly confident the code only uses library routines to reproduce what Ouroboros is doing, but if anyone has any idea how this could have influenced the situation, please let us know. (No, I didn’t tweet when our next block was coming out, and we show no signs of DoS in any case.)
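For completeness, here is the core of the leader check as I understand it from the Praos design (and what I believe tools like leaderLogs.py are reproducing): each pool independently leads a slot when its slot-specific VRF output, scaled into [0, 1), falls below phi_f(sigma) = 1 - (1 - f)^sigma. The VRF evaluation itself is omitted below and stood in for by a plain argument; the point I take from it is that the check is per-pool and independent, so two pools can both pass it for the same slot, which as far as I can tell is exactly how these battles arise.

```python
# Bare-bones Praos leader check (my reading of it). The actual VRF evaluation,
# keyed by the pool's VRF key, the epoch nonce and the slot, is omitted;
# `vrf_fraction` stands in for that output scaled into [0, 1).
ACTIVE_SLOT_COEFF = 0.05  # mainnet f

def leader_threshold(relative_stake: float, f: float = ACTIVE_SLOT_COEFF) -> float:
    """phi_f(sigma): chance that a pool with this share of stake leads a slot."""
    return 1.0 - (1.0 - f) ** relative_stake

def is_slot_leader(vrf_fraction: float, relative_stake: float) -> bool:
    """Leader for the slot if the scaled VRF output falls below the threshold."""
    return vrf_fraction < leader_threshold(relative_stake)

# e.g. a pool with 0.01% of the active stake:
print(leader_threshold(0.0001))  # ~5.1e-06, roughly 2 slots led per 432,000-slot epoch
```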