Hi gang - over the last two epochs my pool has had a sudden string of orphans and I can't quite figure it out. Each one is a height battle that I lose. My propagation time seems fine, averaging about 500ms. gLiveView on my BP says I was the leader and the block was adopted, but pooltool.io shows it was orphaned.
When I look at the orphan details on pooltool, my block shows "nodes reporting" as just one, while the pool that beats me has something like ~275 nodes reporting. I've got 2 published relays and a third that isn't published, all in p2p mode. Things were working fine until about a week ago.
All my relays have plenty of connections, and pooltool shows my two public relays as online.
I thought I was having a DNS resolution issue, but I fixed that and still missed the next block.
Any troubleshooting advice would be helpful.
Thanks!
-Sully
I checked my logs, and my block producer node never received any block for your slot. That is to say, other nodes in the Cardano network did not propagate such a block to my relays. My block producer did, however, receive the FLAC block 780ms after the start of slot 118516297.
Your leadership allocation was for slot 118516294. FLAC pool produced a block for slot 118516297 (3 seconds later); its block was built upon the same parent block as yours and was propagated to the network. It won the "fork battle" presumably because its block VRF value was lower. Your block was propagated to at least 1 node that reported receiving it to pooltool (but that pooltool report could have come from one of your own nodes?).
If FLAC pool had received your block within the 3-second window before it produced its own, then presumably it would have built its block upon yours as the parent.
So either FLAC pool didn’t receive your block in time or FLAC pool is running modified software which deliberately creates forks when it knows its block VRF is lower. (See this link for an outline of this attack and a possible software fix: Subpar handling of blocks from the future · Issue #4251 · IntersectMBO/ouroboros-network · GitHub )
Most likely the problem is the former since my relays did not receive your block prior to receiving the FLAC block either.
But my propagation times are good, and I have chrony running to stay time-synced. All three of my blocks last epoch did the same thing. I've minted plenty of blocks on 8.7.3 without changing anything prior to this; just not sure what to troubleshoot next.
The only time I've ever seen more than a statistically plausible number of orphans in a short time period is after the number of Hot Peers on our relay co-located with our BP gets past 300, toward around 350, with over 400 full-duplex connections (at those times the load average of our 12-core machine is still only around 1, so the system itself doesn't look overloaded).
Checking time sync and propagation delays was also the first thing I thought of: both optimal. So I’m operating under the assumption now that this number of P2P connections isn’t really vetted by IOG, since we only started to see it after our stake increased and I guess we became more attractive as a “ledger peer” proportionally to the square of our ratio to total stake… perhaps a difficult thing to simulate under laboratory conditions.
Since then I've set a policy of rebooting that relay (building those Hot Peers back up from a small number) every time I see more than one orphan in a short time period. Generally we can then go 100 blocks without losing a slot battle, so that seems to restore performance. I'd be curious whether others have seen a "string of orphans" with an equivalent number of hot peers / duplex connections, and what they did in response.
I have all my relays' config.json files set to limit the number of peers, as shown below. I'm assuming this is what keeps my hot peers from growing that large? So I'm not thinking that's my issue here.
I have 2 published relays that maintain under 100 connections, with 20-25 hot peers at a time. I have a third, unpublished relay that gets fewer incoming connections but still holds 20 hot peers.
Seems like the blocks my BP is making should have no problem getting out to the chain.
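Something along these lines, using the standard P2P target settings in config.json. The exact numbers below are illustrative of the ballpark I run, not a recommendation; TargetNumberOfActivePeers is the hot-peer target, which is why mine sit around 20-25:

{
  "TargetNumberOfRootPeers": 50,
  "TargetNumberOfKnownPeers": 100,
  "TargetNumberOfEstablishedPeers": 50,
  "TargetNumberOfActivePeers": 20
}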
courtesy of the Armada-Alliance
I would go back to first principles and do the following:
- Check the logs on your block producer to see when, and how long, it took to make the block in question
- Check the logs on your relays to see how long it took them to download your block (the tracer settings sketched after this list help surface these timings)
- Confirm all system times on all machines are synchronised and correct
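If those events don't show up in your logs at all, the relevant tracers may need switching on first. A minimal sketch for the legacy tracing system in config.json (these four tracer names exist in the stock config; which ones you actually need depends on your logging setup):

{
  "TraceForge": true,
  "TraceChainDb": true,
  "TraceBlockFetchClient": true,
  "TraceBlockFetchServer": true
}

TraceForge covers leadership and block-forging events on the BP, TraceChainDb shows when a block is adopted as the new tip, and the two BlockFetch tracers show blocks being downloaded from and served to peers.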
Does your BP report to pooltool? Because then that reporting node could just be yourself…
Did you check your relays got the block from your BP?
@mdsullivan I've had the same issue, and the problem was that my relays were not pulling blocks from the BP.
I had just switched to the p2p topology and messed up the valency.
{
"localRoots": [
{
"accessPoints": [
{
"address": "relay.cardano-mn.svc.cluster.local",
"port": 3000
}
],
"advertise": false,
"valency": 1
},
{
"accessPoints": [
{
"address": "block-producer.cardano-mn.svc.cluster.local",
"port": 3000
}
],
"advertise": false,
"valency": 1
}
],
"publicRoots": [
{
"accessPoints": [
],
"advertise": false
}
],
"useLedgerAfterSlot": 110332824
}
In the config I originally had, both the relay and the BP were in the same accessPoints group with valency 1, so basically my relays were talking to each other and ignoring the BP. The config above is the fix: split them into 2 groups, each with valency 1. (Alternatively, keep them together in one group and set valency to 2.)
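For contrast, the broken shape was roughly this (a sketch of my old config; with two addresses in a single group but valency 1, the node only keeps a hot connection to one of them):

{
  "localRoots": [
    {
      "accessPoints": [
        { "address": "relay.cardano-mn.svc.cluster.local", "port": 3000 },
        { "address": "block-producer.cardano-mn.svc.cluster.local", "port": 3000 }
      ],
      "advertise": false,
      "valency": 1
    }
  ]
}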
How’s your topology?
Thanks for the tips, everyone. I think I'm good now. It looks like it was a couple of overlapping events. I was better able to diagnose the issue by putting my BP back in non-p2p mode so I could look at the peers list and see which of my relays were incoming vs outgoing. There was an error on my DNS resolution server that was preventing the connection between my relays and the BP. Though I fixed that, I also changed my topology to prevent that possibility in the future (sketched below).
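The topology change was along these lines: pin the BP in each relay's localRoots by IP address rather than hostname, so a resolver outage can't sever the relay-to-BP link (the address and port below are placeholders, not my real BP):

{
  "localRoots": [
    {
      "accessPoints": [
        { "address": "10.0.0.5", "port": 6000 }
      ],
      "advertise": false,
      "valency": 1
    }
  ]
}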
As a couple of you had guessed, my one "node reporting" turned out to be my BP reporting through sendtip, so the block it made never actually reached my relays.
I've seen a decent number of other pools run into problems as they switch the final relay in their setup to full p2p. Some of the primary documentation sites out there are not nearly as clear on p2p as they are on the legacy topology, so resources like the Cardano Forum, the Armada Alliance, and Cardano SPO best-practices guides are invaluable in keeping our blockchain healthy. The camaraderie is a reminder of what drew me to the Cardano community over 4 years ago. So thanks!
The true test is my next block, 2 days from now, but I'm pretty confident we're back in business.
Thanks again!