Problem with increasing block size or processing requirements

I see talk about wanting to increase the block size and/or increase node processing requirements through an increase in memory units. However, it is important to consider what effects such changes will have on Cardano’s decentralisation.

How full a block is already makes quite a bit of difference to how long it takes to distribute. Most of the time my block producer in Australia can distribute its blocks to the majority of nodes in the USA/EU in under 1 second.

However, check out this block my BP just produced. It was full, at 86.57 kB in size, containing 64 transactions and 66.17 kB of scripts: https://cexplorer.io/block/c740f9ce8b25410ddb938ff8c42e12738c18b7fd040ae5224c53fb45f04b3ba0

These are the delays before each of my own relays included this block in their chains:

  • Relay 1 (ARM on same LAN) → Delayed=0.263s
  • Relay 2 (AMD on adjacent LAN) → Delayed=0.243s
  • Relay 3 (ARM approx. 5 km away) → Delayed=0.288s
  • Relay 4 (AMD Contabo VPS in USA) → Delayed=2.682s
  • Relay 5 (ARM Netcup VPS in USA) → Delayed=1.523s

The average propagation delay reported by nodes sending their data to PoolTool was 1.67 seconds.

As you can see, it is already too difficult for a BP using first world (Australian) fibre infrastructure to distribute its blocks from Australia in under 1 second to the rest of Cardano. My BP couldn’t do it with this block. Any increase in block size or processing requirements will likely make propagation delays worse.

The centralisation problem

Any delay that causes propagation to increase beyond just 1 second will result in those blocks getting involved in more “fork battles”. In fact, just 1 second of delay will see your battle count triple, from 5% of your blocks to 15%. These battles are settled by nodes preferring the block with the lowest VRF result. The more fights you have, the more you will lose, since on average you will lose half of these battles. If your block delays increase to 2 seconds then 25% of your blocks will get involved in “fork battles”.

The way to minimise “fork battles” is to warehouse your BP in an Amazon data centre in USA or EU because this will minimise your propagation delays with the majority. But such a solution is not geographic decentralisation, and hands a lot of power to Jeff Bezos because now he can manipulate network delays within his data centre to selectively control who sees what and when.

Granted, Australia is on the opposite side of the world from the majority, but surely this geographic spread should also be a good thing for resiliency and decentralisation.

I have written a CPS about this problem where I propose a solution:

7 Likes

Coincidentally, the last two TERM blocks in a row happened to have a leader scheduled for the very next slot, and they provide a great illustration of the problem:

  1. A full block, for which PoolTool reported an average propagation time of 0.87s → Despite the average reported propagation being less than 1 second, the next producer IOGP did not receive it in time for the next slot and created a fork. Unfortunately IOGP’s block had the lower VRF, so TERM lost the “fork battle” and got its block orphaned. That TERM block does not contribute to the chain or receive any rewards.
  2. A small block, for which PoolTool reported an average propagation time of 0.62s → This one fortunately was received by the next producer TLK in time for it to produce its block at the next slot. So there was no fork and TERM’s block did contribute to the chain.
1 Like

Another important consideration is that it is possible to maliciously game these forks.

The block VRF only depends on the following inputs:

  • Epoch nonce
  • Slot number
  • Pool private key

Therefore each pool knows its own block VRF values ahead of time.

A malicious operator can run a modified version of cardano-node that inspects the previous block’s VRF, compares it to its own value, and deliberately causes a fork only when it knows it will win the “fork battle”. This would allow a malicious group of pools to deliberately “orphan” the blocks of competitors in order to earn a higher percentage of the reward pot and gain more control over consensus.
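As a purely hypothetical illustration of the decision logic being described (this is not cardano-node code; it simply assumes the tie-break rule above, where the competing block with the lower VRF output wins):

    # Hypothetical sketch only, not cardano-node code.
    # Assumes the tie-break rule described above: the block with the lower VRF output
    # wins the "fork battle".
    def worth_forking(own_block_vrf: int, previous_block_vrf: int) -> bool:
        """A modified node would deliberately ignore the just-received block and
        build on its parent (creating a fork) only when its own, pre-computable
        VRF output is lower, i.e. only when it already knows it would win."""
        return own_block_vrf < previous_block_vrf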

2 Likes

Having to use a VPS, and opting to use one of the big ones like AWS, is indeed, I think, an affront to one of the nearly universal guiding principles of most blockchains/cryptocurrencies, which is decentralization. Even if Bezos wasn’t going to do something intentionally malicious, if a majority of stakers are using AWS and it goes down (which it has, and surely will again), that could significantly affect transactions. I’ve seen this suggested as a way to stake for other cryptos and I tend to balk at the idea.

Practicality cannot be ignored of course, but I tend to prefer solutions which are as favorable as possible to individual staking (meaning lower hardware and bandwidth requirements) for a few reasons, with the danger of consolidation (purposeful or otherwise) being the big one.

1 Like

Interesting considerations and insights!
Thank you for sharing this; I will take it into account if it comes to a vote.
Best Regards!

1 Like

Can you please elaborate on these? How did you calculate them? For comparison, I calculated the following numbers for blocks being created within the same time ranges:

Time range (s)   Probability
0-1              0.12%
0-3              1.02%
0-5              2.65%
0-10             9.02%
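For reference, these figures match the probability of seeing at least two blocks within the window when block production is modelled as a Poisson process with rate f = 0.05 blocks per second (Cardano’s active slot coefficient with 1-second slots); a quick sketch of that calculation:

    import math

    f = 0.05  # expected blocks per second (active slot coefficient, 1-second slots)

    def p_two_or_more_blocks(window_s: float, rate: float = f) -> float:
        """Probability of at least two blocks within the window, modelling
        block production as a Poisson process with the given rate."""
        lam = rate * window_s
        return 1.0 - math.exp(-lam) * (1.0 + lam)

    for w in (1, 3, 5, 10):
        print(f"0-{w} s: {p_two_or_more_blocks(w):.2%}")
    # 0-1 s: 0.12%, 0-3 s: 1.02%, 0-5 s: 2.65%, 0-10 s: 9.02%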
1 Like

Yes, there are problems with increasing block size.

The network propagation delay of blocks is tied to the way TCP operates during “slow start”. During this mode the amount of data in flight doubles every round trip.
The table below is a guesstimate of the number of round trips needed for different block sizes, assuming an initial congestion window of 10 segments and a maximum segment size payload of 1460 bytes.

Block Size (bytes)   Round Trips
0 … 14600            1
14601 … 29200        2
29201 … 58400        3
58401 … 90112        4
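For anyone who wants to recompute this for other block sizes or window settings, here is a small sketch of the same guesstimate (the doubling-window model, 10-segment initial window and 1460-byte MSS are the assumptions stated above):

    MSS = 1460       # assumed maximum segment size payload (bytes)
    INITCWND = 10    # assumed initial congestion window (segments)

    def round_trips(block_bytes: int, initcwnd: int = INITCWND, mss: int = MSS) -> int:
        """Round trips under the simple model above: the window starts at
        initcwnd segments and doubles every round trip, so n round trips can
        deliver up to initcwnd * 2**(n - 1) * mss bytes."""
        window = initcwnd * mss
        n = 1
        while block_bytes > window:
            window *= 2
            n += 1
        return n

    for size in (14600, 29200, 58400, 90112, 116800, 116801):
        print(f"{size:>6} bytes -> {round_trips(size)} round trip(s)")
    # 14600 -> 1, 29200 -> 2, 58400 -> 3, 90112 -> 4, 116800 -> 4, 116801 -> 5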

The guard rails in the constitution have identified this problem:
“MBBS-06 (x - “should”) The block size should not induce an additional Transmission Control Protocol (TCP) round trip. Any increase beyond this must be backed by performance analysis, simulation and benchmarking”, from draft-constitution/2024-12-05/draft-constitution-converted.md at main · IntersectMBO/draft-constitution · GitHub. Using the previous assumptions, the next round trip would be needed above 116800 bytes.

The round trip time is limited by the speed of light; if you have a good, uncongested network connection, it is what it is. However, you can control the number of round trips a block requires by tweaking a few system options.

By default, if a connection remains unused for a short period of time, the congestion window will shrink back to 3 or 10 segments. You can disable this with:
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
This means that if you have built up a large congestion window to, for example, your relays in the USA, you may be able to send the entire block in one round trip time.

You can also increase the initial congestion window for sockets with:
ip route change default ... initcwnd 42 initrwnd 42
Use ip route show to get your default route and append initcwnd 42 initrwnd 42.
For example:
JUST PASTING THE EXAMPLE BELOW WILL NOT WORK AND WILL FORCE YOU TO ACCESS YOUR SYSTEM THROUGH THE CONSOLE.
ip route change default via 192.168.0.1 dev ens5 proto dhcp src 192.168.0.150 metric 100 initcwnd 42 initrwnd 42

If your node has IPv6 enabled run ip -6 route show and append initcwnd 42 initrwnd 42 to that default route too.

I made the above changes on one of Cardano Foundation’s relays in Australia and added your relays as localroots. Did you notice an improvement in block propagation time since 2024-12-07?

The graph labeled “TERM Block Delay to Paris” shows the time it took a block you forged to reach a Cardano Foundation node in Paris. That node in Paris has no localroots; it makes all its connections through ledger or peer sharing. Even with so few samples it shows a statistically significant decrease in delay, from around 640 ms down to 290 ms, for 4-round-trip blocks.
I’d be very interested to hear of your experience after making the above changes to one or two of your relays in Australia.

3 Likes

Thank you very much for explaining that @karknu. I need to read more so that I fully understand all implications of your suggested changes.

I have also been changing things at my end, which complicates the analysis. For example, I sometimes run my BP on an ARM machine which does not have a very fast processor but does have good networking capability. Currently I am running my BP on a more performant server with an AMD processor. I have also experimented with using the Linux tc tool to gain more control over buffers and prioritise cardano-node traffic, but I didn’t think this made a lot of difference.

Even though I have a very reliable business level fibre service, I am seriously considering purchasing an additional High Class of Service 1:1 contended service, with guarantees around contention, frame delay, and some other parameters.

Thank you for doing that.

That also raises some questions I had regarding the settings in config.json about the number of active peers. On a couple of my relays I changed the config.json values from the defaults as follows:

  "TargetNumberOfActivePeers": 50,
  "TargetNumberOfEstablishedPeers": 100,
  "TargetNumberOfKnownPeers": 200,
  "TargetNumberOfRootPeers": 80,

I did this because I thought having more active peers would distribute my blocks quicker. But I was concerned that if the node got too many active peers and they all requested blocks at the same time as my BP produced its block, then localroots might not be prioritised and my block could be delayed.

I theorise that this shouldn’t be a problem for my relay on the same LAN as my BP. However, I worry about it for my relays that are across the internet from my BP, since their bottleneck will be their internet links, and if too many Cardano nodes request blocks from those relays right at the time my BP produces a block then my block could be delayed.

So if all 50 active peers were trying to download 80 kByte blocks at the same time (unlikely), that could be more than 32 Mbit of data at that instant? And my relay 5 km away from my BP has a fibre link of 100 Mbit/s. The internet connection where my BP is located is 500 Mbit/s down and 200 Mbit/s up.
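To put rough numbers on that worst case (back-of-envelope arithmetic only, using the 80 kByte blocks and the link speeds mentioned above):

    peers = 50
    block_bytes = 80 * 1024             # assumed ~80 kByte block
    burst_bits = peers * block_bytes * 8

    for link_mbit_s in (100, 200):      # relay fibre link, BP uplink (as above)
        seconds = burst_bits / (link_mbit_s * 1_000_000)
        print(f"{burst_bits / 1e6:.1f} Mbit over a {link_mbit_s} Mbit/s link: {seconds:.2f} s")
    # ~32.8 Mbit: about 0.33 s at 100 Mbit/s and 0.16 s at 200 Mbit/s

So even this unlikely worst-case burst would only tie up the 100 Mbit/s link for roughly a third of a second, though that is still a meaningful slice of a 1-second propagation budget.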

Do you have a feel for how to set TargetNumberOfActivePeers, and whether setting it too high will make things worse?

In any case, these networking optimisations and fibre link performance issues become more problematic when you are located on the other side of the world from the majority and there is only 1 second to play with.

1 Like

The TargetNumberOfActivePeers config value controls how many peers your node will download headers and blocks from; that is, it controls the number of upstream peers your node will have. It doesn’t directly impact how many peers will connect to and download blocks from your node (downstream peers).

Your relays will let their downstream peers know that they have a new block by sending them the new header. If those peers don’t have that block they may decide to download it from your relay. Unless it is a block that your BP just produced, it is unlikely that all your downstream peers would be interested in it. And since the downstream peers talking to you could be anywhere from a few ms away (same data center) to hundreds of ms away (other side of the world), they would not learn of and respond to your new block at the same time.

Rather than run with 50 TargetNumberOfActivePeers, I think you would be better off pairing up with some other stake pool operator in Australia. That is, add one or two of each other’s relays in Australia as untrusted localroot peers.

3 Likes

I had thought that this was the total number of active peers, irrespective of which end initiated the TCP connection. Thanks for explaining that. So is there no way to regulate the number of downstream peers that can maintain connections and request to pull blocks all at once?

I am thinking of the problem where there are 2 blocks in successive slots, with the first produced by some other pool and the second produced by my own BP. If a lot of downstream nodes request the first block from my relay (an external one, not on the LAN with the BP) right at the time my BP produces the second block, then this might temporarily saturate that relay’s internet link and delay it pulling my block from my BP? This is why I was thinking of using the tc tool to prioritise traffic to/from the BP, but I am not sure whether it would make any difference.

I thought that might be the case but wasn’t sure regarding the P2P mechanism. But why an Australian one? I assumed I would be better off getting IOG/CF/Emurgo or even Binance to add my relays as localroots, because they will be nearer the centre of the network due to all the blocks they produce attracting other relays via the P2P mechanism?

Also requiring specific permissioning from other pool operators seems like something that shouldn’t be required since Cardano is trying to be permissionless and all that.

1 Like

So is there no way to regulate the number of downstream peers that can maintain connections and request to pull blocks all at once?

You can’t force a peer to have your relay as a hot upstream peer (that is, one it fetches headers and blocks from). You can limit the number of incoming connections, but if you want your blocks to propagate quickly across the world you want lots of downstream peers.

When your node has downloaded a new block, a few things happen. It will let all its downstream peers know that it has a new block. It will start to adopt the block, which can take anywhere from a few ms to hundreds of ms. It will send the block to the downstream peers that request it. Once the block is adopted it will remove the TXs in the block from the mempool. If the mempool was full, this will cause it to request and apply new TXs from its downstream peers. In the scenario above I think your relay would be more limited by CPU capacity than by network capacity. cardano-node will use multiple cores, so if you have lots of cores don’t stick with -N2 or -N4.

But why an Australian one?

If a relay in Australia has a new block it is a lot quicker for you to fetch it from that relay than to get it from Europe.

Also requiring specific permissioning from other pool operators seems like something that shouldn’t be required since Cardano is trying to be permissionless and all that.

You don’t need permission to add another SPO’s relays as localroots. What I was proposing was a mutually beneficial agreement between you and another SPO in Australia. You would connect to each other’s relays, which means that you would help propagate each other’s blocks. Since you are both near each other, you would likely be the first to present the other pool’s blocks to your downstream peers. This would make your relays more popular, and you would attract more downstream peers, which in turn would help propagate your pool’s blocks.

3 Likes

@karknu Thanks again for pointing this out.

For those following:

I have disabled TCP “slow start after idle” and set the initial congestion window size and initial advertised receive window size, persisting across reboots, as follows:

Added the following to /etc/sysctl.d/local.conf:

net.ipv4.tcp_slow_start_after_idle=0

Then activated the change with: service procps force-reload

Added the following to /etc/systemd/network/20-eth0.network:

[Route]
Gateway=X.X.X.X
# Multiples of MSS = 1460 bytes (defaults are 10)
InitialCongestionWindow=32
InitialAdvertisedReceiveWindow=32

Then activated this with: systemctl daemon-reload and networkctl reload
Note: If you are not already using systemd-networkd to configure network devices, then extra setup is required.

Now I see the following:
ip route show

default via 152.53.52.1 dev eth0 proto static initcwnd 32 initrwnd 32

sysctl net.ipv4.tcp_slow_start_after_idle

net.ipv4.tcp_slow_start_after_idle = 0

And ss -n -i -t -e shows cwnd values like:

ESTAB 0 0 X.X.X.X:X X.X.X.X:X … cgroup:/system.slice/cardano-node.service ↔ ts sack … cwnd:32 …
ESTAB 0 0 X.X.X.X:X X.X.X.X:X … cgroup:/system.slice/cardano-node.service ↔ ts sack … cwnd:103 …

Now I am wondering if it would be better to increase the InitialCongestionWindow and InitialAdvertisedReceiveWindow values further, since 1460 * 32 = 46720 is still smaller than the current max Cardano block size?
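As a rough check (taking the 90112-byte figure from the round-trip table earlier in the thread as the current max block size, and the same 1460-byte MSS):

    MSS = 1460          # assumed maximum segment size payload (bytes)
    MAX_BLOCK = 90112   # max block size, taken from the round-trip table above

    segments = -(-MAX_BLOCK // MSS)     # ceiling division
    print(segments)                     # 62: a 32-segment initial window cannot cover
                                        # a full block in one window, but 64 could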

The other end of each TCP connection sets its own TCP congestion window size. I have configured all my relays and BP similarly so that at least the connections between all my relays and BP will use these larger TCP windows.

Should we be encouraging other pool operators to increase their TCP window sizes???

3 Likes

Now I am wondering if it would be better to increase the InitialCongestionWindow and InitialAdvertisedReceiveWindow values further, since 1460 * 32 = 46720 is still smaller than the current max Cardano block size?

In my change I set the initial congestion window to 42 segments. The congestion window will still change based on traffic conditions; that is, packet loss will cause it to decrease and the absence of packet loss will cause it to increase. With net.ipv4.tcp_slow_start_after_idle=0 the window will stay open, so it can reach a point where it can send a full-sized block in one burst.

Should we be encouraging other pool operators to increase their TCP window sizes???

Yes. We don’t need everyone to make these changes, but if a few SPOs on every continent made them it would make larger blocks propagate more quickly across the globe.

3 Likes

I have been running ss -n -i -t -e episodically on my BP and other relays to see what happens with the connection window sizes.

For example ss -n -i -t -e on my BP just now:

  • My relays in Australia have cwnd values of 141, 79, 68, and 74, so they all seem happy pulling an entire block in one go with no dropped packets. Although 3 of these relays only need to connect over a switch, so you wouldn’t expect dropped packets there. The other is 5 km away across the internet, and this is the one with the highest cwnd of 141.
  • My Netcup relay in the USA has a cwnd value of 32, but the network card was set to start at this value. This could mean that the BP hasn’t had to pull a larger block from it yet, or it could mean dropped packets occur when trying to increase above this value.
  • My Contabo relay in the USA has a cwnd value of 12, so there must be some dropped packets necessitating a reduction between that relay and the BP.

I will experiment with setting the initial window values on the network adapters to 64, because this should result in the largest block getting pulled in one go without waiting for ACK packets.

@karknu I am trying to figure out why you chose the value 42, because that is still less than the max block size, right? Or maybe I am calculating incorrectly? Though maybe you are a Hitchhiker’s Guide to the Galaxy fan? :slight_smile:

By the way, my block receipt times seem to have reduced when eyeballing the output from my monitoring script. However, most blocks on Cardano have been smaller recently. For example, the maximum size block my BP has produced, out of the 4 blocks so far this epoch, is only 24265 bytes.

@karknu If it is not a privacy or security concern for CF, would you mind giving me the address and port of the CF relay you mentioned you added one of my relays to as a localRoot? I will reciprocate and add that CF relay to the localRoots of all my Aussie relays. Also, if I may be so bold as to ask: would you please use relay2.terminada.io (180.150.96.245), port 8376, as a localRoot, since this relay is the most performant and closest to the BP.

Another thing I do episodically, every few days, is restart the cardano-node service on my various relays to get a more contiguous block of memory allocated for the Haskell processes. I believe the cardano-node software runs a bit slower after a few days due to the Haskell garbage collector gradually fragmenting allocated memory over time.

@karknu Do you have any thoughts about doing those restarts, given that they will obviously break existing connections with other relays and could result in those relays “preferring” other peers via the P2P mechanism?

1 Like

After increasing the starting TCP window size to 64 segments on my BP and all my relays, ss -n -i -t -e on the BP now shows cwnd values of 138, 135, 132, and 170 for all my Aussie relays.

However, the cwnd value for my Netcup relay in the USA has reduced to 44, and the cwnd value for my Contabo relay in the USA is down to only 6 at times! Presumably more packets get lost between Australia and these USA VPS providers, necessitating that these connections narrow the TCP window before an ACK is expected. Even with cwnd at 44, a full block will require an extra ACK round trip between Australia and the USA.

When I run ss -n -i -t -e on my USA relays, the vast majority of TCP connections have cwnd values of 64 or greater, but there are some as low as 3.

Thinking about all this makes me wonder about the following question:

Does the P2P mechanism take into account this TCP window size when choosing to prefer one peer over another?

If P2P preference is based upon how fast a peer offers a previously unseen block header, then it could still be faster to download the full block from another peer that has fewer dropped packets along its network route (and thus requires fewer round-trip ACKs)?

Does anyone know whether the P2P mechanism takes into account the total time to pull a full block from a peer when it prefers one peer over another? Or does it prefer peers based upon how fast they notify about new block headers, which are small?

2 Likes

Not directly.
The blockfetch miniprotocol will attempt to fetch the block from the peer which it estimates can deliver it first. This estimation is based on round trip time measurements, but doesn’t include any TCP window information. This works the same regardless of whether you run with p2p enabled or not. Practically, this means that the node will fetch the block from the first peer that presents a header for the new block: it was first, so no other peers have told us about the block yet. However, the node can attempt to fetch the latest block from multiple peers, and then the estimation mentioned previously comes into effect. The number of additional peers your node may attempt to fetch a block from is controlled by the MaxConcurrencyDeadline config value; 4 is a good value for an SPO’s relay.

P2P will rank upstream peers based on how often they were the first to present the node with a new header or a new block. About once per hour the 20% worst-performing hot peers will be replaced with a new set of random peers.

3 Likes

thanks @karknu … I’ve been waiting over 4 years for someone to explain why that parameter setting was recommended.

2 Likes