Proof of concept: compressed chain database

The Cardano Mainnet database is constantly growing. A year ago the storage space requirement was 70GB; today we are at 125GB.


Source: Cardano Blockchain Insights

To find out which compression is easily achievable with reasonable effort (for your servers), and what is maximally possible, I worked out and tested different solutions.

Important: Do NOT try to implement the solution described here just because you want to quickly solve a space problem on your servers. This is a proof of concept and not directly integrated into the Cardano node, i.e. you really need to understand what you are doing to keep your node operational in the medium and long term.

First I looked at filesystem-based compression and deduplication, for example with ZFS. This achieves first improvements, but - especially deduplication - comes with considerable additional memory requirements.
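For reference, a filesystem-level setup of that kind is basically a one-liner on ZFS (a sketch; the dataset name tank/cardano-db is just an example, and dedup in particular should only be enabled if you can spare the RAM for its dedup table):

# transparent zstd compression on the dataset holding the node db
zfs set compression=zstd tank/cardano-db
# deduplication is possible too, but its dedup table lives in memory (ARC)
zfs set dedup=on tank/cardano-db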

Then I looked at solutions that work with tar archive files on top of the actual filesystem, such as ratarmount (https://github.com/mxmlnkn/ratarmount).

In short - for now as an initial description - ratarmount mounts a virtual folder in which it offers the content of one or more (read-only) tar archives, combined with a fully read&write enabled but uncompressed write-overlay folder.

Ratarmount offers a bunch of different options and techniques. I achieved the best results by using pixz, a parallel and indexing (!) variant of xz.
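pixz is available in the usual distribution repositories, e.g. on Debian/Ubuntu:

sudo apt install pixz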

The compression is applied to most of the files in the node's db/immutable folder.

For this I created a subfolder in the db folder

db/ratar

and therein 3 more subfolders (see the one-liner below):

db/ratar/archive
db/ratar/indexes
db/ratar/writeoverlay
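A one-liner for this, assuming db is the node's database directory:

mkdir -p db/ratar/archive db/ratar/indexes db/ratar/writeoverlay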

Then, as a one-time preparation step, I compressed the numbered files in the existing immutable folder.

Note:

  1. cd into the existing immutable folder first, in order to create the tar archives without subfolders.
  2. I decided to create tar files in groups by the first two digits of the immutable file names. Theoretically it's also possible to create just one large archive.
cd db/immutable
# note: after the cd, the archive folder is reachable via ../ratar/archive
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/00.tar.xz 00*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/01.tar.xz 01*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/02.tar.xz 02*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/03.tar.xz 03*.*

Then delete all the archived files and move the remaining files (at the current chain state, all 04*.* files) to the db/ratar/writeoverlay folder, as sketched below.
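A sketch of that cleanup step, using the same relative paths as above (double-check which immutable files your node has already produced before deleting anything):

cd db/immutable
# remove the files that are now contained in the tar archives
rm 00*.* 01*.* 02*.* 03*.*
# move the not-yet-archived files into the write overlay
mv 04*.* ../ratar/writeoverlay/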

The 13290 files from 00000.* to 03999.* have a total size of 101GB.
The compressed tar archives shrink this down to 23.4GB.

Now delete the empty immutable folder; ratarmount will create a virtual directory in its place.
ratarmount is also configured to read from and write to db/ratar/writeoverlay in parallel to the static read-only tar archives.

Now ratarmount needs to be started (ideally as a service, and as a dependency for the node service).
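A minimal sketch of such a service unit, under these assumptions: the db lives in /opt/cardano/db, the unit name, user and binary paths are placeholders, and the flags (-f, --write-overlay, --index-folders) should be checked against ratarmount --help for the installed version:

# /etc/systemd/system/ratarmount-immutable.service (example name)
[Unit]
Description=ratarmount union mount for the Cardano immutable db
Before=cardano-node.service

[Service]
Type=simple
User=cardano
# -f keeps ratarmount in the foreground so systemd can supervise it;
# the archives are listed explicitly because systemd does not expand globs
ExecStart=/usr/local/bin/ratarmount -f \
    --write-overlay /opt/cardano/db/ratar/writeoverlay \
    --index-folders /opt/cardano/db/ratar/indexes \
    /opt/cardano/db/ratar/archive/00.tar.xz \
    /opt/cardano/db/ratar/archive/01.tar.xz \
    /opt/cardano/db/ratar/archive/02.tar.xz \
    /opt/cardano/db/ratar/archive/03.tar.xz \
    /opt/cardano/db/immutable

[Install]
WantedBy=multi-user.target

The cardano-node unit would then reference it via Requires=ratarmount-immutable.service and After=ratarmount-immutable.service, so the mount is guaranteed to be up before the node starts.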

On its first startup it recognises the new archives and will index them, which takes a couple of minutes. Subsequent startups take less than a second. The index files allow ratarmount to quickly and directly access the bytes in the large tar files.

Now the node can be started and will see db/immutable with all the required files. The node can also generate new immutable files out of its (untouched) volatile folder; ratarmount will store them in the writeoverlay folder.

This means that over time the amount of new and uncompressed files will grow. The node needs a maintenance window and your personal attention from time to time to move a bunch of new uncompressed immutable files (e.g. all 04*.*) into 04.tar.xz, roughly as sketched below.
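A sketch of such a maintenance run, assuming the paths from above and that both the node and ratarmount are stopped while the files are repacked (ratarmount indexes the new archive on its next start):

cd db/ratar/writeoverlay
tar -c --use-compress-program='pixz -9 -p 6' --file=../archive/04.tar.xz 04*.*
# once the new archive has been verified, the uncompressed copies can go
rm 04*.*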

So what is missing?
I have been running this setup for 8 months now on two mainnet relays of the CLIO1 pool and had zero issues with it. No real measurable impact on performance (full NVMe storage), just a ~75% and growing ratio of saved disk space. And - that's an important cost - ratarmount requires an additional ~6GB of memory.

The db/ratar/archive folder content (the tar archives) can even be used by multiple node instances, but at a certain point this would definitely have negative and visible effects on node time budgets and operations. So it is more or less not recommended, depending on your HW resources.

Last but not least, again: please don’t try this at home! :wink:


Nice analysis and write-up. At the moment it feels like we are still comfortably in the storage range where SSDs are practical and cheap enough not to worry. Good quality SSDs like the 970 EVO are reasonably priced up to 2 TB.

Currently storage growth is quite linear going back to epoch 208, with some variance but nothing major. We should be good for another 3-5 years before hitting the 1TB range, and by then SSDs will have doubled in size. Indeed, as long as SSDs keep pace with storage growth we are fine.

The wild card is whether chain performance is pushed a lot higher on L1 in that time. If so, the costs of operating a node may start increasing quickly, and this and network bandwidth will start to become a factor in the ability to push wider decentralisation. This speaks to the fact that L2s will likely be run by a smaller set of higher-performance nodes that are additionally funded by L2 usage.

Without significant increases in transaction bytes per second on L1, chain growth should stay pretty manageable.


Hey, thanks for this one. This was quite interesting to read.
Given the rapid growth of the db size, we’re definitely going to need some effective compression strategies soon.

I'm looking forward to trying out your approach on one of our backup relays. Plus, some user-friendly scripts would be super helpful so others can implement this as well.

Thanks for sharing this idea! :blush:


It is great, but we should also voice our concerns about the block size limit. It is too big for a public resource like a blockchain. It was unilaterally increased to support the bull market demand and now we are stuck there. We still haven't created a market for fees like in Bitcoin, which is a better way to price blockspace. With each era, transactions take even more space, which is not appropriate for every network participant that runs a node. A bloated blockchain leads to centralization. And nobody is incentivized to move to higher layers. Why Hydra if block space is cheap?

Could you overlay the compressed size onto the graph? You say it's about 25% of the uncompressed size?

group   files   size GB     compressed GB
00*.*    3000       6.2               3.0
01*.*    3000       8.8               3.7
02*.*    3000      47.5               9.9
03*.*    3000      38.8               6.8
------------------------------------------
        12000     101.1              23.4

I just run btrfs with zstd compression enabled for the immutable directory:

cardano/db# compsize immutable/
Processed 13452 files, 1001534 regular extents (1001534 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       51%       62G         120G         120G       
none       100%      706M         706M         678M       
zstd        51%       61G         120G         120G    

I was just trying to achieve the same, however… it looks like it doesn’t want to compress the db at all.

Did you use the force compress option by any chance?

I have a cron job scheduled that periodically compresses them:

#!/bin/bash
# recompress/defragment the immutable files with zstd at idle I/O priority
nice ionice -c3 btrfs fi defrag -czstd -vr /path/to/my/cardano/db/immutable/
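For scheduling, a crontab line along these lines would do (the script path /usr/local/bin/compress-immutable.sh is just a placeholder for wherever the script above lives, and the weekly schedule is only an example):

# run the recompression every Sunday at 03:00
0 3 * * 0 /usr/local/bin/compress-immutable.sh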

Thanks.

There seems to be an autodefrag option… have you tried that one?

I’ve mounted the FS with compress-force=zstd:3 and by the looks of it, new files seem to be compressed now. :slight_smile:
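For reference, such a mount could look like this in /etc/fstab (device and mount point are just placeholders; compression level 3 as mentioned above):

# btrfs volume holding the Cardano db, with forced zstd compression
/dev/sdb1  /opt/cardano/db  btrfs  compress-force=zstd:3,noatime  0  0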


I have played around with autodefrag, but only for spinning disks. I never really measured the performance difference. :joy:

I avoid compress-force because this machine is pretty low on CPU power.

Yeah, as long as the performance is OK it's fine; for us, disk space is scarcer than CPU.

For now I’ll run only one of our relays with this compression turned on… let’s see how it develops :).

Now that the DB has fully synced, I really like the results:

Yeah, it's really nice. I'm even running two relays (one testnet, one mainnet) on a Raspberry Pi 4 with 8GB RAM. Large GCs take a bit long, but other than that it runs fine.



I’ve overclocked it a bit and it’s in a really nice case that provides ample cooling.


Just to have it mentioned somewhere, and to repeat it again: this is just a proof of concept to see the effects and potential ranges of disk-related resources.

Here's a comparison of disk read counts and read bytes.


The setup:
A remote peer is syncing up from genesis in two test runs,
first from a node with a ratarmounted immutable folder,
then from a standard node setup.

The CPU load is around +1% on the ratarmounted node,
while the read counts and bytes are significantly lower.

However, this is the rare case where a remote node requests the whole chain history. In typical current block propagation mode the disk reads will not be affected at all.

It probably becomes more of a topic when Ouroboros Genesis is out and the stake pools also have to serve the bootstrapping of new nodes, which they currently don't (and many don't even expect that to become a future demand and load).