Proof of concept: compressed chain database

The Cardano Mainnet database is constantly growing. A year ago the storage space requirement was 70GB; today we are at 125GB.


Source: Cardano Blockchain Insights

To find out which compression is easily achievable with reasonable effort (for your servers), and what is maximally possible, I worked out and tested different solutions.

Important: Do NOT try to implement the solution described here just because you want to quickly solve a space problem on your servers. This is a proof of concept and not directly integrated into the Cardano node, i.e. you really need to understand what you are doing to keep your node operational in the medium and long term.

First I looked at filesystem-based compression and deduplication, for example with ZFS. This achieves first improvements, but - especially deduplication - comes with considerable additional memory requirements.
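For reference, a filesystem-level setup of that kind is basically a one-liner on ZFS (a sketch; the dataset name tank/cardano-db is just an example, and dedup in particular should only be enabled if you can spare the RAM for its dedup table):

# transparent zstd compression on the dataset holding the node db
zfs set compression=zstd tank/cardano-db
# deduplication is possible too, but its dedup table lives in memory (ARC)
zfs set dedup=on tank/cardano-db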

Then I looked at solutions that work with tar archive files on top of the actual filesystem, such as ratarmount (https://github.com/mxmlnkn/ratarmount).

In short - for now as an initial description - ratarmount mounts a virtual folder in which it offers the content of one or more (read-only) tar archives, combined with a fully read&write enabled but uncompressed write-overlay folder.

Ratarmount offers a bunch of different options and techniques. I achieved the best results by using pixz, a parallel and indexing (!) variant of xz.
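pixz is available in the usual distribution repositories, e.g. on Debian/Ubuntu:

sudo apt install pixz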

The compression is applied to most of the files in the node's db/immutable folder.

For this I created a subfolder in the db folder

db/ratar

and therein 3 more subfolders (see the one-liner below):

db/ratar/archive
db/ratar/indexes
db/ratar/writeoverlay
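A one-liner for this, assuming db is the node's database directory:

mkdir -p db/ratar/archive db/ratar/indexes db/ratar/writeoverlay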

Then, as a one-time preparation step, I compressed the numbered files in the existing immutable folder.

Note:

  1. cd into the existing immutable folder first, in order to create the tar archives without subfolders.
  2. I decided to create tar files in groups by the first two digits of the immutable file names. Theoretically it's also possible to create just one large archive.
cd db/immutable
# note: after the cd, the archive folder is reachable via ../ratar/archive
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/00.tar.xz 00*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/01.tar.xz 01*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/02.tar.xz 02*.*
tar -c --use-compress-program='pixz -9 -p 6' --file=../ratar/archive/03.tar.xz 03*.*

Then delete all the archived files and move the remaining files (at the current chain state, all 04*.* files) to the db/ratar/writeoverlay folder, as sketched below.
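A sketch of that cleanup step, using the same relative paths as above (double-check which immutable files your node has already produced before deleting anything):

cd db/immutable
# remove the files that are now contained in the tar archives
rm 00*.* 01*.* 02*.* 03*.*
# move the not-yet-archived files into the write overlay
mv 04*.* ../ratar/writeoverlay/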

The 13290 files from 00000.* to 03999.* have a total size of 101GB.
The compressed tar archives shrink this down to 23.4GB.

Now delete the empty immutable folder; ratarmount will create a virtual directory in its place.
ratarmount is also configured to read from and write to db/ratar/writeoverlay in parallel to the static read-only tar archives.

Now ratarmount needs to be started (ideally as a service, and as a dependency for the node service).
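A minimal sketch of such a service unit, under these assumptions: the db lives in /opt/cardano/db, the unit name, user and binary paths are placeholders, and the flags (-f, --write-overlay, --index-folders) should be checked against ratarmount --help for the installed version:

# /etc/systemd/system/ratarmount-immutable.service (example name)
[Unit]
Description=ratarmount union mount for the Cardano immutable db
Before=cardano-node.service

[Service]
Type=simple
User=cardano
# -f keeps ratarmount in the foreground so systemd can supervise it;
# the archives are listed explicitly because systemd does not expand globs
ExecStart=/usr/local/bin/ratarmount -f \
    --write-overlay /opt/cardano/db/ratar/writeoverlay \
    --index-folders /opt/cardano/db/ratar/indexes \
    /opt/cardano/db/ratar/archive/00.tar.xz \
    /opt/cardano/db/ratar/archive/01.tar.xz \
    /opt/cardano/db/ratar/archive/02.tar.xz \
    /opt/cardano/db/ratar/archive/03.tar.xz \
    /opt/cardano/db/immutable

[Install]
WantedBy=multi-user.target

The cardano-node unit would then reference it via Requires=ratarmount-immutable.service and After=ratarmount-immutable.service, so the mount is guaranteed to be up before the node starts.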

On its first startup it recognises the new archives and will index them, which takes a couple of minutes. Subsequent startups take less than a second. The index files allow ratarmount to quickly and directly access the bytes in the large tar files.

Now the node can be started and will see db/immutable with all the required files. The node can also generate new immutable files out of its (untouched) volatile folder; ratarmount will store them in the writeoverlay folder.

This means that over time the amount of new and uncompressed files will grow. The node needs a maintenance window and your personal attention from time to time to move a bunch of new uncompressed immutable files (e.g. all 04*.*) into 04.tar.xz, roughly as sketched below.
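A sketch of such a maintenance run, assuming the paths from above and that both the node and ratarmount are stopped while the files are repacked (ratarmount indexes the new archive on its next start):

cd db/ratar/writeoverlay
tar -c --use-compress-program='pixz -9 -p 6' --file=../archive/04.tar.xz 04*.*
# once the new archive has been verified, the uncompressed copies can go
rm 04*.*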

So what is missing?
I have been running this setup for 8 months now on two mainnet relays of the CLIO1 pool and had zero issues with it. No real measurable impact on performance (full NVMe storage), just a ~75% and growing ratio of saved disk space. And - that's an important cost - ratarmount requires an additional ~6GB of memory.

The db/ratar/archive folder content (the tar archives) can even be used by multiple node instances, but at a certain point this would definitely have negative and visible effects on node time budgets and operations. So it is more or less not recommended, depending on your HW resources.

Last but not least, again: please don’t try this at home! :wink:


Nice analysis and write-up. At the moment it feels like we are still comfortably in the storage range where SSDs are practical and cheap enough not to worry. Good quality SSDs like the 970 EVO are reasonably priced up to 2 TB.

Currently storage growth is quite linear going back to epoch 208, with some variance but nothing major. We should be good for another 3-5 years before hitting the 1TB range, and by then SSDs will have doubled in size. Indeed, as long as SSDs keep pace with storage growth we are fine.

The wild card is whether chain performance is pushed a lot higher on L1 in that time. If so, the costs of operating a node may start increasing quickly, and this and network bandwidth will start to become a factor in the ability to push wider decentralisation. This speaks to the fact that L2s will likely be run by a smaller set of higher-performance nodes that are additionally funded by L2 usage.

Without significant increases in transaction bytes per second on L1, chain growth should stay pretty manageable.


Hey, thanks for this one. This was quite interesting to read.
Given the rapid growth of the db size, we’re definitely going to need some effective compression strategies soon.

I'm looking forward to trying out your approach on one of our backup relays. Plus, some user-friendly scripts would be super helpful so others can implement this as well.

Thanks for sharing this idea! :blush:


It is great, but we should also voice our concerns about the block size limit. It is too big for a public resource like a blockchain. It was unilaterally increased to support the bull market demand and now we are stuck there. We still haven't created a market for fees like in Bitcoin, which is a better way to price blockspace. With each era, transactions take even more space, which is not appropriate for every network participant that runs a node. A bloated blockchain leads to centralization. And nobody is incentivized to move to higher layers. Why Hydra if block space is cheap?

Could you overlay the compressed size onto the graph? You say it's about 25% of the uncompressed size?

group   files   size GB     compressed GB
00*.*    3000       6.2               3.0
01*.*    3000       8.8               3.7
02*.*    3000      47.5               9.9
03*.*    3000      38.8               6.8
------------------------------------------
        12000     101.1              23.4

I just run btrfs with zstd compression enabled for the immutable directory:

cardano/db# compsize immutable/
Processed 13452 files, 1001534 regular extents (1001534 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       51%       62G         120G         120G       
none       100%      706M         706M         678M       
zstd        51%       61G         120G         120G    

I was just trying to achieve the same, however… it looks like it doesn’t want to compress the db at all.

Did you use the force compress option by any chance?

I have a cron job scheduled that periodically compresses them:

#!/bin/bash
# recompress/defragment the immutable files with zstd at idle I/O priority
nice ionice -c3 btrfs fi defrag -czstd -vr /path/to/my/cardano/db/immutable/
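For scheduling, a crontab line along these lines would do (the script path /usr/local/bin/compress-immutable.sh is just a placeholder for wherever the script above lives, and the weekly schedule is only an example):

# run the recompression every Sunday at 03:00
0 3 * * 0 /usr/local/bin/compress-immutable.sh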

Thanks.

There seems to be an autodefrag option… have you tried that one?

I’ve mounted the FS with compress-force=zstd:3 and by the looks of it, new files seem to be compressed now. :slight_smile:
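For reference, such a mount could look like this in /etc/fstab (device and mount point are just placeholders; compression level 3 as mentioned above):

# btrfs volume holding the Cardano db, with forced zstd compression
/dev/sdb1  /opt/cardano/db  btrfs  compress-force=zstd:3,noatime  0  0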


I have played around with autodefrag, but only for spinning disks. I never really measured the performance difference. :joy:

I avoid compress-force because this machine is pretty low on CPU power.

Yeah, as long as the performance is OK it's fine; for us, disk space is scarcer than CPU.

For now I’ll run only one of our relays with this compression turned on… let’s see how it develops :).

Now that the DB has fully synced, I really like the results:

Yeah, it's really nice. I'm even running two relays (one testnet, one mainnet) on a Raspberry Pi 4 with 8GB RAM. Large GCs take a bit long, but other than that it runs fine.



I’ve overclocked it a bit and it’s in a really nice case that provides ample cooling.


Just to have it mentioned somewhere, and to repeat it again: this is just a proof of concept to see the effects and potential ranges of disk-related resources.

Here's a comparison of disk read counts and read bytes.


The setup:
A remote peer is syncing up from genesis in two test runs,
first from a node with a ratarmounted immutable folder,
then from a standard node setup.

The CPU load is around +1% on the ratarmounted node,
while the read counts and bytes are significantly lower.

However, this is the rare case where a remote node requests the whole chain history. In typical current block propagation mode the disk reads will not be affected at all.

It probably becomes more of a topic when Ouroboros Genesis is out and the stake pools also have to serve the bootstrapping of new nodes, which they currently don't (and many don't even expect that to become a future demand and load).