We’ve been running a well-performing stake pool for nearly a year now, but something has always bothered me that I’ve never been able to research properly: all our pool nodes spike their CPU use (also raising the load average) exactly every 72 minutes.
Since it has always happened on our core as well as our relays, I’ve concluded that it isn’t a response to intrusion. In the early days the documentation on cardano-node was pretty sketchy. It has filled out a lot since then, but not with respect to internals, memory usage, debugging and tracing settings, etc., and I’ve been waiting in vain for an explanation of our observed 72 minute cycle.
What have I been missing all this time? I’ve noticed the node’s resource usage has been stepping up a little, probably from all the scaffolding coming in for Alonzo. Our pool nodes have stayed well within performance limits during the quiet part of each 72 minute cycle, but are now pushing those limits during the several-minute spikes.
We’d just like to be well prepared for the coming weeks & hope some of the devs, dev-oriented SPOs, or people with relevant empirical observations might share their insights. Before I begin the usual analysis & performance logging, I’d like to hear others chime in, so we have an idea of what to look at first & whether this is correctable at all (or whether correcting it would even be desirable).
We’ve seen cyclical CPU spikes before. For us, they were caused by swap space on the disk being freed periodically, after which we decided to abandon swap and add more RAM. Hope this helps you.
This observed cycle comes from the ledger snapshot alone, which by default is taken every 72 minutes (4320 seconds), as set in the code by DefaultSnapshotInterval. The interval can be changed in the config.json file with SnapshotInterval, measured in seconds. (One of the more cooperative IOG devs just stepped into the thread above to tell us this.)
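For anyone who wants to try a different interval, here is a minimal Python sketch of the unit conversion involved. It only assumes what was said above: a top-level SnapshotInterval key in config.json, measured in seconds. The helper name `with_snapshot_interval` is made up for illustration, and I haven’t checked how the node treats unusual values, so try it on a copy of your config first.

```python
# Sketch only: assumes config.json takes a top-level "SnapshotInterval"
# key measured in seconds, as described in this thread.
# The helper name below is hypothetical, not part of cardano-node.

DEFAULT_SNAPSHOT_MINUTES = 72  # default interval reported in this thread


def with_snapshot_interval(config: dict, minutes: float) -> dict:
    """Return a copy of the config with SnapshotInterval set, in seconds."""
    if minutes <= 0:
        raise ValueError("snapshot interval must be positive")
    updated = dict(config)  # shallow copy, leaves the caller's dict untouched
    updated["SnapshotInterval"] = int(minutes * 60)
    return updated
```

To apply it, load your config.json with `json.load`, pass the resulting dict through the helper, and write it back with `json.dump`. Passing 72 minutes gives 4320 seconds, which matches the default cycle we all observed.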
Under some conditions even copying files, let alone writing the ledger snapshot, will use a lot of swap. I believe this is because file pages are first put in the buffer cache before being written out to disk. On thinly provisioned servers, this pushes colder cardano-node memory pages out to swap even if you have vm.swappiness set to 0, as we do on our servers.
Then, as @Vadimus said, the kernel needs to come back and herd the swapped-out pages back into memory. This is how the ledger snapshot can cause a cascade of CPU use. It’s not hard to break down once you know that the cycle of activity begins with the CPU + disk activity of generating the ledger snapshot.
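If you want to check whether swap traffic lines up with the snapshot on your own servers, something like the following Python sketch could help. It reads the Linux `pswpin` and `pswpout` counters from /proc/vmstat (pages swapped in and out since boot) and reports how much they moved between samples. The function names and the 60 second default are my own choices, not anything from cardano-node.

```python
import time

SWAP_KEYS = ("pswpin", "pswpout")  # pages swapped in / out since boot


def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before: dict, after: dict) -> dict:
    """Pages swapped in and out between two samples."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in SWAP_KEYS}


def read_vmstat(path: str = "/proc/vmstat") -> dict:
    with open(path) as f:
        return parse_vmstat(f.read())


def monitor(samples: int, interval_s: float = 60.0) -> None:
    """Print a timestamped swap delta for each sampling interval."""
    prev = read_vmstat()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_vmstat()
        print(time.strftime("%H:%M:%S"), swap_delta(prev, cur))
        prev = cur
```

Calling `monitor(150)` would cover a bit more than two 72 minute cycles at one sample per minute. If the deltas jump at the same moments as the CPU spikes, that would be consistent with the swap explanation above, though it wouldn’t prove it on its own.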