Pool nodes spike CPU every 72 minutes: normal? common? correctable?

COSDpool · 20 July 2021 18:00

We’ve been running a well-performing stake pool for nearly a year now, but something has always bothered me that I have never been able to research properly: all our pool nodes spike their CPU use (also raising the load average) exactly every 72 minutes.

Since it’s always happened on our core as well as our relays, I’ve understood that it isn’t a response to intrusion. In the early days the documentation on cardano-node was pretty sketchy, and it’s filled out a lot since then: but not with respect to internals, memory usage, debugging and tracing settings, etc… and I’ve been waiting in vain for an explanation of our observed 72 minute cycle.

What have I been missing all this time? I’ve noticed the node resource usage has been stepping up a little bit, probably from all the scaffolding coming in for Alonzo… our pool nodes have been well within performance limits during the quiet part of each 72 minute cycle, but are now pushing those limits during the several-minute spikes.

We’d just like to be well prepared for the coming weeks & hope some of the devs, dev-oriented SPOs, or people with relevant empirical observations might share their insights. Before I begin the usual analysis & performance logging, I would just like to hear some others chime in, so we have an idea of what to look at first & whether this is correctable (or even undesirable).

Vadimus · 21 July 2021 15:09

We’ve seen cyclical CPU spikes before. For us, this was due to the cyclical freeing of swap space on the disk, after which we decided to abandon swap and increase the RAM. Hope this helps you.

COSDpool · 26 July 2021 18:52

thanks @Vadimus … after some preliminary investigation & testing I’ve posted this official request:

github.com/input-output-hk/cardano-node

[BUG] - 72 minute CPU spike cycle: reason, and way to spread it out?

opened 06:45PM - 26 Jul 21 UTC

rphair

bug

**Internal/External** - *External*

**Area** - *Other*

**Summary**

I assume this performance anomaly is well known and perhaps even by design. Therefore this is not a "bug" report but a "help" request asking:

1. what is the cause of a 72-minute cycle of "spiking" CPU use we & others have seen on servers running `cardano-node` on which real memory might be considered scarce;
2. whether any configuration or compilation settings can be changed in `cardano-node` to spread this "spike" of kernel CPU (apparently from virtual memory processing) out over a longer period, instead of overwhelming the CPU for 1 or 2 minutes out of every 72 minutes.

**Steps to reproduce**

Our BP and both relays have entirely default configurations (we have only increased `MaxConcurrencyDeadline` and turned off `TraceMempool`) on unremarkable VPS hosting, each with dual CPUs and 4 GB RAM. These have always had spikes in CPU load every 72 minutes, which for a peak minute or so of each 72 minute cycle delays the responsiveness of the affected node enough to show connection timeout errors in the `cardano-node` log files on our other stake pool nodes.

The period of CPU spiking never deviates from exactly 72 minutes, although the problem does not appear in the few hours after rebooting while the amount of RAM + the amount of swap used on the system approaches a total of 8 GB... after which the 72 minute spike cycle begins:

![Screenshot from 2021-07-26 23-34-02](https://user-images.githubusercontent.com/2572798/127039921-932ef683-308b-4c9b-8ea8-8717bc5f0b2a.png)

**Expected behaviour** and attempted workarounds

An integration of the curves above would show we're using, in total, at most 5% of the CPU. From this point of view the servers are UNDERutilised and this is not a capacity deficiency but rather an efficiency problem that we hope is correctable.

Based on looking through Ubuntu, Linux kernel, UNIX VM, and our cloud hosting provider's documentation, I cannot find anything with a period of 72 minutes, so I have to assume this period is established by `cardano-node`, at least in our operating environment. Rebooting the server reduces the number of `cardano-node` pages that are apparently also being kept in swap space.

In trying to find SPO empirical observations & solutions to the problem I posted this, with the only response suggesting it was an inefficient use of swap space which needed to be cleaned up every so often: https://forum.cardano.org/t/pool-nodes-spike-cpu-every-72-minutes-normal-common-correctable/67230

In accordance with this parameter set by the IOHK Nix CI environment we ALSO tried setting `vm.swappiness` to `0` to eliminate that redundancy, which had very little effect: https://github.com/input-output-hk/ci-ops/blob/master/modules/common.nix

**Additional context**

As it says in our help request part (2), we would like to wait for a developer response before we "throw money at the problem" by migrating to server configurations with either double or quadruple the amount of RAM. In fact, without an answer to part (1), we'd have no assurance this problem would not occur on `cardano-node` servers of any RAM size.

Therefore I would like to hear some tentative answers to (1) before simply being told that our stake pool RAM allocation is lower than others, or lower than any particular recommendation. We are expecting to increase the capacity of our servers regardless, but we would prefer to go about this as engineers would.

**System info**

```
Ubuntu 20.04 - 2 vCPU - 4GB RAM (host: DigitalOcean)
cardano-node 1.27.0 - linux-x86_64 - ghc-8.10 - git rev 8fe46140a52810b6ca456be01d652ca08fe730bf
```

COSDpool · 30 August 2021 13:17

This observed cycle is from the ledger snapshot alone, currently taken by default at a 72 minute interval, set in the code by DefaultSnapshotInterval. The interval is adjustable in the config.json file with SnapshotInterval, measured in seconds. (One of the more cooperative IOG devs stepped in just now to the thread above to tell us this.)
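For anyone who wants to experiment with the setting, here is a minimal Python sketch of changing it. It assumes the node configuration is a plain JSON object and that SnapshotInterval sits at the top level, in seconds, as described above. The helper names `with_snapshot_interval` and `update_config_file` and any file path you pass in are hypothetical, and the node presumably has to be restarted to pick up the change.

```python
import json
from pathlib import Path

# 72 minutes expressed in seconds, matching the default cycle described above.
DEFAULT_INTERVAL_SECONDS = 72 * 60


def with_snapshot_interval(config, seconds):
    """Return a copy of a config dict with SnapshotInterval set (in seconds)."""
    seconds = int(seconds)
    if seconds <= 0:
        raise ValueError("snapshot interval must be a positive number of seconds")
    updated = dict(config)  # shallow copy, leaves the caller's dict untouched
    updated["SnapshotInterval"] = seconds
    return updated


def update_config_file(config_path, seconds):
    """Rewrite a JSON config file in place; assumes the top level is a JSON object."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(with_snapshot_interval(config, seconds), indent=2) + "\n")
```

For example, `update_config_file("config.json", 2 * DEFAULT_INTERVAL_SECONDS)` would ask for a snapshot every 144 minutes. Keep a backup of the original file, since rewriting it this way does not preserve its original formatting.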

Under some conditions even copying files, let alone writing the ledger snapshot, will use a lot of swap… I believe this is because file pages are first put in the buffer cache before writing them out to disk. On thinly provisioned servers, this pushes colder cardano-node memory pages out to swap even if you have vm.swappiness set to 0 like we have on our servers.

Then as @Vadimus said the kernel needs to come back and herd the swapped out pages back into memory. This is how the ledger snapshot can cause a cascade of CPU use. It’s not hard to break it down once you know that the cycle of activity begins with the CPU+disk activity of generating the ledger snapshot.
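To check whether your own spikes line up with swap traffic, something like the following Python sketch could be left running across a couple of cycles. It assumes a Linux host, where `/proc/vmstat` exposes the cumulative `pswpin` and `pswpout` page counters. The function names are hypothetical, and this is only one way to look at it (tools such as `vmstat` report the same counters).

```python
import time

VMSTAT_PATH = "/proc/vmstat"  # Linux only


def parse_swap_counters(vmstat_text):
    """Extract cumulative pages swapped in/out since boot from /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in/out between two samples."""
    return {key: after[key] - before[key] for key in before}


def watch_swap(interval_seconds=60, samples=5):
    """Print the swap traffic seen during each sampling interval."""
    with open(VMSTAT_PATH) as f:
        previous = parse_swap_counters(f.read())
    for _ in range(samples):
        time.sleep(interval_seconds)
        with open(VMSTAT_PATH) as f:
            current = parse_swap_counters(f.read())
        print(time.strftime("%H:%M:%S"), swap_delta(previous, current))
        previous = current
```

If the explanation above applies to your server, a burst of `pswpout` should appear when the snapshot is written, followed by `pswpin` as the node's pages are brought back into memory.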

RickCADA · 8 June 2022 06:55

Would one be able to change this interval, i.e. shorten the time in an attempt to reduce the size and impact on resources?

weebl2000 · 9 June 2022 15:07

Yes, check out this thread:

stakeside · 17 September 2022 14:37

Will it make a difference if I make the interval shorter?

weebl2000 · 1 October 2022 10:58

It’s not really useful to set the interval even shorter.
