Upgraded to 1.29.0 - High memory usage

Hi fellow SPOs,

Upgraded all nodes to 1.29.0 yesterday. After approximately 3 hours of running, I finally got the much-dreaded alert from my monitoring telling me that the relay nodes are running out of memory (around 1.5% left). The nodes were able to run on the previous version (1.27.0) without burning through as much.

These are the configs:

  • Producer and relay nodes run on 8 GB of RAM
  • “TraceMempool” is set to false on the relay nodes (quick config sketch below)
  • Relay node memory consumption sits around 86%-90%, while the producer node is around 81%-85%. The difference is likely because the relay has more peers to connect to.
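For anyone who wants to do the same, the mempool tracer is just a key in the node's config JSON (mainnet-config.json in my case; adjust the path to your own setup). A quick way to flip it with jq:

```bash
# Turn off mempool tracing in the node config (file name/path is just my
# setup; back the file up first). Restart the node afterwards for it to
# take effect.
jq '.TraceMempool = false' mainnet-config.json > mainnet-config.json.tmp \
  && mv mainnet-config.json.tmp mainnet-config.json
```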

With the Alonzo upgrade around the corner, I fear that 8 GB of RAM is not going to be sufficient. So, fellow SPOs, what are some of the memory optimization tips that you have used?

I’m aware of the following discussions:

Or do we resign ourselves to the fact that 16 GB is the new minimum…
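The only node-side knob I've come across so far is GHC runtime (RTS) tuning, since cardano-node is a GHC binary and accepts +RTS … -RTS options on the command line. I haven't benchmarked this properly, so treat the flags below as an experiment rather than a recommendation, and try it on a relay before touching the producer:

```bash
# Sketch only: a run command with RTS options appended. -N uses all cores;
# -c switches the oldest GC generation to a compacting collector, which
# tends to lower resident memory at some CPU cost.
cardano-node run \
  --topology mainnet-topology.json \
  --config mainnet-config.json \
  --database-path db \
  --socket-path db/node.socket \
  --host-addr 0.0.0.0 \
  --port 3001 \
  +RTS -N -c -RTS
```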


Thank you for the early feedback. I was just pondering whether to conduct a test!

Although we’re ready, with > 150 GB of spare RAM capacity for our bare-metal VMs, I am very curious how many nodes and SPOs currently hosted by the major cloud providers won’t be able to make the cut due to the higher running costs, and how many RPi SPOs might not be able to upgrade their hardware.

If anyone can provide updated links for that sort of info, I would greatly appreciate it. :slight_smile:

Edit: My purely speculative thought is that by the end of Basho, the network will be a LOT closer to that 500-node target than it is today. I am thinking around half of what there is today.

I have two relays running 1.29.0 that I set up this morning. One is using 10.8/24 GB of memory, the other is using 18.7/24 GB. The core node is using 18.9/32 GB. My Xen server has 256 GB of RAM, so I have plenty. I couldn’t afford to do this in the cloud. Comcast Business class internet service is solid, so I can run my own at home. Note that all three servers are also using ZFS to replicate/back up, which could be a big chunk of that memory. bpytop reports each node is using 7-8 GB of memory.

Sidenote, I keep forgetting this: I moved my core node to a Dell PC with 32 GB. It was the only way I could get it to stop racking up missed slots. Cardano is the only thing on that particular server. I also haven’t yet tried any memory optimizations.
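For anyone wondering how I keep an eye on the missed-slot counter, I just scrape it off the node's Prometheus endpoint (default is 127.0.0.1:12798, set via "hasPrometheus" in the config; the exact metric name may vary a bit between node versions):

```bash
# Quick check of the missed-slot metric on the producer (endpoint and metric
# name depend on your config / node version):
curl -s http://127.0.0.1:12798/metrics | grep -i slotsmissed
```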

Nice, your setup is definitely future-proof. Are you running this at home?

And yes, I’m afraid that you’re right. SPOs in the cloud will be eliminated.

Nice. Are you running these at home?

Our ticker is 001. You can read all about it in the blog on our website. :slight_smile:

Mouahahaha, probably bad pools in the cloud, yes. But there are still plenty of proper ways to do it, guys.
Bring it on, 16 GB is cheesecake! aarch64 architecture FTW :wink:
All jokes apart, what I say is still very true: double the cloud cost and it is still not so expensive.
Bare metal operators have to

  • pay for internet (which can fail easily; most probably a single connection from a single provider)
  • pay for electricity (same here; you might, and probably will, have an outage one day)
  • and finally pay for the cherished single bare-metal server (definitely not HA hardware-wise, and not the only SPOF, as seen above).

Anyway, that’s not the subject, but trust me, a properly done pool (on AWS in my case) is definitely not expensive to run and so much more professional than a home server.
I’m saying all this as a single, not rich, pool that refuses to do dumb and repetitive marketing on social media.

If it really chews more than 8 GB, the ones that should worry are the Raspberry Pi nodes, IMHO…


Hello, I wouldn’t say “the new minimum is 16 GB”; memory doesn’t have to be a multiple of 2 :slight_smile: I’d say that 10 GB should be the new minimum.

Having said that, I don’t think bare-metal servers are cheaper at all; quite the opposite. We run a bare-metal HA cluster (3 servers), housed in a professional datacenter with redundant internet and power. It’s far more expensive than a cloud provider, but it brings some advantages, such as helping to decentralize Cardano even more: in a catastrophic scenario where a whole cloud provider fails (hello OVH) or where a provider decides to ban a whole technology (hello Parler), dedicated bare-metal servers would prevail.

BTW speaking of memory, we upgraded to 1.29 even on mainnet and memory usage seems identical to what it was before:

relays: ~8 GB used
producer node: ~6 GB used

But it might increase with the Alonzo fork. Keep in mind that relay memory usage increases drastically with too many nodes configured in the topology. AFAIK the recommendation is about 20 nodes in the topology; we have 25.
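For reference, the topology file is just the usual "Producers" list; the example below uses placeholder hosts, and the jq line is what I use to sanity-check how many outgoing connections are actually configured:

```bash
# Example shape of a relay topology (placeholder addresses, standard
# "Producers" format); fewer entries generally means less memory on the relay.
cat > topology-example.json <<'EOF'
{
  "Producers": [
    { "addr": "relay1.example.com", "port": 3001, "valency": 1 },
    { "addr": "relay2.example.com", "port": 3001, "valency": 1 }
  ]
}
EOF

# Count how many producers your live topology currently has:
jq '.Producers | length' mainnet-topology.json
```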

Out of curiosity, what is the memory breakdown on a 1.29.0 node?
I mean used / cache / buffer / free? Just a rough idea, it doesn’t have to be precise.
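Nothing fancy, just something along the lines of the standard Linux view plus the node's own resident set:

```bash
# Host-level breakdown (used / buff-cache / free):
free -h
# Resident memory of the cardano-node process itself (ps reports RSS in KB):
ps -o rss=,comm= -C cardano-node | awk '{printf "%.1f GB  %s\n", $1/1048576, $2}'
```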

From the kernel’s point of view? I don’t think that’s very useful information, as most of the cardano-node process memory is resident (active/inactive), but here you are: the red line is the upgrade from 1.27 to 1.29.

Thanks mate, yeah, not much of a change from what I can see.
I’m starting a test PoC on 1.29.0 today, will let you know if I find anything interesting.

Yes, I confirm that 1.29.0 has much higher memory consumption.
The core runs at 9 GB.
The relays run at around 10 GB.

Also, the number of missed slots is way higher during the epoch changeover.

That’s interesting, did you try restarting cardano-node again? I remember it sometimes “starts weird” and consumes way more memory than usual; a restart typically brings it back down to normal levels. Also, changes to the topology on a relay may have an effect on memory usage (especially if you increased the number of active outgoing connections; even replacing dead nodes with live ones may result in a memory increase).
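In case it helps, on a systemd-managed setup a restart is just the usual (assuming the unit is actually called cardano-node on your box):

```bash
# Restart the node and give it time to settle before judging memory usage:
sudo systemctl restart cardano-node
# A few minutes later, check resident memory and uptime (RSS is in KB):
ps -o rss=,etime=,comm= -C cardano-node
```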

Here is an updated graph with memory consumption for the last 2 days; the red line is the upgrade to 1.29, and you can see it’s pretty much the same. Since the upgrade to 1.29 we have had 0 missed slots (about 22 hours now), but we updated after the epoch boundary.

FYI, that VM now has 10 GB, but we can hot-add memory easily.

Stats for our relay after a few hours on 1.29.0, with 12 GB of RAM allocated:

[screenshot: relay memory usage]

Looks like anything with 8 GB or less is going to be swapping more.

FWIW, this relay has 23 outbound connections.
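If anyone is running an 8 GB box with no swap configured at all, here is a minimal sketch for adding a swap file (size and swappiness are just example values, not tuned recommendations):

```bash
# Create and enable an 8 GB swap file, make it persistent, and lower the
# kernel's eagerness to swap (example values only):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo sysctl vm.swappiness=10
```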

Seeing similar numbers with a similar setup: 3 relays tracking mid-to-high 80s for memory usage, with the BP tracking mid-to-high 70s, all with 8 GB of RAM. It does seem higher than 1.27.0. This is in the cloud. I may have to add a couple of GB to each just for peace of mind, which is a few more dollars a month.

Glad you posted this, or I may not have noticed. I was thinking of switching the producer from the cloud to an RPi, but that idea might have to go out the window now in favour of a bare-metal setup instead.

Well, I’m running my relay node on a 16 GB server and it has 1.4 GB of free memory. With the monitoring, cncli jobs, etc., it adds up.


Memory is always low after a restart. Then it starts caching and increasing.
It always starts at around 4 GB for both the core and the relays, then eventually stabilises at around 8 GB on 1.28.0 and around 9-10 GB on 1.29.0.
That’s what it is for me anyway…
The memory stats right now:
relay: 9.84 GB
spare: 10.66 GB
core: 8.981 GB
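If it’s useful to anyone, this is roughly how I track the climb after a restart (it assumes a single cardano-node process on the host):

```bash
# Append the node's resident memory to a log every 5 minutes (RSS is in KB);
# stop with Ctrl-C.
while true; do
  echo "$(date -Is) $(ps -o rss= -C cardano-node | awk '{printf "%.2f GB", $1/1048576}')" >> node-mem.log
  sleep 300
done
```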

0 missed slots during the epoch - correct. Same here.
An increased number of missed slots during the epoch changeover, though.

Interesting…
Just upgraded to the latest commit of 1.29.0.
Memory consumption is back to normal, around 6.5 GB.
Go figure.
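For anyone who wants to try the same, this is roughly my build-from-source flow (it assumes GHC/cabal and the usual prerequisites are already installed; substitute whatever commit or tag you want to test):

```bash
cd cardano-node
git fetch --all --tags
git checkout <commit-or-tag>    # e.g. the 1.29.0 tag or a newer commit on it
cabal build cardano-node
# Locate the freshly built binary and confirm the version:
find dist-newstyle -type f -name cardano-node -exec {} --version \;
```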