Upgraded all nodes to 1.29.0 yesterday. After approximately 3 hours of running, I finally encountered the much-dreaded alert from my system telling me that the relay nodes are running out of memory (around 1.5% left). The nodes were able to run on the previous version (1.27.0) without burning as much memory.
These are the configs:
Producer and relay nodes run on 8 GB RAM
“TraceMempool” is set to false on the relay node
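In case it helps others compare, here is a minimal Python sketch for double-checking which tracers are still enabled in a node's config.json (the path is just an example; adjust it for your setup):

```python
import json

# Example path -- adjust to wherever your node's config.json lives.
CONFIG_PATH = "/opt/cardano/cnode/files/config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

# Print every tracer that is still switched on; TraceMempool in
# particular is the one I disabled on the relay to save memory.
for key, value in sorted(config.items()):
    if key.startswith("Trace") and value is True:
        print(f"{key} is enabled")

print("TraceMempool:", config.get("TraceMempool", "not set"))
```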
Relay node memory consumption is around 86%-90%, while the producer node is around 81%-85%. The increase is likely due to the relay having more peers to connect to.
With the Alonzo upgrade around the corner, I fear that 8 GB RAM is not going to be sufficient. So, fellow SPOs, what are some of the memory optimization tips that you have used?
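For anyone who wants a similar low-memory alert, a minimal sketch using the psutil package looks like the following; the 5% threshold and 60-second interval are arbitrary choices, not what my actual monitoring uses:

```python
import time

import psutil  # third-party: pip install psutil

THRESHOLD_PCT = 5.0  # alert when less than 5% of RAM is left (arbitrary)
INTERVAL_SEC = 60

while True:
    mem = psutil.virtual_memory()
    free_pct = 100.0 - mem.percent
    if free_pct < THRESHOLD_PCT:
        # Swap this print for your real alerting channel (mail, Telegram, ...)
        print(f"ALERT: only {free_pct:.1f}% RAM left "
              f"({mem.available / 2**30:.2f} GiB available)")
    time.sleep(INTERVAL_SEC)
```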
Thank you for the early feedback. I was just pondering whether to conduct a test!
Although we’re ready with >150 GB of spare RAM capacity for our bare-metal VMs, I am very curious how many nodes and SPOs currently hosted by the major cloud providers won’t be able to make the cut due to the higher running costs, and how many RPi SPOs might not be able to upgrade hardware.
If anyone can provide updated links for that sort of info, I would greatly appreciate it.
Edit: My purely speculative thought is that by the end of Basho, the network will be a LOT closer to that 500-node target than it is today. I am thinking around half of what there is today.
I have two relays running 1.29.0 that I set up this morning. One is using 10.8/24 GB of memory, the other is using 18.7/24 GB. The core node is using 18.9/32 GB. My Xen server has 256 GB of RAM, so I have plenty. I couldn’t afford to do this in the cloud; Comcast Business-class internet service is solid, so I can run my own at home. Note that all three servers also use ZFS to replicate/back up, which could account for a big chunk of that memory. bpytop reports each node is using 7-8 GB of memory.
Sidenote, I keep forgetting this: I moved my core node to a Dell PC with 32 GB. It was the only way I could get it to stop racking up missed slots. Cardano is the only thing on that particular server. I also haven’t yet tried any memory optimizations.
Mouahahaha, probably bad pools in the cloud, yes. But there are still plenty of proper ways to do it, guys.
Bring it on, 16 GB is cheesecake! aarch64 architecture FTW.
All jokes apart, what I say is still very true: double the cloud cost and it is still not that expensive.
Bare-metal operators have to:
pay for internet (which can fail easily; most probably a single connection from a single provider)
pay for electricity (same here; you might, and probably will, have an outage one day)
and finally pay for the cherished single bare-metal server (definitely not HA hardware-wise, but not the only SPOF, as seen).
Anyway, that’s not the subject, but trust me: a properly done pool (on AWS, in my case) is definitely not expensive to run and is so much more professional than a home server.
I’m saying all this from a single, not rich, pool that refuses to do dumb and repetitive marketing on social media.
If it really chews more than 8 GB, the ones that should worry are the Raspberry Pi nodes, IMHO…
Hello, I wouldn’t say the new minimum is 16 GB; memory doesn’t have to come in powers of 2. I’d say that 10 GB should be the new minimum.
Having said that, I don’t think bare-metal servers are cheaper at all; quite the opposite. We run a bare-metal HA cluster (3 servers), housed in a professional datacenter with redundant internet and power. It’s far more expensive than a cloud provider, but it brings some advantages, such as helping decentralize Cardano even more: in a catastrophic scenario where a whole cloud provider fails (hello OVH), or where a provider decides to ban a whole technology (hello Parler), dedicated bare-metal servers would prevail.
BTW, speaking of memory, we upgraded to 1.29 even on mainnet, and memory usage seems identical to what it was before:
relays - ~8 GB used
producer node - ~6 GB used
But it might increase with the Alonzo fork. Keep in mind that relay memory usage increases drastically with too many nodes configured in the topology. AFAIK the recommendation is about 20 nodes in the topology; we have 25.
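If you want to quickly check how many outgoing peers you have configured, a minimal sketch against the pre-P2P topology format looks like this (the path is an example, and the 20-peer cutoff just mirrors the rule of thumb above):

```python
import json

TOPOLOGY_PATH = "topology.json"  # example path -- adjust for your relay

with open(TOPOLOGY_PATH) as f:
    topology = json.load(f)

# Pre-P2P topology files list outgoing peers under "Producers".
peers = topology.get("Producers", [])
print(f"{len(peers)} outgoing peers configured")
if len(peers) > 20:
    print("Above the ~20-peer rule of thumb; trimming may reduce memory use.")
```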
Out of curiosity, what is the memory breakdown on a 1.29.0 node?
I mean used / cache / buffer / free? Just a rough idea; it doesn’t have to be precise.
From the kernel’s point of view? I don’t think that’s such useful information, as most of the cardano-node process memory is resident (active / inactive), but here you are - the red line is the upgrade from 1.27 to 1.29.
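If you want the same rough used / cache / buffer / free numbers without a graphing stack, this minimal Linux-only sketch reads them straight from /proc/meminfo:

```python
# Rough memory breakdown, straight from /proc/meminfo
# (Linux only; values in the file are reported in kB).
fields = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, rest = line.split(":", 1)
        fields[key] = int(rest.split()[0])

for key in ("MemTotal", "MemFree", "MemAvailable", "Buffers", "Cached"):
    print(f"{key:>12}: {fields[key] / 2**20:6.2f} GiB")
```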
That’s interesting, did you try restarting cardano-node again? I remember it sometimes “starts weird” and consumes way more memory than usual; a restart typically brings it back down to normal levels. Also, changes to the topology on a relay may affect memory usage (especially if you increased the number of active outgoing connections; even replacing dead nodes with live ones may result in a memory increase).
Here is an updated graph with memory consumption for the last 2 days; the red line is the upgrade to 1.29, and you can see it’s pretty much the same. Since upgrading to 1.29 we have had 0 missed slots (about 22 hours now), but we updated after the epoch boundary.
Seeing similar numbers with a similar setup: 3 relays tracking mid-to-high 80s for memory usage, with the BP tracking mid-to-high 70s, all with 8 GB of RAM. It does seem higher than 1.27.0. This is in the cloud, so I may have to add a couple of GB to each just for peace of mind, which is a few more dollars a month.
Glad you posted this, or I may not have noticed. I was thinking of switching the producer from the cloud to an RPi, but that idea might have to go out the window now in favour of a bare-metal setup instead.
Memory is always low after a restart; then it starts caching and increasing.
It always starts at around 4 GB for both the core and the relays, then eventually stabilises at around 8 GB on 1.28.0 and around 9-10 GB on 1.29.0.
That’s what it is for me, anyway…
The mem stats right now:
relay: 9.84 GB
spare: 10.66 GB
core: 8.98 GB
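If you want to watch that post-restart climb yourself, here is a minimal sketch using the psutil package that logs the cardano-node resident set size once a minute (the process name match is an assumption; adjust it if yours differs):

```python
import time

import psutil  # third-party: pip install psutil

def node_rss_gib():
    """Resident set size of the first process named cardano-node, in GiB."""
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "cardano-node":
            return proc.info["memory_info"].rss / 2**30
    return None

# Log once a minute; run it right after a restart to watch the climb
# from ~4 GB towards the stable 8-10 GB mentioned above.
while True:
    rss = node_rss_gib()
    print(time.strftime("%H:%M:%S"),
          f"{rss:.2f} GiB" if rss is not None else "cardano-node not running")
    time.sleep(60)
```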
0 missed slots during the epoch - correct. Same here.
An increased number of missed slots during the epoch changeover, though.