BP is missing slots since upgrading to 1.27.0

Dear Cardano Community,

My BP has been missing slots at an average rate of 2.5 per hour since I upgraded to 1.27.0.

I already went through several threads on the Cardano forum, but I couldn’t find any conclusive solution to this issue. As suggested in many related topics, I moved my BP from a VPS to a bare-metal server and increased the RAM from 8GB to 16GB. However, I’m still missing slots at a rate similar to before.

My current setup is:

BP → dedicated server at home: 4 cores, 16GB RAM, 15GB swap and 256GB SSD (Zurich, Switzerland)
Relay 1 → dedicated server at home: 4 cores, 8GB RAM, 7GB swap and 256GB SSD (Zurich, Switzerland)
Relay 2 (Backup) → VPS in a data center: 4 cores, 8GB RAM, 7GB swap and 80GB SSD (Western Germany)
Internet connection: 1 Gbit/s up/down

If I run journalctl -e -f -u cnode.service on my BP and relays, I don’t see any resource-related issues, such as cnode.service being terminated with SIGKILL, for instance.
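
In case it helps others reproduce the check, something along these lines should surface any OOM kills (assuming the node runs as cnode.service under systemd; the 24-hour window is just an example):

  # look for the service being killed or the kernel running out of memory
  journalctl -u cnode.service --since "24 hours ago" | grep -iE "sigkill|killed|out of memory"
  # kernel log, in case the OOM killer fired outside the service's own journal
  sudo dmesg -T | grep -i "out of memory"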

I’m an SPO and my pool is extremely small (85k live stake), so I want to minimize the chances of missing any of the few blocks that will be assigned to my pool.

Is this a known issue? Am I missing something in my setup? Do I need to change any ENV parameters to improve my BP’s performance?

Many thanks in advance for the help!

Hi, we have a very similar pool setup and stake, and the same issues. I partially solved them by upgrading the BP CPU from 2 cores to 4.
My missed-slot count dropped a lot, but slots are still being missed. Since I wasn’t scheduled for any blocks last week, I monitored my setup closely and found that the old 2-core BP, now acting as a relay, sometimes drops its connection to the new BP, and this happens quite frequently. When it does, the chain tip difference on the BP starts increasing.
I’ll try switching off this relay to see if it’s responsible for both the old and the new issues.

My next attempt is connecting the BP directly to my router instead of to the “stakepool switch” it shares with the other relays, to minimize latency as much as possible and keep the relays from stealing bandwidth from the BP through the honestly-quite-old switch they’re all connected to right now…

To save RAM, I set TraceMempool to false on the BP; you might consider that if you haven’t done it yet. I was considering increasing the RAM, but from your report and others’, that hardly seems to be the issue… I’ll let you know!
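
For reference, it’s a single key in the node’s config.json; a one-liner like this flips it (assuming the stock formatting of the key, and the node needs a restart to pick it up):

  # flip TraceMempool from true to false in place, keeping a backup of the original file
  sed -i.bak 's/"TraceMempool": true/"TraceMempool": false/' config.json
  sudo systemctl restart cnode.service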

Hi! Thank you for your answer. I was also considering upgrading the RAM of my relay in the data center to 16GB to check whether that helps. My CPU activity looks quite stable on all of my nodes (between 3% and 8% over the last 5 days), but for the sake of the investigation, I’ll set the TraceMempool option to false in config.json on my BP and on one of my relays and monitor whether that helps a bit. How are you currently checking the latency on your nodes? I had the same idea of connecting my servers at home directly to my router. Unfortunately, that didn’t help at all. Roughly how many slots are you currently missing per hour?

I might have found the solution, or at least a solution that is working perfectly for me, and it involves my network architecture.
Originally, all stake pool traffic was prioritized on my router and sent to the unmanaged stakepool switch.
That was my clue: the BP has 3 connections to my 3 relays (2 bare metal, 1 cloud), while the 2 bare-metal relays have tens of connections each. All of those are prioritized by the router above all other internet traffic, but once they reach the switch, which has to handle around 70 connections with no QoS capabilities, the BP’s connections get queued in hardware and its priority is effectively lost, causing missed slots.
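
To sanity-check that kind of thing, the basic tools already go a long way (the relay hostname below is just a placeholder):

  # round-trip latency from the BP to a relay
  ping -c 20 relay1.example.com
  # established TCP connections of the node, to see who the BP is actually talking to
  sudo ss -tnp | grep cardano-node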

So I ran a new cable from the BP straight to the router and restarted the node just before the epoch change. The results? 0 missed slots until the epoch change, 182 at the epoch change (which is fine from what I’ve read), all of them in the first hour after the transition, and none since then!! Finally!!!

So my two cents of suggestions for a bare-metal setup are:

  • connect the BP to proven managed network hardware so you can control priority properly, or
  • connect the BP to the router over a dedicated cable, with no switches in between
  • relays don’t seem to be affected by whichever network point you plug them into

Hope this is useful for someone else; let me know if it is!
Vig@VICISpool

Thank you for sharing, and I’m glad you found a solution! I’ll try this out. Setting the TraceMempool option to false definitely improved the situation for me as well, although I’ll probably have to upgrade my current router because it doesn’t let me prioritize traffic per port. I’ll keep this thread posted as soon as I have relevant results.

I notice the same behavior on my nodes as well. They are all Raspberry Pis with 8GB RAM running at my house.
I also doubt that this has to do with low memory, as my nodes are not running at their limit. On average, RAM consumption ranges between 55% and 68%, which I would say is acceptable.
I’ll do some further investigation and report back with any findings that might be interesting for you as well.

You might want to check this out:

@ADA4Good Thank you so much for sharing this great article! I applied this fix right away :blush:. I will flag this as the solution after monitoring my nodes for a couple of days to confirm that the issue was properly solved.

You are welcome. Also note that our alliance is still working to understand the situation better; we don’t think this is the final solution. Follow the other thread mentioned above to stay up to date.

Yeah, thank you for sharing it. So if I understand correctly, the bottom line of the post is to start cardano-node as cardano-node +RTS -N … run, in order to use all 4 cores.
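
Spelled out, I suppose the start command would look roughly like this (the paths follow the guild-operators cnode layout and are only an example; -N4 matches a 4-core machine, and the BP additionally needs the usual KES/VRF/operational-certificate flags):

  # -N4 tells the GHC runtime to use 4 capabilities; match it to your core count
  cardano-node +RTS -N4 -RTS run \
    --topology /opt/cardano/cnode/files/topology.json \
    --config /opt/cardano/cnode/files/config.json \
    --database-path /opt/cardano/cnode/db \
    --socket-path /opt/cardano/cnode/sockets/node0.socket \
    --port 6000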

I don’t want to be the party crasher, but I’ve been running my node for more than 3 weeks with this configuration (for other memory-optimization reasons): cardano-node +RTS -N4 --disable-delayed-os-memory-return -qg -qb -c -RTS run, and I still have around 10-20 missed slots per hour.

I’m interested to hear more about your experience.

What we’ve learned (for now) from our experiments is that whenever one of the cores gets to 100% or close to it, you seem to miss a slot. So there are basically two ways to mitigate this:

  1. use more cores
  2. use faster cores.

And this second point is not insignificant. On a shared-vCore VPS where the CPUs are heavily oversubscribed, you might have 8 cores, but they might not be fast enough. Conversely, a super-fast bare-metal machine with only 2 cores assigned might perform better than the 8-core shared VPS.
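
A quick way to see whether a single core is pegging while the node runs (mpstat comes from the sysstat package; the 5-second interval is arbitrary):

  # per-core CPU utilisation, refreshed every 5 seconds
  mpstat -P ALL 5
  # or run top and press 1 for the per-core view
  top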

@ADA4Good I closely monitored my nodes in epoch 282, and after applying your recommended fix I had no missed slots for 4 days in a row. It’s worth mentioning that I currently have the TraceMempool option in config.json set to false. After the epoch change to 283, I’ll switch this setting to true and restart my nodes to see whether it makes a difference. Just 8 missed slots randomly appeared yesterday, but that might have been due to the heavy load on the network (perhaps caused by the rising prices in the crypto market). I’ve marked your answer as the solution since this is the closest I’ve ever been to solving this annoying issue. Thanks again for the amazing input! :slight_smile:

This is very annoying; we can’t find a definitive solution for these “missed slots”.
I’m running on a VPS, so I have no way or means to apply the solutions you are describing.
At the same time, there’s no checklist for a pool to see why it’s not minting blocks.

@CheffWallets you can still try the solution recommended by ADA4Good. Data centers that offer VPS services usually include a certain number of processors for computing. If your BP is running on a VPS with 4 cores, for example, cnode will only use 2 of them due to its default setting, so changing this setting and adding swap to your server in case you have less than 16GB RAM (see the sketch at the end of this post) could make a difference. Bare-metal servers are still the best option for the BP node, though; backups and relays can run on a VPS without trouble.

If your pool is not minting any blocks, it could be due to insufficient active stake: my pool is quite small (~100k ADA) and I’ve gotten just 3 blocks in the last 6 months.

You can check how many slots your BP is currently missing with this command: curl localhost:12798/metrics | grep "cardano_node_metrics_slotsMissedNum_int". You can also run the gLiveView.sh script to see whether you have missed any blocks; that information is also shown by the command above if you remove the pipe and everything after it.

If you are not missing more than 3 slots per hour (not counting epoch changes), your KES key is up to date, and your BP is properly configured and connected to your relays (you should see the TX counter increasing in gLiveView), then you are probably just having bad luck with the protocol, and getting more delegators should do the trick. I hope this helps! :slight_smile:
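
For the swap part, a minimal sketch on a Debian/Ubuntu box (the 8G size is only an example, tune it to your RAM, and skip this if a swap file already exists):

  # create and enable an 8GB swap file
  sudo fallocate -l 8G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  # make the swap file persistent across reboots
  echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab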

I wasn’t aware that running on a dedicated server could be so expensive… the best offer I could find would cost me 128 euros a month, and that’s a lot of money for a pool that doesn’t mint blocks.
gLiveView doesn’t seem to show any missed blocks; they are all 0.

The output of the command is:

curl localhost:12798/metrics | grep "cardano_node_metrics_slotsMissedNum_int"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2008    0  2008    0     0   980k      0 --:--:-- --:--:-- --:--:--  980k
cardano_node_metrics_slotsMissedNum_int 474

With 128$ you can buy a mini PC to run the producer at home, or look for an alternative VPS provider.

I used Contabo, for example (now I have only one relay in the cloud, for backup).

I totally understand. You should also consider that if you have a live stake over 900k, you will be earning at least 340 ADA per minted block, so if your pool mints 1 block every month, it will still be profitable at today’s prices. Sometimes you have to invest a bit to get a return. To avoid these costs, I decided to keep my stake pool’s main nodes in my basement; I have the infrastructure for it, though. This choice is cheaper, but it also has pros and cons, and having a couple of backup servers in data centers is almost inevitable. By the way, the recommended solution also works if you have the TraceMempool option set to true. That occasionally causes some missed slots, but it’s not that bad. I decided to turn the option off again to keep the number of missed slots to a minimum. Good luck!
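
(Back-of-the-envelope with the numbers from this thread: 340 ADA a month covers a 128-euro server bill as long as ADA trades above roughly 128 / 340 ≈ 0.38 EUR.)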