Missed Slot Leader %

I’ve been struggling to get my missed slot leader numbers down.
I followed “How to set up a POOL in a few minutes - and register using CNTOOLS”.

Have read/tried pretty much all I could find on the boards in regards to this.

I’ve got 3 servers on Contabo with below stats.
Setting clientloglimit 100000000 was my latest change. It appears to have reduced the misses (they were typically as high as 2%), but the count is still incrementing. I’ve been researching timing because one node shows more time sync errors than the other two (the setup is the same on all three).
Any other suggestions, or are these unavoidable to a degree?

My latest change was to chrony, adding clientloglimit 100000000

WARNING: you should run this program as super-user.

H/W path     Device  Class      Description
===========================================
                     system     Computer
/0                   bus        Motherboard
/0/0                 memory     29GiB System memory
/0/1                 processor  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
/0/100               bridge     440FX - 82441FX PMC [Natoma]
/0/100/1             bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1           storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.2           bus        82371SB PIIX3 USB [Natoma/Triton II]
/0/100/1.3           bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2             display    VGA compatible controller
/0/100/3             generic    Virtio memory balloon
/0/100/3/0           generic    Virtual I/O device
/0/100/5             storage    Virtio SCSI
/0/100/5/0           generic    Virtual I/O device
/0/100/12            network    Virtio network device
/0/100/12/0  eth0    network    Ethernet interface
/0/100/1e            bridge     QEMU PCI-PCI bridge
/0/100/1f            bridge     QEMU PCI-PCI bridge

pool time.google.com       iburst minpoll 2 maxpoll 2 maxsources 3 maxdelay 0.3
pool time.facebook.com     iburst minpoll 2 maxpoll 2 maxsources 3 maxdelay 0.3
pool time.apple.com        iburst minpoll 2 maxpoll 2 maxsources 3 maxdelay 0.3
pool ntp.ubuntu.com        iburst minpoll 2 maxpoll 2 maxsources 3 maxdelay 0.3

# This directive specify the location of the file containing ID/key pairs for
# NTP authentication.
keyfile /etc/chrony/chrony.keys

# This directive specify the file into which chronyd will store the rate
# information.
driftfile /var/lib/chrony/chrony.drift

# Uncomment the following line to turn logging on.
#log tracking measurements statistics

# Log files location.
logdir /var/log/chrony

# Stop bad estimates upsetting machine clock.
maxupdateskew 5.0

# This directive enables kernel synchronisation (every 11 minutes) of the
# real-time clock. Note that it can’t be used along with the ‘rtcfile’ directive.
rtcsync

# Step the system clock instead of slewing it if the adjustment is larger than
# one second, but only in the first three clock updates.
makestep 0.1 -1

# Get TAI-UTC offset and leap seconds from the system tz database
leapsectz right/UTC

# Serve time even if not synchronized to a time source.

local stratum 10
clientloglimit 100000000
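
To compare clock quality across the three nodes, one option is to read the current offset from `chronyc tracking` on each of them. Below is a minimal Python sketch, assuming the usual human-readable "System time : ... seconds fast/slow of NTP time" line. The function names and the sample text are illustrative assumptions, not output from the nodes above.

```python
import re
import subprocess

# Matches the "System time" line printed by `chronyc tracking`, e.g.
#   System time     : 0.000006523 seconds slow of NTP time
_SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
    re.MULTILINE,
)


def system_offset_ms(tracking_text: str) -> float:
    """Return the clock offset in milliseconds (positive = fast, negative = slow)."""
    match = _SYSTEM_TIME.search(tracking_text)
    if match is None:
        raise ValueError("no 'System time' line found in chronyc output")
    seconds = float(match.group(1))
    sign = 1.0 if match.group(2) == "fast" else -1.0
    return sign * seconds * 1000.0


def read_offset_ms() -> float:
    """Run `chronyc tracking` on this machine and parse its output."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return system_offset_ms(result.stdout)


# Hypothetical sample text, for illustration only.
SAMPLE = """\
Stratum         : 3
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
"""

if __name__ == "__main__":
    print(f"sample offset: {system_offset_ms(SAMPLE):.6f} ms")
```

Running `read_offset_ms()` periodically on each node and comparing the values would show whether the one node with more time sync errors really drifts further than the other two.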

Hi,

This (missed slots) can be related to servers running on VPS services… I had them too before I migrated my BP to a dedicated/bare metal server, but I never lost a block.

Time synchronization looks ok (my BP has 15 ms)…
You can try to edit cnode.sh script on all nodes (+ restart) and check again; uncomment the lines

CPU_CORES=4
MEMPOOL_BYTES=8388608 

Cheers,

Forgot to mention I did set CPU_CORES to 8 (8 core on Contabo), then tried 6.
Now set to 4 (is that the max it should be set to?).
Uncommented MEMPOOL, we’ll see.
Thanks Alex.

Contabo is the problem for your missed slot leader checks. I don’t think you can find a worse hosting.

If you are using a VPS then you are sharing CPU resources with other users. This means that if your node does not get scheduled a CPU right at the start of each new slot then it will skip doing that slot leadership check.

If you use dedicated hardware then you can ensure that some CPUs are dedicated to your block producer so that they are always available. Nevertheless, you can still get missed leadership checks when Haskell does its garbage collection memory copying because this also can tie up the processor at the wrong time.

One way around the garbage collection cause is to then run your node with RTS settings that make the Haskell executable use the “nonmoving” garbage collector: “+RTS -N --nonmoving-gc -RTS”. But this has other problems: Using the nonmoving-gc results in freed memory not being returned to the OS properly so then you need to restart your node every day or so. Arrrgh!
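
Because the nonmoving collector is described above as not handing freed memory back to the OS, one alternative to restarting on a fixed schedule is to watch the node's resident memory and restart only when it grows too large. Below is a rough Python sketch, assuming Linux and its `/proc/<pid>/status` format. The function names and the limit are hypothetical, and the code only reports; it does not restart anything.

```python
from pathlib import Path


def parse_vmrss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB, as the kernel reports it) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:\t  123456 kB"
            return int(line.split()[1])
    raise ValueError("VmRSS not found in status text")


def needs_restart(pid: int, limit_kb: int) -> bool:
    """Return True when the process's resident memory exceeds limit_kb (hypothetical policy)."""
    status = Path(f"/proc/{pid}/status").read_text()
    return parse_vmrss_kb(status) > limit_kb
```

A cron job or systemd timer could call `needs_restart` with the node's PID and a limit chosen for the machine, then trigger the restart through whatever service manager runs the node.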

I wish IOG would upgrade the code base to compile under ghc version 9.2.x instead of the older 8.10.7 because apparently the Haskell nonmoving-gc works properly in 9.2.x or later.

I run my block producer on a low power machine and I get no missed leadership checks at all when running with the nonmoving-gc. I just restart often.

Others on this forum have different opinions on how to minimise missed leadership checks. If you do some searches on this forum you will find numerous posts about the problem.

Do NOT uncomment the MEMPOOL, and when it comes to Contabo you could have 1000 vCPUs,
IF the host-node is overshared you get nothing.

Contabo uses Desktop-PC (so no real server hardware) and that hardware is overshared on top of that.
So, basically, you can be lucky and end up on a new host-server and get some resources, but this being lucky ends the moment the host-server gets filled up with customers. And when all other customers have high loads, while you run your nodes, then your VPS just gets like nothing at all.

Imagine you took a VPS from them with 8 vCPUs, which are basically CPU-Threads, and on an overshared host-server you maybe get 5% out of 1 CPU-Thread, that means with all your 8 vCPUs you get maybe around ~40% of 1 CPU thread,…

So long story short, if you want to run your Pool seriously, do NOT use Contabo.

It’s like crossing the ocean in a plane fully made out of paper on a rainy day - good luck!

Come on! Seriously? My relay is on Contabo (VPS) and I don’t have any issues… maybe it’s not the best provider, but for a small pool the price is more than OK.

I can see the differences (CPU) only when I’m building/compiling the node… but to mint blocks there is no need for powerful CPU

I am also using Contabo and have never had a problem. For the block producer I am using a VDS (S). Having dedicated CPUs for that particular time-constrained job is good enough. Together with some tweaking of the Haskell runtime, missed slot leader checks are no problem.

How do you come to that conclusion? Never had any problems with Contabo. IMO they have excellent network performance.

This is not true. And I’ve never had the experience you call “overshared”. Whatever your problems you have with Contabo - spreading FUD here is - IMO not ok.

I agree. I haven’t found it too bad. I run an extra relay on Contabo and it has been OK. I run my block producer on bare metal.

This is the same for all VPS - as it follows the definition of VPS. If you want bare metal you can order that from Contabo as well. If you order VPS from any of the other hosting service provider you don’t get bare metal either.

I went for a VDS for the block producer myself and have never missed a scheduled block. Losing a height battle does not count here.

A Cardano node does not need a lot of bandwidth; my nodes use less than 1 Mbps on average. Latency is very important, of course, and you need more bandwidth to transfer the blocks quickly, but having “good network performance” is easy. Good CPU and disk performance are more difficult to provide.
Just take a look at the “stolen” CPU here in this vmstat output, which is from a Contabo VPS with 32 GB RAM struggling to run a cardano-node:
[Image: ctb1_vmstat_stolen]

I seem to be getting better results on my Contabo VPS relay with 16G Ram and 6 cpus:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache     si   so    bi    bo   in   cs  us sy id wa st
 0  0    780 5222196   7064 2834084    0    0     7    47    1    3  2  1  97  0  1
 0  0    780 5222228   7072 2834076    0    0     0   228  244  373  0  0 100  0  0
 1  0    780 5222068   7072 2834084    0    0     0   316  447  638  1  0  99  0  0
 0  0    780 5222036   7072 2834084    0    0     0     0  326  452  0  0 100  0  0
 0  0    780 5222036   7072 2834084    0    0     0     0  166  287  0  0 100  0  0
 0  0    780 5222036   7072 2834084    0    0     0     0  140  247  0  0 100  0  0
 0  0    780 5222004   7072 2834084    0    0     0     0  139  243  0  0 100  0  0
 0  0    780 5222036   7080 2834076    0    0     0    36  131  219  0  0 100  0  0
 0  0    780 5222036   7080 2834084    0    0     0     0  215  368  0  0  99  0  0
 0  0    780 5222068   7080 2834084    0    0     0     0  439  689  1  0  99  0  0
 0  0    780 5222004   7080 2834084    0    0     0     0  784 1138  1  0  99  0  0
 0  0    780 5222004   7080 2834084    0    0     0     0  147  260  0  0 100  0  0
 0  0    780 5220684   7080 2834084    0    0     0    32  186  259  0  0 100  0  0
 0  0    780 5220684   7080 2834084    0    0     0     0  160  283  0  0 100  0  0
 0  0    780 5220772   7080 2834084    0    0     0     0  518 4180  2  0  96  0  1
 0  0    780 5220740   7080 2834084    0    0     0     0  358  514  0  0  99  0  0
 0  0    780 5220708   7088 2834084    0    0     0    16  518  734  2  0  97  0  1
 0  0    780 5220708   7088 2834084    0    0     0     0  256  376  0  0 100  0  0
 0  0    780 5220408   7088 2834156    0    0     0     0 1134 1632  3  1  95  0  1
 0  0    780 5220488   7088 2834164    0    0     0     0  236  350  0  0 100  0  0

There is some “stolen” cpu time but not like in your output.
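
To compare steal time between machines without eyeballing the columns, the `st` values can be pulled out of saved `vmstat` output. Below is a small Python sketch; the function name is an assumption, and the sample rows are copied from the relay output above. Note that the first row `vmstat` prints is an average since boot, not a live sample.

```python
from typing import List


def steal_column(vmstat_text: str) -> List[int]:
    """Return the 'st' (steal) percentage from every sample row of vmstat output."""
    values: List[int] = []
    st_index = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "st" in fields and "id" in fields:
            # Column-name header row: remember where 'st' sits.
            st_index = fields.index("st")
            continue
        is_data_row = all(field.isdigit() for field in fields)
        if st_index is not None and is_data_row and len(fields) > st_index:
            values.append(int(fields[st_index]))
    return values


SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache     si   so    bi    bo   in   cs  us sy id wa st
 0  0    780 5222196   7064 2834084    0    0     7    47    1    3  2  1  97  0  1
 0  0    780 5222228   7072 2834076    0    0     0   228  244  373  0  0 100  0  0
 1  0    780 5222068   7072 2834084    0    0     0   316  447  638  1  0  99  0  0
"""

if __name__ == "__main__":
    steal = steal_column(SAMPLE)
    print("max steal %:", max(steal), "samples:", len(steal))
```

Feeding it a longer capture, for example from `vmstat 1 600` redirected to a file, gives a maximum and a distribution that can be compared between a VPS and dedicated hardware.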

This is for my block producer on “bare metal”, dedicated CPUs:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache      si   so    bi    bo   in   cs us sy id wa st
 0  1  81140 2673184     20 13574640    3   10   250   113    6    7  3  1 97  0  0
 0  1  81140 2673208     20 13574712    0    0     0   184  153  234  1  0 98  2  0
 0  0  81140 2673208     20 13574712    0    0     0     0  202  323  1  0 99  0  0
 0  0  81140 2673208     20 13574712    0    0     0     0  204  323  1  0 99  0  0
 0  0  81140 2673208     20 13574712    0    0     0     0  243  409  1  0 99  0  0
 0  0  81140 2673208     20 13574712    0    0     0     0  388  551  2  0 98  0  0
 0  0  81140 2673240     20 13574712    0    0     0     0 1635 3149  3  1 96  0  0
 0  0  81140 2673240     20 13574712    0    0     0     0  150  210  1  0 99  0  0
 0  0  81140 2673176     20 13574712    0    0     0     0  173  276  1  0 99  0  0
 0  0  81140 2669644     20 13574712    0    0     0     0  198  255  1  0 99  0  0
 0  0  81140 2669676     20 13574712    0    0     0     0  252  338  2  0 98  0  0
 0  0  81140 2669644     20 13574712    0    0     0     0  274  402  1  0 99  0  0
 0  0  81140 2669644     20 13574712    0    0     0     0  266  402  1  0 99  0  0
 0  0  81140 2669644     20 13574712    0    0     0    88  468  684  3  0 96  1  0
 0  0  81140 2667628     20 13574824    0    0     0   388  654  797  5  1 93  1  0
 0  0  81140 2667628     20 13574824    0    0     0     0  143  214  1  0 99  0  0
 0  0  81140 2667660     20 13574824    0    0     0     0 1172 2239  2  1 97  0  0
 0  0  81140 2667660     20 13574824    0    0     0     0  173  264  1  0 99  0  0
 0  0  81140 2667660     20 13574824    0    0     0    96  243  354  1  0 99  0  0
 0  0  81140 2664160     20 13574824    0    0     0     0  274  357  1  0 99  0  0

You are probably lucky, at least for now. Anyway, I tested many hosting providers, and Contabo was the only one where I saw stolen CPU. And most of the people complaining about many missed slot leader checks are using Contabo. For other applications, Contabo might be ok, but I would prefer any other hosting for a cardano node, after what I’ve heard and seen about them.

There’s a simple solution: go for a bare metal host. That’s also available from Contabo, and you can be sure the machine is yours only. I went for a VDS myself, which does not share CPUs and can therefore produce a block in time. When I started operations I spent quite some time (actually weeks) optimizing to reduce missed slot leader checks. After moving to the VDS and optimizing the Haskell runtime parameters, the BP was fine and I have not missed a scheduled block since.

I’m going to challenge that; it actually does happen on all VPS, because that is the nature of a VPS. You don’t get dedicated CPUs - hence the name virtual in VPS.

Over and over I’ve seen people using AWS, Azure, and other services having their nodes crashing because of limited resources which includes RAM as well. IMO Contabo sells what they promise - not more not less. So far, I haven’t found any VPS hosting services with guaranteed real time performance.

Don’t get mad at me, but IMO it’s not the fault of Contabo, that you did not do your homework to set up your system properly. Others did, and by the number of Contabo hosted Cardano pools they seem to be successful.

And, btw., I’m not in any way affiliated with Contabo, and if you don’t like their services then use some other one. But I don’t like bad-mouthing some company because of your own mistakes.

No, it does not happen on every VPS. I tested many providers. Try Hetzner and see the quality. There are also other providers, a lot more expensive, which also have excellent quality. I also have a few VPSes on rented hardware in a DC and there are no missed slot leader checks.
And I also have bare metal servers with SSD disks where I see missed slot leader checks.

What is my mistake? That I tried to run on Contabo the same thing I run on other providers without any issues?