unfortunately, no.
Not now. When you see gaps in Grafana,
try journalctl -e -f -u cardano-node | grep kill
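If that turns up nothing, it might also be worth checking the kernel log for the OOM killer (a quick sketch, assuming you can read kernel messages via journald):
journalctl -k | grep -i "out of memory"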
still nothing unfortunately
PS: thanks for all the help, really appreciate it!
Well, I’m out of ideas. Just turned off mempool tracing, which fixes the dropped metrics, but I’m not a fan of not being able to see the tx count.
Did you manually change anything in Prometheus on your old server that needs to change now on the new server? Added scrape targets? Lowered memory? If Prometheus has too little memory or goes into rushed mode, it may explain that graph.
Also, is it possible that you bought a 16 GB VPS and they gave you less, or that it will only switch to the new VPS config next payment cycle, or that there is some delay until you get your new configuration? Some companies take anywhere from 1 hour to 2 days before your upgrade is fully available.
I think you might be right. I actually gave up on trying to figure out what was causing the issue and instead decided to do a fresh reinstall of my whole server (made sure the RAM + CPU configs were actually right this time!). Currently syncing the blockchain, and it seems to be working well in Grafana. Thanks for all the help, you two!
Ugh, never mind. Looks like it’s still dropping metrics. The node itself looks like it is working just fine, but I keep seeing gaps in the metrics like before.
Out of curiosity, what do you have your scrape interval at? Mine is the default 15 sec. Wondering if I should increase it to 30 sec and see if it makes a difference.
I wasn’t thinking it was the global scrape_interval at all. I thought that maybe you had added a scraper yourself and it was collecting too much data.
I wouldn’t worry about the global scrape_interval if it’s at the default. That’s more of an optimization issue. Just try to troubleshoot this issue first.
So, if you did add custom scrapers, remove them and see what happens. You may be ingesting metrics you don’t need. If you want to change the scrape_interval, do it in scrape_configs, so you can override the global interval for that scraper only. This is just trial and error so we can determine what is (or is not) causing this.
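A minimal sketch of what a per-job override could look like (the job name and target here are placeholders, not your actual config):
scrape_configs:
  - job_name: 'custom-exporter'    # placeholder job name
    scrape_interval: 30s           # overrides the global 15s for this job only
    static_configs:
      - targets: ['<BP IP>:9100']  # placeholder target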
Also, can you share the specifics of your upgrade? What it was and what it is now, as per the VPS. Does it say 6 Core CPU or vCPU? Are the resources shared?
There is also the possibility that your metrics are corrupt and don’t represent what is actually happening.
My current specs according to the VPS provider are 6 vCPU, 16 GB RAM, 400 GB SSD.
Going to take a further look at the settings today, but I can’t recall adding anything or having a custom scraper for my BP. I only installed the Prometheus node exporter and nothing further.
Once I get a sec later today I’ll check again
Then it probably isn’t Prometheus.
Can you check with your provider whether you are on a shared plan? vCPUs are time-sliced, and some VPS providers try to maximize capacity by sharing resources, or sell them at a lower price as shared. That type of server may not be able to handle a persistent BP node. Those drops may be due to other tenants’ load spiking and yours getting throttled.
try
grep processor /proc/cpuinfo
see what it says.
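Each "processor : N" line there is one vCPU, so counting them should match what you are paying for, e.g.:
grep -c processor /proc/cpuinfo
(or just nproc). That only confirms the count, though; whether the cores are shared or oversubscribed is something only the provider can confirm.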
Sorry for the delay. Clinic has been crazy busy.
The command shows:
Just removed the Prometheus node exporter on my BP to see if reinstalling it would help, but realized that Grafana is still pulling metrics even without it installed. Is that normal? I thought the exporter had to be running for metrics to get scraped.
Check the prometheus.yml file on the Grafana server and check which IP it is pointing to (for the BP).
This is my prometheus.yml:
global:
  scrape_interval: 15s     # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9100']
      - targets: ['<Block Prod IP>:9100']
      - targets: ['<Block Prod IP>:12798']
        labels:
          alias: 'block-producer-node'
          type: 'cardano-node'
      - targets: ['<relay node 2 IP>:9100']
      - targets: ['<relay node 2 IP>:12798']
        labels:
          alias: 'relaynode2'
          type: 'cardano-node'
      - targets: ['localhost:12798']
        labels:
          alias: 'relaynode1'
          type: 'cardano-node'
As @Alexd1985 suggested, can you confirm that the BP IPs in your prometheus.yml above are your new (upgraded) server IPs and not your old server IPs?
Also, when you check <IP.Monitoring.Server>:9090 in your client browser, under Status / Targets, what does it show?
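If the browser is awkward to reach, the same information is available from Prometheus’ HTTP API (assuming the default 9090 port):
curl -s http://<IP.Monitoring.Server>:9090/api/v1/targets
Each active target reports its health (up/down) and the last scrape error, which should show which endpoint is dropping out.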
My new upgraded servers have the same IPs as before.
Below are the results from <IP.Monitoring.Server>:9090:
Try this:
Under:
add
scrape_timeout: 10s
and
Under:
add
sample_limit: 1000
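For reference, a rough sketch of where these could sit in prometheus.yml; I’m assuming scrape_timeout under global and sample_limit on the scrape job, so adjust to your own layout:
global:
  scrape_interval: 15s      # existing setting
  scrape_timeout: 10s       # new: cap how long a single scrape may take
scrape_configs:
  - job_name: 'prometheus'
    sample_limit: 1000      # new: the scrape is marked failed if a target returns more than 1000 samples
    static_configs:
      - targets: ['<Block Prod IP>:12798']   # existing targets unchanged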
Check the memory usage metrics after a few hours and post it here.
This is pretty much my last idea for Prometheus, unless you contact Contabo and ask if this is on their end.
Thanks Neo,
I actually migrated my BP to a new server, and wouldn’t you know it, the problem is solved. I definitely think you were right that there was something going on with the server, though for the life of me I couldn’t figure out what.
Appreciate all the help. I owe you and Alex a drink (or several).