unfortunately, no.
Not now. When you see gaps in Grafana,
try journalctl -e -f -u cardano-node | grep kill
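If that turns up nothing, it might also be worth checking the kernel log for the OOM killer (a quick sketch, assuming you can read kernel messages via journald):
journalctl -k | grep -i "out of memory"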
still nothing unfortunately
PS: thanks for all the help, really appreciate it!
Well, I’m out of ideas. Just turned off mempool tracing, which fixes the dropped metrics, but I’m not a fan of not being able to see the tx count.
Did you manually change anything in Prometheus on your old server that needs to change now on the new server? Added scrape targets? Lowered memory? If Prometheus has too little memory or goes into rushed mode, it may explain that graph.
Also, is it possible that you bought a 16 GB VPS and they gave you less, or that it will only switch to the new VPS config next payment cycle, or that there is some delay until you get your new configuration? Some companies take anywhere from 1 hour to 2 days before your upgrade is fully available.
I think you might be right. I actually gave up on trying to figure out what was causing the issue and instead decided to do a fresh reinstall of my whole server (made sure the RAM + CPU configs were actually right this time!). Currently syncing the blockchain, and it seems to be working well in Grafana. Thanks for all the help, you two!
Ugh, never mind. Looks like it’s still dropping metrics. The node itself looks like it is working just fine, but I keep seeing gaps in the metrics like before.
Out of curiosity, what do you have your scrape interval at? Mine is the default 15 sec. Wondering if I should increase it to 30 sec and see if it makes a difference.
I wasn’t thinking it was the global scrape_interval at all. I thought that maybe you had added a scraper yourself and it was collecting too much data.
I wouldn’t worry about the global scrape_interval if it’s at the default. That’s more of an optimization issue. Just try to troubleshoot this issue first.
So, if you did add custom scrapers, remove them and see what happens. You may be ingesting metrics you don’t need. If you want to change the scrape_interval, do it in scrape_configs, so you can override the global interval for that scraper only. This is just trial and error so we can determine what is (or is not) causing this.
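A minimal sketch of what a per-job override could look like (the job name and target here are placeholders, not your actual config):
scrape_configs:
  - job_name: 'custom-exporter'    # placeholder job name
    scrape_interval: 30s           # overrides the global 15s for this job only
    static_configs:
      - targets: ['<BP IP>:9100']  # placeholder target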
Also, can you share the specifics of your upgrade? What it was and what it is now, as per the VPS. Does it say 6 Core CPU or vCPU? Are the resources shared?
There is also the possibility that your metrics are corrupt and don’t represent what is actually happening.
My current specs according to the VPS provider are 6 vCPU, 16 GB RAM, 400 GB SSD.
Going to take a further look at the settings today, but I can’t recall adding anything or having a custom scraper for my BP. I only installed the Prometheus node exporter and nothing further.
Once I get a sec later today I’ll check again
Then it probably isn’t Prometheus.
Can you check with your provider whether you are on a shared plan? vCPUs are time-sliced, and some VPS providers try to maximize capacity by sharing resources, or sell them at a lower price as shared. That type of server may not be able to handle a persistent BP node. Those drops may be due to other tenants’ load spiking and yours getting throttled.
try
grep processor /proc/cpuinfo
see what it says.
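Each "processor : N" line there is one vCPU, so counting them should match what you are paying for, e.g.:
grep -c processor /proc/cpuinfo
(or just nproc). That only confirms the count, though; whether the cores are shared or oversubscribed is something only the provider can confirm.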
Sorry for the delay. Clinic has been crazy busy.
The command shows:
Just removed the Prometheus node exporter on my BP to see if reinstalling it would help, but realized that Grafana is still pulling metrics even without it installed. Is that normal? I thought the exporter had to be running for metrics to get scraped.
Check the prometheus.yml file on the Grafana server and check which IP it is pointing to (for the BP).
This is my prometheus.yml:
global:
  scrape_interval: 15s     # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9100']
      - targets: ['<Block Prod IP>:9100']
      - targets: ['<Block Prod IP>:12798']
        labels:
          alias: 'block-producer-node'
          type: 'cardano-node'
      - targets: ['<relay node 2 IP>:9100']
      - targets: ['<relay node 2 IP>:12798']
        labels:
          alias: 'relaynode2'
          type: 'cardano-node'
      - targets: ['localhost:12798']
        labels:
          alias: 'relaynode1'
          type: 'cardano-node'
As @Alexd1985 suggested, can you confirm that the BP IPs in your prometheus.yml above are your new (upgraded) server IPs and not your old server IPs?
Also, when you check <IP.Monitoring.Server>:9090 in your client browser, under Status / Targets, what does it show?
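If the browser is awkward to reach, the same information is available from Prometheus’ HTTP API (assuming the default 9090 port):
curl -s http://<IP.Monitoring.Server>:9090/api/v1/targets
Each active target reports its health (up/down) and the last scrape error, which should show which endpoint is dropping out.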
My new upgraded servers have the same IPs as before.
Below are the results from <IP.Monitoring.Server>:9090:
Try this:
Under:
add
scrape_timeout: 10s
and
Under:
add
sample_limit: 1000
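For reference, a rough sketch of where these could sit in prometheus.yml; I’m assuming scrape_timeout under global and sample_limit on the scrape job, so adjust to your own layout:
global:
  scrape_interval: 15s      # existing setting
  scrape_timeout: 10s       # new: cap how long a single scrape may take
scrape_configs:
  - job_name: 'prometheus'
    sample_limit: 1000      # new: the scrape is marked failed if a target returns more than 1000 samples
    static_configs:
      - targets: ['<Block Prod IP>:12798']   # existing targets unchanged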
Check the memory usage metrics after a few hours and post it here.
This is pretty much my last idea for Prometheus, unless you contact Contabo and ask if this is on their end.
Thanks Neo,
I actually migrated my BP to a new server, and wouldn’t you know it, the problem is solved. I definitely think you were right that there was something going on with the server, though for the life of me I couldn’t figure out what.
Appreciate all the help. I owe you and Alex a drink (or several).