We have our stake pool up and running on mainnet. We set up the Prometheus node exporter to export system stats, and cardano-node to export metrics on its own Prometheus port. Prometheus and Grafana are both up. When we query metrics, a bunch of cardano_node_.* keys are available, so I know the node is supplying data. However, whole classes of metrics are missing - all of the cardano_node_Forge.* and cardano_node_ChainDB.* ones. As a result, a lot of panels in our Grafana dashboard aren't working.
Does anyone know why these metrics aren't being published? I have scoured forums, config files, and more to figure out what option we might have missed, but I am not finding anything useful.
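For anyone hitting the same thing, a quick way to see exactly which metric names the node exports is a hypothetical one-liner like the following (adjust host and port to your own hasPrometheus setting):

# Sort every exported metric name so missing families (Forge, ChainDB, ...) stand out.
curl -s localhost:12798/metrics | awk '{print $1}' | sort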
Did you edit the prometheus.yml (did you add all nodes there)?
If yes, do you see any data in Grafana?
Can you try this for one node that is not working: check its configuration file to see whether the Prometheus IP is set to 0.0.0.0, and if not, replace 127.0.0.1 with 0.0.0.0, save the file, and restart the node.
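For reference, the setting in question is the hasPrometheus entry in mainnet-config.json; it typically looks like this (12798 is the default port; the bind address below is what allows scraping from another host):

"hasPrometheus": [
  "0.0.0.0",
  12798
],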
Hi Alex! Thanks very much for responding. prometheus.yml contains two jobs per node (one for the node exporter, one for cardano-node), so six jobs total. Each node is on its own host. In mainnet-config.json I did have the public IP address, so I changed it to 0.0.0.0 and restarted cardano-node on the block producer. Here is my mainnet-config section:
When I curl localhost:12798/metrics, I get the 32 metrics listed above.
My prometheus.yml:
…
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "pll-block-cardano" # Gather metrics from TPR StakePool 1, Relay Server 1
    scrape_interval: 5s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['block-ip-address:12798']

  - job_name: "pll-block-node" # Gather metrics from TPR StakePool 1, Relay Server 1, Prometheus
    scrape_interval: 5s
    static_configs:
      - targets: ['block-ip-address:9100']
OK, I don't see any difference between your config and mine, other than that my Prometheus/Grafana server is a fourth machine. I changed the Prometheus IP to 0.0.0.0, but that doesn't change the fact that it is only exporting 19 "cardano.node.*" metrics on port 12798. I did find a report of this issue that said to map the old namespaces, but that doesn't solve my issue either.
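For context, the namespace mapping mentioned here lives under options.mapBackends in mainnet-config.json and routes trace namespaces to the EKG/Prometheus backend. A sketch of what it commonly contains - the exact namespace keys vary between node versions, so treat these as illustrative:

"options": {
  "mapBackends": {
    "cardano.node.metrics": [ "EKGViewBK" ],
    "cardano.node.resources": [ "EKGViewBK" ]
  }
},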
One other thing we tried: we found a note that turning on some tracers adds stats to the Prometheus output (see the Sdoutriaux reply towards the end of the thread), so we turned on ALL of the tracers that were off and restarted the node. That only added 2 more stats, and none of the Forge or ChainDB ones. Maybe the stat names have changed?
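For reference, the tracer switches are top-level booleans in mainnet-config.json. A few of the ones tied to the missing families, under the legacy tracing system (illustrative values):

"TraceChainDb": true,
"TraceForge": true,
"TraceMempool": true,
"TraceBlockFetchDecisions": true,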
I think I am onto something. I noticed, for example, that one dashboard panel reports on cardano_node_ChainDB_metrics_slotNum_int, which doesn't exist, but I did find cardano_node_metrics_slotNum_int. So the namespace change must not have been reflected in the IOHK dashboard I downloaded. However, the same is not true for these metrics:
cardano_node_metrics_Forge_node_is_leader_int - I don’t see any “leader” metrics exported.
There are other panels that use *Forge* metrics, but I can't find them by the last elements of the namespace (e.g. "node_is_leader"). So half of my issue is resolved by the fact that the ChainDB metrics were renamed.
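A hypothetical way to hunt for renamed metrics is to match on the trailing name element instead of the full dashboard namespace:

# Match on the last element of the metric name to catch renames across namespaces.
curl -s localhost:12798/metrics | grep -iE 'slotNum|leader|forge'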
Thanks for responding. No, I don't see any metrics with "forge" in the name - the complete list is above. Another key metric that was missing was the KES expiration date, but I was able to write a script that outputs it to a text file which the Prometheus node exporter textfile collector picks up, so I have a workaround. I would think the metric should be available no matter what, even if the value is 0 or null…
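For anyone wanting the same workaround, here is a minimal sketch of publishing a value via the node exporter textfile collector. Everything in it is hypothetical - the directory must match your node exporter's --collector.textfile.directory flag, and the expiry value has to be computed by your own means (e.g. from the op cert and genesis parameters):

#!/usr/bin/env bash
# Hypothetical sketch: write a KES expiry timestamp where the node exporter
# textfile collector will pick it up. Compute KES_EXPIRY_EPOCH yourself.
TEXTFILE_DIR=/var/lib/node_exporter/textfile   # assumed collector directory
KES_EXPIRY_EPOCH=0                             # placeholder: epoch seconds of KES expiry
printf 'cardano_node_kes_expiry_timestamp %s\n' "$KES_EXPIRY_EPOCH" \
  > "$TEXTFILE_DIR/kes_expiry.prom.tmp" \
  && mv "$TEXTFILE_DIR/kes_expiry.prom.tmp" "$TEXTFILE_DIR/kes_expiry.prom"

Writing to a temporary file and renaming it keeps the collector from ever reading a half-written file.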
Please run curl localhost:12798/metrics | grep forge on the block producer and post the result here; I doubt it will take you longer than a minute, even if you are sure it is useless.
Please post your mainnet-config.json file here too.
stakepool@pllblock:~/cnode/prometheus$ curl localhost:12798/metrics | grep forge
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1856    0  1856    0     0  97684      0 --:--:-- --:--:-- --:--:--  100k
stakepool@pllblock:~/cnode/prometheus$
One more point: maybe your cardano-node is using a different mainnet-config.json file than the one you are trying to edit?
Can you check what this returns:
ps -ax | grep cardano-node
You should see a long command with something like /usr/local/bin/cardano-node run ... --config /config/mainnet-config.json ...
This means that, in my case, cardano-node is using the /config/mainnet-config.json config.
So if I make changes to a different file, it will not affect my node. Maybe this is the case here?
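A hypothetical one-liner to pull just the --config argument out of the running process (assumes a Linux procps ps):

# Print '--config' and the word that follows it from the running command line.
ps -ax -o args | grep '[c]ardano-node' | tr ' ' '\n' | grep -A1 -- '--config'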
Additionally, what cardano-node version are you running?
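The node reports its own version on the host:

cardano-node version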
I would like to add that I compared your metrics to mine: I get 63 metrics and you get 41, so some are definitely missing, but I would expect all of them to show up once the correct config file is updated.
Additionally, maybe you can post the full output here if you run curl localhost:12798/metrics again?
Very good thinking. There is only one mainnet-config, so I am 100% sure the node is using it. The mapping must have been removed during all the troubleshooting; I put it back and restarted the node, but no luck. The mapping is also in both relay nodes, but they are still not exporting Forge metrics.
Relays will not have those metrics, only the block producer - at least they do not show up on my relays.
I would recommend you:
- work only on the block producer, at least while you are troubleshooting this
- make sure that your block producer has the same config as mine, posted in the previous post
- make sure that cardano-node on the block producer is actually using that config (I posted the command above which will tell you what config is in use)
If you still want to troubleshoot this, then:
- let me know what cardano-node version you are running
- share with me all the metrics you are getting from the block producer, by running curl localhost:12798/metrics
Yes, I apologize if I gave you the wrong impression. I am only working on the block producer, as I don't expect these metrics from the relays. I copied your config file into another file and did a diff between the two; there are no differences at the moment. Currently this is what we are seeing: