Missing categories of metrics in Prometheus

We have our stake pool up and running on mainnet, great. We set up Prometheus to collect system stats, and cardano-node to export metrics on its own Prometheus port. Prometheus and Grafana are both up. When we query metrics, there are a bunch of cardano_node_.* keys available, so I know the node is supplying data. However, whole classes of metrics are missing - all of the cardano_node_Forge.* and cardano_node_ChainDB.* ones - so a lot of panels in our Grafana dashboard aren't working.

Does anyone know why these metrics aren't being published? I have scoured forums, config files, and more to figure out what option we might have missed to publish these metrics, but I am not finding anything useful.

Thanks for helping out a newb!

For info, this is what the block producer is putting out on port 12798:

rts_gc_par_tot_bytes_copied 32936950728
rts_gc_num_gcs 55049
rts_gc_max_bytes_slop 25752120
cardano_node_metrics_Stat_threads_int 12
cardano_node_metrics_density_real 4.762534726815716e-2
cardano_node_metrics_epoch_int 306
rts_gc_num_bytes_usage_samples 866
cardano_node_metrics_txsInMempool_int 7
rts_gc_wall_ms 15823296
cardano_node_metrics_RTS_gcMinorNum_int 54183
rts_gc_max_bytes_used 2879303112
rts_gc_init_cpu_ms 3
rts_gc_bytes_copied 32936950728
cardano_node_metrics_txsProcessedNum_int 20403
rts_gc_init_wall_ms 8
cardano_node_metrics_RTS_gcLiveBytes_int 3841590864
cardano_node_metrics_mempoolBytes_int 18700
rts_gc_peak_megabytes_allocated 6762
cardano_node_metrics_nodeStartTime_int 1638567824
ekg_server_timestamp_ms 1638583644920
cardano_node_metrics_RTS_gcMajorNum_int 866
cardano_node_metrics_blockNum_int 6582704
rts_gc_par_max_bytes_copied 28102834424
cardano_node_metrics_Stat_cputicks_int 164515
rts_gc_current_bytes_slop 74123696
cardano_node_metrics_RTS_gcticks_int 12839
cardano_node_metrics_RTS_mutticks_int 151674
rts_gc_cpu_ms 1645141
rts_gc_bytes_allocated 1627435947784
rts_gc_current_bytes_used 3841590864
rts_gc_par_avg_bytes_copied 32936950728
cardano_node_metrics_served_header_counter_int 1423
rts_gc_mutator_cpu_ms 1516742
cardano_node_metrics_Mem_resident_int 7141474304
rts_gc_gc_wall_ms 66454
rts_gc_mutator_wall_ms 15756843
rts_gc_gc_cpu_ms 128399
rts_gc_cumulative_bytes_used 11749111352
cardano_node_metrics_slotInEpoch_int 188546
cardano_node_metrics_slotNum_int 47017346
cardano_node_metrics_RTS_gcHeapBytes_int 7090470912
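
For reference, this is roughly how I've been checking for the missing families (a simple sketch, assuming the default port 12798 set in hasPrometheus):

# count how many Forge / ChainDB metric lines are exposed (zero, in my case)
curl -s localhost:12798/metrics | grep -c -i 'forge'
curl -s localhost:12798/metrics | grep -c -i 'chaindb'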

Hi,

Did you edit prometheus.yml (did you add all the nodes there)?

If yes, do you see any data in Grafana?
Can you check the configuration file for one node which is not working - if it doesn't have the Prometheus IP set to 0.0.0.0, replace 127.0.0.1 with 0.0.0.0, save the file, and restart the node.
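
For example, to check which address the node's Prometheus port is actually bound to, something like this should work on the node host (ss is from iproute2; netstat works too):

# 0.0.0.0:12798 means it listens on all interfaces; 127.0.0.1:12798 means localhost only
sudo ss -tlnp | grep 12798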

Hi Alex! Thanks very much for responding. prometheus.yml contains two jobs per node (one for node exporter, one for cardano-node), so six jobs total. Each node is on its own host. In mainnet-config.json I did have the public IP address, so I changed it to 0.0.0.0 and restarted cardano-node on the block producer. Here is my mainnet-config section:

  ],
  "hasEKG": 12788,
  "hasPrometheus": [
    "0.0.0.0",
    12798
  ],

When I curl localhost:12798/metrics, I get the metrics listed above.

My prometheus.yml:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "pll-block-cardano" # Gather metrics from TPR StakePool 1, Relay Server 1
    scrape_interval: 5s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['block-ip-address:12798']
  - job_name: "pll-block-node"   # Gather metrics from TPR StakePool 1, Relay Server 1, Prometheus
    scrape_interval: 5s
    static_configs:
      - targets: ['block-ip-address:9100']
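
A quick way to confirm Prometheus can actually reach those targets (assuming it runs on its default port 9090 on the monitoring host) is its own targets API:

# each scraped target should report "health":"up"
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'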

I am not using this configuration… I have 1 BP + 2 relays.

Here's mine:

# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'BP_cnode'
    static_configs:
    - targets: ['PRODUCER_IP:12798']
      labels:
        instance: "Producer"
  - job_name: 'BP_node_exporter'
    static_configs:
    - targets: ['PRODUCER_IP:9091']
      labels:
        instance: "Producer"
  - job_name: 'Relay1_cnode'
    static_configs:
    - targets: ['RELAY_1_NODE_EXPORTER_IP:12798']
      labels:
        instance: "Relay1"
  - job_name: 'Relay1_node_exporter'
    static_configs:
    - targets: ['RELAY_1_NODE_EXPORTER_IP:9091']
      labels:
        instance: "Relay1"
  - job_name: 'Relay2_cnode'
    static_configs:
    - targets: ['127.0.0.1:12798']
      labels:
        instance: "Relay2"
  - job_name: 'Relay2_nexporter'
    static_configs:
    - targets: ['127.0.0.1:9091']
      labels:
        instance: "Relay2"

Relay 2 is the Grafana server in my case.

  1. On Relay 2 I have opened:

9090/tcp ALLOW 127.0.0.1
12798/tcp ALLOW 127.0.0.1
9091/tcp ALLOW 127.0.0.1

  2. On my other nodes I opened the firewall to accept connections from my Grafana server to ports 12798 and 9091 (the ufw equivalent is sketched below).
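
If you use ufw, the rules for point 2 would look roughly like this (GRAFANA_SERVER_IP is a placeholder for your monitoring host):

# allow the monitoring host to scrape cardano-node metrics and node exporter
sudo ufw allow from GRAFANA_SERVER_IP to any port 12798 proto tcp
sudo ufw allow from GRAFANA_SERVER_IP to any port 9091 proto tcp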

You must edit the config file for all nodes (the Prometheus IP must be 0.0.0.0) and restart the nodes:

"hasPrometheus": [
  "0.0.0.0",
  12798
],

Try that, and afterwards restart the services:

sudo systemctl restart grafana
sudo systemctl status grafana

sudo systemctl restart prometheus
sudo systemctl status prometheus
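
And after changing the node config, restart cardano-node itself as well - the service name depends on how you installed it (cardano-node below is just an assumption; e.g. CNTools setups call it cnode):

sudo systemctl restart cardano-node
sudo systemctl status cardano-node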

And if you want, you can use my dashboard (delete the Charity name from the end of the file/code):
https://charity-pool.ro/dashboard.txt

OK, I don't see any difference between your config and mine, other than my Prometheus/Grafana server being a 4th machine. I changed the Prometheus IP to 0.0.0.0, but that doesn't change the fact that it is only exporting 19 "cardano.node.*" metrics on port 12798. I did find a report of this issue, and it said to map the old namespaces, but that doesn't solve my issue either.

One other thing we tried: we found a note that turning on some tracers adds some stats to the Prometheus output (see Sdoutriaux's reply towards the end of that thread), so we turned on ALL of the tracers that were turned off and restarted the node. That only added 2 more stats, but none of the Forge or ChainDB ones. Maybe the stat names have changed?

I think I am onto something. I noticed, for example, that one dashboard panel reports on cardano_node_ChainDB_metrics_slotNum_int, which doesn't exist, but I did find cardano_node_metrics_slotNum_int. So the change in namespace must not have been reflected in this IOHK dashboard I downloaded. However, the same is not true for this metric:

cardano_node_metrics_Forge_node_is_leader_int - I don't see any "leader" metrics exported.

There are other panels that use *Forge* metrics, but I can't find them by searching for the last element of the name (e.g. "node_is_leader"). So half of my issue is resolved by the fact that the ChainDB metrics were renamed.
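
In case anyone else hits this: a quick-and-dirty way to fix the renamed ChainDB panels in a downloaded dashboard is to rewrite the old prefix in the dashboard JSON before importing it (back up the file first; this is just a sketch, and dashboard.json is whatever you named the export):

# map the old namespace used by the dashboard to the one the node actually exports
sed -i.bak 's/cardano_node_ChainDB_metrics_/cardano_node_metrics_/g' dashboard.json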

Hmm… can you share how the metric is defined?

Oh, I just edited my response to provide more info. Does that answer your question?

You will get most Forge metrics only when a slot is assigned to you.

cardano_node_metrics_Forge_forged_int - blocks you have produced.
cardano_node_metrics_Forge_node_is_leader_int - slots assigned to you.

I suspect that you are (like me) still far away from getting a slot.

Do you see the following metric?

cardano_node_metrics_Forge_forge_about_to_lead_int

This metric is available even if you don't have any slots assigned to you.

Thanks for responding. No, I don't see any metrics with "forge" in the name - the complete list is above. Another key metric that was missing was the KES expiration date, but I was able to write a script that outputs it to a text file which is picked up by the Prometheus node exporter textfile collector, so I have a workaround there (a rough sketch of that script is below). I would think that the Forge metrics should be available no matter what, even if the value is 0 or null…
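
The sketch, for anyone interested - the genesis values, the op cert start period, and the textfile collector path are all things you have to fill in yourself (the names below are hypothetical), and the .prom file must land in whatever directory node_exporter's --collector.textfile.directory points at:

#!/bin/bash
# Rough sketch: derive remaining KES periods from the node's slot number metric.
SLOTS_PER_KES_PERIOD=129600      # from mainnet-shelley-genesis.json
MAX_KES_EVOLUTIONS=62            # from mainnet-shelley-genesis.json
OPCERT_START_PERIOD=350          # hypothetical: the KES period your op cert was issued for
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector   # hypothetical path

# read the current slot from the node's own Prometheus endpoint
current_slot=$(curl -s localhost:12798/metrics | awk '/^cardano_node_metrics_slotNum_int/ {print $2}')
current_period=$(( current_slot / SLOTS_PER_KES_PERIOD ))
remaining=$(( OPCERT_START_PERIOD + MAX_KES_EVOLUTIONS - current_period ))

# write a metric the node exporter textfile collector will pick up
echo "cardano_kes_periods_remaining ${remaining}" > "${TEXTFILE_DIR}/cardano_kes.prom"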

I think there is a misunderstanding.

Could you please run this on your block producer:

curl localhost:12798/metrics | grep forge

Yes, I posted the curl output above. There are no forge metrics. That's my issue - I have spent several days trying to get them to appear.

Hi @jimerman !

Please run curl localhost:12798/metrics | grep forge on the block producer and put the result here. I doubt that it will take you longer than 1 minute, even if you are sure that it is useless.

Please post your mainnet-config.json file here too.

stakepool@pllblock:~/cnode/prometheus$ curl localhost:12798/metrics | grep forge
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1856    0  1856    0     0  97684      0 --:--:-- --:--:-- --:--:--  100k
stakepool@pllblock:~/cnode/prometheus$

mainnet-config.json:

{
  "AlonzoGenesisFile": "mainnet-alonzo-genesis.json",
  "AlonzoGenesisHash": "7e94a15f55d1e82d10f09203fa1d40f8eede58fd8066542cf6566008068ed874",
  "ApplicationName": "cardano-sl",
  "ApplicationVersion": 1,
  "ByronGenesisFile": "mainnet-byron-genesis.json",
  "ByronGenesisHash": "5f20df933584822601f9e3f8c024eb5eb252fe8cefb24d1317dc3d432e940ebb",
  "LastKnownBlockVersion-Alt": 0,
  "LastKnownBlockVersion-Major": 3,
  "LastKnownBlockVersion-Minor": 0,
  "MaxKnownMajorProtocolVersion": 2,
  "Protocol": "Cardano",
  "RequiresNetworkMagic": "RequiresNoMagic",
  "ShelleyGenesisFile": "mainnet-shelley-genesis.json",
  "ShelleyGenesisHash": "1a3be38bcbb7911969283716ad7aa550250226b76a61fc51cc9a9a35d9276d81",
  "TraceAcceptPolicy": true,
  "TraceBlockFetchClient": true,
  "TraceBlockFetchDecisions": true,
  "TraceBlockFetchProtocol": true,
  "TraceBlockFetchProtocolSerialised": true,
  "TraceBlockFetchServer": true,
  "TraceChainDb": true,
  "TraceChainSyncBlockServer": true,
  "TraceChainSyncClient": true,
  "TraceChainSyncHeaderServer": true,
  "TraceChainSyncProtocol": true,
  "TraceConnectionManager": true,
  "TraceDNSResolver": true,
  "TraceDNSSubscription": true,
  "TraceDiffusionInitialization": true,
  "TraceErrorPolicy": true,
  "TraceForge": true,
  "TraceHandshake": true,
  "TraceInboundGovernor": true,
  "TraceIpSubscription": true,
  "TraceLedgerPeers": true,
  "TraceLocalChainSyncProtocol": true,
  "TraceLocalErrorPolicy": true,
  "TraceLocalHandshake": true,
  "TraceLocalRootPeers": true,
  "TraceLocalTxSubmissionProtocol": true,
  "TraceLocalTxSubmissionServer": true,
  "TraceMempool": true,
  "TraceMux": true,
  "TracePeerSelection": true,
  "TracePeerSelectionActions": true,
  "TracePublicRootPeers": true,
  "TraceServer": true,
  "TraceTxInbound": true,
  "TraceTxOutbound": true,
  "TraceTxSubmissionProtocol": true,
  "TracingVerbosity": "NormalVerbosity",
  "TurnOnLogMetrics": true,
  "TurnOnLogging": true,
  "defaultBackends": [
    "KatipBK"
  ],
  "defaultScribes": [
    [
      "StdoutSK",
      "stdout"
    ]
  ],
  "hasEKG": 12788,
  "hasPrometheus": [
    "0.0.0.0",
    12798
  ],
  "minSeverity": "Info",
  "options": {
    "mapBackends": {
      "cardano.node.metrics": [
        "EKGViewBK"
      ]
    },
    "mapSubtrace": {
      "cardano.node.metrics": {
        "subtrace": "Neutral"
      }
    }
  },
  "rotation": {
    "rpKeepFilesNum": 10,
    "rpLogLimitBytes": 5000000,
    "rpMaxAgeHours": 24
  },
  "setupBackends": [
    "KatipBK"
  ],
  "setupScribes": [
    {
      "scFormat": "ScText",
      "scKind": "StdoutSK",
      "scName": "stdout",
      "scRotation": null
    }
  ]
}

Hi @jimerman !

Here is my config:

{
  "AlonzoGenesisFile": "mainnet-alonzo-genesis.json",
  "AlonzoGenesisHash": "7e94a15f55d1e82d10f09203fa1d40f8eede58fd8066542cf6566008068ed874",
  "ApplicationName": "cardano-sl",
  "ApplicationVersion": 1,
  "ByronGenesisFile": "mainnet-byron-genesis.json",
  "ByronGenesisHash": "5f20df933584822601f9e3f8c024eb5eb252fe8cefb24d1317dc3d432e940ebb",
  "LastKnownBlockVersion-Alt": 0,
  "LastKnownBlockVersion-Major": 3,
  "LastKnownBlockVersion-Minor": 0,
  "MaxKnownMajorProtocolVersion": 2,
  "Protocol": "Cardano",
  "RequiresNetworkMagic": "RequiresNoMagic",
  "ShelleyGenesisFile": "mainnet-shelley-genesis.json",
  "ShelleyGenesisHash": "1a3be38bcbb7911969283716ad7aa550250226b76a61fc51cc9a9a35d9276d81",
  "TraceAcceptPolicy": true,
  "TraceBlockFetchClient": true,
  "TraceBlockFetchDecisions": true,
  "TraceBlockFetchProtocol": true,
  "TraceBlockFetchProtocolSerialised": true,
  "TraceBlockFetchServer": true,
  "TraceChainDb": true,
  "TraceChainSyncBlockServer": true,
  "TraceChainSyncClient": true,
  "TraceChainSyncHeaderServer": true,
  "TraceChainSyncProtocol": true,
  "TraceConnectionManager": true,
  "TraceDNSResolver": true,
  "TraceDNSSubscription": true,
  "TraceDiffusionInitialization": true,
  "TraceErrorPolicy": true,
  "TraceForge": true,
  "TraceHandshake": true,
  "TraceInboundGovernor": true,
  "TraceIpSubscription": true,
  "TraceLedgerPeers": true,
  "TraceLocalChainSyncProtocol": true,
  "TraceLocalErrorPolicy": true,
  "TraceLocalHandshake": true,
  "TraceLocalRootPeers": true,
  "TraceLocalTxSubmissionProtocol": true,
  "TraceLocalTxSubmissionServer": true,
  "TraceMempool": true,
  "TraceMux": true,
  "TracePeerSelection": true,
  "TracePeerSelectionActions": true,
  "TracePublicRootPeers": true,
  "TraceServer": true,
  "TraceTxInbound": true,
  "TraceTxOutbound": true,
  "TraceTxSubmissionProtocol": true,
  "TracingVerbosity": "NormalVerbosity",
  "TurnOnLogMetrics": true,
  "TurnOnLogging": true,
  "defaultBackends": [
    "KatipBK"
  ],
  "defaultScribes": [
    [
      "StdoutSK",
      "stdout"
    ]
  ],
  "hasEKG": 12788,
  "hasPrometheus": [
    "0.0.0.0",
    12798
  ],
  "minSeverity": "Info",
  "options": {
    "mapBackends": {
      "cardano.node.metrics": [
        "EKGViewBK"
      ],
      "cardano.node.resources": [
        "EKGViewBK"
      ]
    },
    "mapSubtrace": {
      "cardano.node.metrics": {
        "subtrace": "Neutral"
      }
    }
  },
  "rotation": {
    "rpKeepFilesNum": 10,
    "rpLogLimitBytes": 5000000,
    "rpMaxAgeHours": 24
  },
  "setupBackends": [
    "KatipBK"
  ],
  "setupScribes": [
    {
      "scFormat": "ScText",
      "scKind": "StdoutSK",
      "scName": "stdout",
      "scRotation": null
    }
  ]
}

Seems you are missing:

      "cardano.node.resources": [
        "EKGViewBK"
      ]

If you look into the config from the official repo, it still has this part.

Maybe this is why you are missing those metrics?

One more point: maybe your cardano-node is using a different mainnet-config.json file than the one you are trying to edit?

Can you check what this returns:

ps -ax | grep cardano-node

You should see a long command with something like /usr/local/bin/cardano-node run ... --config /config/mainnet-config.json ...

This means that in my case cardano-node is using the /config/mainnet-config.json config.

So if I make changes to another file, it will not affect my node. Maybe that is the case here?
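
And once you know which file the running process is using, you can confirm the resources mapping is actually in it, e.g. (adjust the path to whatever ps shows for your node):

grep -A 2 '"cardano.node.resources"' /config/mainnet-config.json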

Additionally, what cardano-node version are you running?

I would like to add that I compared your metrics to mine: I got 63 metrics and you got 41, so some are definitely missing, but I would expect all of them to show up once the correct config file is updated.

Additionally, maybe you can post the full output here if you run curl localhost:12798/metrics again?
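
(For the comparison, the counts came from something like this - the port again being whatever hasPrometheus is set to:)

curl -s localhost:12798/metrics | wc -l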

Very good thinking. There is only one mainnet-config file, so I am 100% sure it is using it. The mapping must have been removed during all the troubleshooting; I put it back and restarted the node, but no luck. The mapping is in both relay nodes' configs too, but they are still not exporting Forge metrics.

Hi @jimerman

Relays will not have those metrics, only the block producer - at least they do not show up on my relays.

I would recommend you:

  1. work only on the block producer, at least while you are troubleshooting this
  2. make sure that your block producer has the same config as mine, posted in the previous post
  3. make sure that cardano-node on the block producer is actually using that config (I posted the command above which will tell you which config is in use)

If you still want to troubleshoot this after that, then:

  4. let me know what cardano-node version you are running
  5. share with me all the metrics you are getting from the block producer, by running curl localhost:12798/metrics (the exact commands are sketched after this list)
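
For points 4 and 5, run these on the block producer:

cardano-node version
curl -s localhost:12798/metrics | sort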

Yes, I apologize if I gave you the wrong impression. I am only working on the block producer, as I don't expect these metrics from the relays. I copied and pasted your config file into another file and did a diff between the two. There are no differences at the moment. Currently this is what we are seeing:

rts_gc_par_tot_bytes_copied 536604898928
rts_gc_num_gcs 92993
rts_gc_max_bytes_slop 47225864
cardano_node_metrics_served_block_count_int 1276
cardano_node_metrics_submissions_submitted_count_int 126180
cardano_node_metrics_Stat_threads_int 15
cardano_node_metrics_density_real 4.912819144204252e-2
cardano_node_metrics_epoch_int 307
rts_gc_num_bytes_usage_samples 940
cardano_node_metrics_txsInMempool_int 25
rts_gc_wall_ms 126152725
cardano_node_metrics_RTS_gcMinorNum_int 92049
rts_gc_max_bytes_used 3575378368
rts_gc_init_cpu_ms 2
rts_gc_bytes_copied 536604898928
cardano_node_metrics_txsProcessedNum_int 102129
rts_gc_init_wall_ms 1
cardano_node_metrics_RTS_gcLiveBytes_int 3584550480
cardano_node_metrics_mempoolBytes_int 31666
rts_gc_peak_megabytes_allocated 10287
cardano_node_metrics_nodeStartTime_int 1639012364
ekg_server_timestamp_ms 1639138515054
cardano_node_metrics_RTS_gcMajorNum_int 940
cardano_node_metrics_blockNum_int 6609648
rts_gc_par_max_bytes_copied 376139039584
cardano_node_metrics_Stat_cputicks_int 3273482
cardano_node_metrics_submissions_accepted_count_int 102145
rts_gc_current_bytes_slop 32570144
cardano_node_metrics_RTS_gcticks_int 2746651
cardano_node_metrics_RTS_mutticks_int 526831
rts_gc_cpu_ms 32735811
rts_gc_bytes_allocated 2818396830136
cardano_node_metrics_submissions_rejected_count_int 24006
cardano_node_metrics_connectedPeers_int 2
rts_gc_current_bytes_used 3591275744
rts_gc_par_avg_bytes_copied 536604898928
cardano_node_metrics_served_header_counter_int 10233
rts_gc_mutator_cpu_ms 5268444
cardano_node_metrics_Mem_resident_int 7288967168
rts_gc_gc_wall_ms 55619792
rts_gc_mutator_wall_ms 70532932
rts_gc_gc_cpu_ms 27467367
rts_gc_cumulative_bytes_used 243480938504
cardano_node_metrics_slotInEpoch_int 311421
cardano_node_metrics_slotNum_int 47572221
cardano_node_metrics_RTS_gcHeapBytes_int 10786701312

Hi @jimerman

What cardano-node version are you running?