Useful debugging tools for pool operators

:construction: This is just a thread to document useful things I do when operating a pool. The aim is to provide a wealth of information for pool operators to debug issues such as missed slots. I’ll start the thread with some CNODE-related workflows.

Please share any other useful processes in the comments.

CNODE (guild operators)


Networking:

Use cncli ping from hostA to hostB to ensure ports are accessible
cncli ping --host <IP> --port <PORT>
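When there are several relays or producers to check, the same command can be looped over a list of peers. This is only a minimal sketch: the function name `ping_peers`, the `CNCLI` variable and the "host port" input format are made up for illustration, and it uses nothing beyond the `cncli ping --host ... --port ...` form shown above. It prints the raw output for each peer without interpreting it.

```shell
# Path to the cncli binary; override if it is not on PATH (assumption).
CNCLI="${CNCLI:-cncli}"

# Read "host port" pairs from stdin, one per line, and ping each one.
# Blank lines and lines starting with '#' are skipped.
ping_peers() {
  while read -r host port; do
    case "$host" in
      ''|'#'*) continue ;;
    esac
    echo "== ${host}:${port}"
    # </dev/null stops the command from consuming the peer list on stdin.
    "$CNCLI" ping --host "$host" --port "$port" </dev/null
  done
}

# Example usage (peers.txt is a hypothetical file with one "host port" per line):
# ping_peers < peers.txt
```

Running it from each host in turn (hostA to hostB, then hostB to hostA) gives a quick picture of which directions are actually open.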

Block troubleshooting:

cd /opt/cardano/cnode/guild-db/blocklog/
sqlite3 blocklog.db "SELECT * FROM blocklog WHERE epoch=370"

Example output

503|74572767|"2022-10-19T00:24:18+00:00"|370|7901572|95967|ea5d0dcbd488e0b225164bd948888d53774ea04fa5618cbce6822f95632d2553|60022|confirmed
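The raw row is hard to read without column names. As a small sketch, `sqlite3` can print a header line with its `-header` option, which avoids guessing what each field means. The function name `show_epoch_blocks` is made up for this example; the table name and the `epoch` filter are the ones from the query above.

```shell
# Print all blocklog rows for one epoch, preceded by a line of column names.
# Usage: show_epoch_blocks <path-to-blocklog.db> <epoch>
show_epoch_blocks() {
  db="$1"
  epoch="$2"
  # Only accept a plain number so the value is safe to place in the query.
  case "$epoch" in
    ''|*[!0-9]*) echo "epoch must be a number" >&2; return 1 ;;
  esac
  sqlite3 -header "$db" "SELECT * FROM blocklog WHERE epoch=${epoch};"
}

# Example (path as used earlier in this thread):
# show_epoch_blocks /opt/cardano/cnode/guild-db/blocklog/blocklog.db 370
```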

Notice the output contains a date. Convert it to the format below to check the cnode logs for that time.
:warning: Please note timezones can affect this value (see the alternative if you are unsure)

sudo journalctl --unit=cnode.service --since "2022-10-19 00:24:18" -n 20
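The timestamp in blocklog carries a `+00:00` offset (UTC), while `journalctl --since` reads a bare timestamp as local time, which is where the timezone warning comes from. One way to avoid converting by hand, sketched here on the assumption that GNU `date` is available (the helper name `to_journal_time` is made up for this example):

```shell
# Convert an ISO 8601 timestamp with offset into "YYYY-MM-DD HH:MM:SS"
# in the machine's local timezone (needs GNU date for the -d option).
to_journal_time() {
  date -d "$1" "+%Y-%m-%d %H:%M:%S"
}

since=$(to_journal_time "2022-10-19T00:24:18+00:00")
echo "${since}"

# The result can then be used in the command above:
# sudo journalctl --unit=cnode.service --since "${since}" -n 20
```

On a server whose clock is set to UTC the output is the same time as in blocklog; on any other timezone it is shifted accordingly.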

Alternatively, you can specify the slot
:warning: The following command can take some time to execute

sudo journalctl --unit=cnode.service | grep 74572767
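Grepping the whole journal is slow mainly because every entry has to be read. If the block's timestamp is known from blocklog, the search can be limited to a short window around it with `--since` and `--until`. A rough sketch, assuming GNU `date`; the helper name `journal_window` and the 60 second margin are arbitrary choices for illustration:

```shell
# Print two lines: the start and end of a window of +/- <margin> seconds
# around an ISO 8601 timestamp, in local time as journalctl expects.
journal_window() {
  ts=$(date -d "$1" +%s)
  margin="${2:-60}"
  date -d "@$((ts - margin))" "+%Y-%m-%d %H:%M:%S"
  date -d "@$((ts + margin))" "+%Y-%m-%d %H:%M:%S"
}

window=$(journal_window "2022-10-19T00:24:18+00:00" 60)
since=$(echo "${window}" | sed -n 1p)
until_ts=$(echo "${window}" | sed -n 2p)
echo "since=${since} until=${until_ts}"

# sudo journalctl --unit=cnode.service --since "${since}" --until "${until_ts}" | grep 74572767
```

The window still depends on the timestamp being converted correctly, so the unbounded grep remains the fallback if nothing turns up.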

Noticed some useful content in this thread

Out of interest, what is creating the “backlog.db” sqlite database? Is that database being filled with leaderlog data generated by the cncli tool? Could you please point me to the script that generates the data?

I think you meant blocklog (not backlog) and it’s the cncli.sh script found at /opt/cardano/cnode/scripts/cncli.sh (if you use CNODE) :slight_smile:

OK, I see it now. It is in cncli.sh script: https://raw.githubusercontent.com/cardano-community/guild-operators/alpha/scripts/cnode-helper-scripts/cncli.sh

Function: cncliLeaderlog() (146 lines long). Within this function is the following:

    if [[ $(jq -r .status <<< "${cncli_leaderlog}") != ok ]]; then
      error_msg=$(jq -r .errorMessage <<< "${cncli_leaderlog}")
      if [[ "${error_msg}" = "Query returned no rows" ]]; then
        echo "No leader slots found for epoch ${curr_epoch} :("
      else
        echo "ERROR: failure in leaderlog while running:"
        echo "${CNCLI} leaderlog --consensus ${consensus} --db ${CNCLI_DB} --byron-genesis ${BYRON_GENESIS_JSON} --shelley-genesis ${GENESIS_JSON} --ledger-set current ${stake_param_current} --pool-id ${POOL_ID} --pool-vrf-skey ${POOL_VRF_SKEY} --tz UTC"
        echo "Error message: ${error_msg}"
        exit 1
      fi
    else
      epoch_nonce=$(jq -r '.epochNonce' <<< "${cncli_leaderlog}")
      pool_id=$(jq -r '.poolId' <<< "${cncli_leaderlog}")
      sigma=$(jq -r '.sigma' <<< "${cncli_leaderlog}")
      d=$(jq -r '.d' <<< "${cncli_leaderlog}")
      epoch_slots_ideal=$(jq -r '.epochSlotsIdeal //0' <<< "${cncli_leaderlog}")
      max_performance=$(jq -r '.maxPerformance //0' <<< "${cncli_leaderlog}")
      active_stake=$(jq -r '.activeStake //0' <<< "${cncli_leaderlog}")
      total_active_stake=$(jq -r '.totalActiveStake //0' <<< "${cncli_leaderlog}")
      sqlite3 ${BLOCKLOG_DB} <<-EOF
				UPDATE OR IGNORE epochdata SET epoch_nonce = '${epoch_nonce}', sigma = '${sigma}', d = ${d}, epoch_slots_ideal = ${epoch_slots_ideal}, max_performance = ${max_performance}, active_stake = '${active_stake}', total_active_stake = '${total_active_stake}'
				WHERE epoch = ${curr_epoch} AND pool_id = '${pool_id}';
				INSERT OR IGNORE INTO epochdata (epoch, epoch_nonce, pool_id, sigma, d, epoch_slots_ideal, max_performance, active_stake, total_active_stake)
				VALUES (${curr_epoch}, '${epoch_nonce}', '${pool_id}', '${sigma}', ${d}, ${epoch_slots_ideal}, ${max_performance}, '${active_stake}', '${total_active_stake}');
				EOF

This updates a sqlite database, “$BLOCKLOG_DB”, originally created by the same script (function createBlocklogDB).

So this script stores the stake pool leaderlogs in a sqlite database instead of just saving the data as a flat file. This script, or others, can then use the “blocklog” database alongside the logs, or perhaps the cncli database, to look at which blocks the chain adopted if the pool operator identifies a problem.
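For anyone wondering why the excerpt runs an UPDATE OR IGNORE followed by an INSERT OR IGNORE: together they behave like an "upsert", so the script can be run repeatedly for the same epoch without creating duplicate rows. Here is a toy sketch of that pattern on a throwaway table. The table `demo`, its columns and the UNIQUE(epoch, pool_id) constraint are my own assumptions for illustration and are not copied from createBlocklogDB; INSERT OR IGNORE only skips a row when such a constraint would be violated.

```shell
# Run the UPDATE-then-INSERT pair twice against the same key and show
# that a single row remains, holding the most recent value.
# Usage: upsert_demo <path-to-scratch.db>
upsert_demo() {
  db="$1"
  sqlite3 "$db" "CREATE TABLE IF NOT EXISTS demo (
                   epoch INTEGER, pool_id TEXT, sigma TEXT,
                   UNIQUE(epoch, pool_id));"
  for sigma in 0.001 0.002; do
    # First pass: UPDATE matches nothing, INSERT adds the row.
    # Second pass: UPDATE changes sigma, INSERT is ignored (key exists).
    sqlite3 "$db" "UPDATE OR IGNORE demo SET sigma = '${sigma}'
                     WHERE epoch = 370 AND pool_id = 'pool1';
                   INSERT OR IGNORE INTO demo (epoch, pool_id, sigma)
                     VALUES (370, 'pool1', '${sigma}');"
  done
  sqlite3 "$db" "SELECT epoch, pool_id, sigma FROM demo;"
}

# Example: upsert_demo /tmp/upsert-demo.db
```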

Do pool operators set a cron job to automatically call that script every 5 days or is it intended to be run manually?

Also is there a script that automatically checks the chain blocks in the cncli database and emails the operator if a block is missed?

Is there another script that automates checking the logs if a missed block is noticed?