Network delay calculation based on logs

Hi There!

Recently my pool missed to create a valid block. and the reason was that the block was produced not on the highest block but on the previous one. So I tried to calculate what was the delay time between the highest block production time and the actual time when the pool was received it.
so here is the command - right now it only works on text format, not json format:

grep -h "new tip" node.log | sed -rn "s/\[(.*)\].\[(.*)\].* ([0-9]+)/\2,\3/p" | awk -F, -v threshold=2 '{"date +%s.%3N -d\""$1"\""|getline actualTime; delay=actualTime-$2-1591566291; sum+=delay; if (delay > threshold ) print $1 " " delay;} END {print "AVG=", sum/NR}'

and here is an example line which will be matched:
[dell:cardano.node.ChainDB:Notice:265] [2021-03-25 21:04:33.57 UTC] Chain extended, new tip: 720267abd1f74bfcc9eb6a107ffe09c95188e371d6b6fd17907771e47c66d5bc at slot 25139982

details of the command

  • grep -h "new tip" node.log - find the line which contains “new tip” in file node.log
  • sed -rn "s/\[(.*)\].\[(.*)\].* ([0-9]+)/\2,\3/p" - extract actual time (2021-03-25 21:04:33.57 UTC) and slot (25139982) and print the numbers between comma
  • awk -F, - use comma as splitter
  • -v threshold=2 - define threshold, which is 2 sec
  • date +%s.%3N -d”"$1"""|getline actualTime - convert actual time to sec (1st column)
  • delay=actualTime-$2-1591566291 calculate delay by extracting actual time and slot time (1591566291 is a kind of magic number - it tells what was the time in sec at slot 0)
  • sum+=delay - collect all the delay times
  • if (delay > threshold ) print $1 " " delay - print actual time and delay if the delay is more than the threshold
  • END {print “AVG=”, sum/NR} - print the average delay time at the end…

so the out of this command when executed on logs of my block producer node:

2021-03-21 22:14:05.65 UTC 2.65
2021-03-21 22:17:33.26 UTC 4.26
2021-03-21 22:40:25.05 UTC 3.05
2021-03-21 22:55:47.29 UTC 27.29
2021-03-21 23:14:01.77 UTC 4.77
2021-03-21 23:15:38.31 UTC 2.31
2021-03-21 23:32:10.47 UTC 2.47
2021-03-21 23:33:27.59 UTC 3.59
2021-03-21 23:47:22.26 UTC 16.26
2021-03-21 23:51:37.23 UTC 17.23
2021-03-21 23:56:10.97 UTC 4.97
AVG= 0.684816

so the node gets the blocks with an average of 0.685 milisec delay after it was produced.
guess what - my block was created when the delay was ~16 sec… at 2021-03-21 23:47:22.26 UTC

I think this can be an other metric or a performance indicator of a node.
So - what is the network delay of you pool? Would be curious what is the number of others!

yes, would be easy to generate standard deviation - will let you know if it is ready.
Just a quick update on my node performance - now it is way more better:
AVG= 0.27719 compared to recent 0.685

And that this the script for json format:
grep newtip node.log | jq -r '.at + "," + .data.newtip[65:]' | awk -F, -v threshold=2 '{"date +%s.%3N -d\""$1"\""|getline actualTime; delay=actualTime-$2-1591566291; sum+=delay; if (delay > threshold ) print $1 " " delay;} END {print "AVG=", sum/NR}'

this is an example of a json format line:

{"thread":"331","loc":null,"data":{"newtip":"b74656b1735655c82e78894bfa811933357ec3cdee58b02675726ab559efb063@26067186","kind":"TraceAddBlockEvent.SwitchedToAFork"},"sev":"Notice","env":"1.26.2:35312","msg":"","app":[],"host":"dell","pid":"11547","ns":["cardano.node.ChainDB"],"at":"2021-05-08T01:13:22.33Z"}

and this is an updated version of text format, which can handle journalctl log format
example of a journalctl line:

May 06 21:05:06 ip-172-xxx-xxx-xxx cardano-node[555]: [ip-172-xxx:cardano.node.ChainDB:Notice:63] [2021-05-07 04:05:06.27 UTC] Chain extended, new tip: 3d82521d578f5b96f9845560c0da7be2114830d733b3a760c1484482c453c1bc at slot 28794015

command to execute:

sudo journalctl -u cardano-node.service > node.log
grep -h "new tip" node.log | sed -rn "s/.*\[(.*)\].* ([0-9]+)/\1,\2/p" | awk -F, -v threshold=2 '{"date +%s.%3N -d\""$1"\""|getline actualTime; delay=actualTime-$2-1591566291; sum+=delay; if (delay > threshold ) print $1 " " delay;} END {print "AVG=", sum/NR}'

1 Like

Thanks @laplasz for sharing, and thanks again for your overall contributions to the forums. I found your topic here after searching through the different topics about missed slots (I’ve been encountering some). I had to tweak the command to limit the piped output to a 1000 lines (I was getting a “too many files open” error when I ran it as-is)

My numbers look better than your initial numbers on April 1, but not as good as your April 27 numbers. Did you make any changes to get that improvement? Here are my numbers
xxxxxxx@bp-node:~/cardano-node$ ./check_delay.sh
2021-06-19 14:08:01.25 UTC 2.25
AVG= 0.31306

Missed slots in the 16 hours since restart: 6

Running 2 CPU, 16GB RAM, dedicated servers, not shared. Tracemempool false

1 Like

Hi!

This is the current stat:

2021-06-19 19:28:38.25 UTC 2.25
AVG= 0.275593

and current incoming connection number is 26…
netstat -an|awk "\$4 ~ /3001/"|grep -c ESTABLISHED
where 3001 is the listening port of the node

What I am experiencing is more incoming connections → less delay - so how many connections do you have?

just two connections coming into the bp node. both from my relay nodes. I thought the bp node should be shutoff to all but our relays

It’s been about an hour since I resized the instance from 2 cpu to 4cpu and restarted Current stat

2021-06-19 19:28:38.47 UTC 2.47
2021-06-19 19:57:06.40 UTC 355.4
AVG= 0.65735

Hi!

sorry, yes - those stats from my relays.
Actually the stats of your BP is good. 0.31 is a good value.
So that is interesting why there are missed slots even with this stats… perhaps upgrading the machine is a good test - and if it helps - let us know

An update: when I initially ran the commands to calculate any delays, these were on ubuntu 21.04 instances. I have since moved to Contabo virtual dedicated servers, choosing to go back to the long-term support version of ubuntu, 20.04.2. Felt safer doing that. For some reason, the script wouldn’t work on 20.04.02. Didn’t seem to like the "date +%s.%3N -d\""$1"\"" call.

/bin/sh: 1: Syntax error: Unterminated quoted string

I posted a question at Stackoverflow, and ended up redoing the awk portion of the script, using mktime instead of date. It works now, but it doesn’t have visibility to the milliseconds, but that’s fine.

~/cardano-node$ ./check_delay.sh
2021-07-12 22:52:01.22 UTC 5
2021-07-13 02:34:06.03 UTC 2
AVG= 0.014
grep -h "new tip" logs/node.log | tail -1000 | sed -rn "s/\[(.*)\].\[(.*)\].* ([0-9]+)/\2,\3/p" | mawk -F, -v threshold=2 -f check_delay.awk

check_delay.awk:
BEGIN{ ENVIRON["TZ"] = "UTC"; }
{
  year = substr($1,1,4);
  month = substr($1,6,2);
  day = substr($1,9,2);
  hour = substr($1,12,2);
  min = substr($1,15,2);
  sec = substr($1,18,2);
  timestamp = year" "month" "day" "hour" "min" "sec;
  actualTime=mktime(timestamp) + 7200;
        delay=actualTime-$2-1591566291;
        sum+=delay;
        if (delay >= threshold )
                print $1 " " delay;
}
END {print "AVG=", sum/NR}

Added the 7200 to the calculation to account for two hour difference between UTC of the log date stamps and CEST used by mktime (server in Germany). Couldn’t seem to force mktime to think in UTC