How to compute average latencies from histograms

How to compute average latencies from histograms

Background

Aerospike produces latency histograms in a specific form. If you require to get average latencies, these need to be consolidated.

Note that reading latency histograms is usually a better and more precise indicator than dealing with averages, which could obscure some issues (long tails). Averages aren’t always indicative of an issue, but do give a different, and consolidated overview.

Theory on Histograms and averages

The histograms look as follows:

Dec 11 09:08:02 test.box asd[146902]: Dec 11 2018 09:08:02 GMT-0500: INFO (info): (hist.c:240) histogram dump: {test}-write (336141385 total) msec
Dec 11 09:08:02 test.box asd[146902]: Dec 11 2018 09:08:02 GMT-0500: INFO (info): (hist.c:257)  (00: 0230265386) (01: 0011678330) (02: 0000662025) (03: 0000252220)
Dec 11 09:08:02 test.box asd[146902]: Dec 11 2018 09:08:02 GMT-0500: INFO (info): (hist.c:257)  (04: 0000331144) (05: 0000861966) (06: 0012075105) (07: 0025383252)
Dec 11 09:08:02 test.box asd[146902]: Dec 11 2018 09:08:02 GMT-0500: INFO (info): (hist.c:257)  (08: 0026520515) (09: 0022199754) (10: 0005834822) (11: 0000075925)
Dec 11 09:08:02 test.box asd[146902]: Dec 11 2018 09:08:02 GMT-0500: INFO (info): (hist.c:266)  (12: 0000000935) (14: 0000000006)

This would be a ‘write’ histogram for the ‘test’ namespace. It is split across multiple lines. Therefore, any parsing will have to take into account the multiline nature of the histogram.

The 00: ... indicates 0th histogram - number of writes. Therefore, in this example we have 0230265386 writes in 0th histogram. The 0th means 2^0 (two to the power of 0), which is up to 1 ms.

The 01: ... would be number of writes which completed in 2^1 = up to 2 ms.

The 02: ... would be number of writes which completed in 2^2 = up to 4 ms.

The 03: ... would be number of writes which completed in 2^3 = up to 8 ms.

And so forth. In order to compute averages, you must add all writes counts, and total ms, and then divide these.

Looking at a simplier histogram: (00: 0000000010) (01: 0000000005) (02: 0000000002)

To calculate this:

Total ms: 2^0*10 + 2^1*5 + 2^2*2 = 1*10 + 2*5 + 4*2 = 28

Total writes: 10+5+2 = 17

Therefore average ms per write would be: 28/17 = up to 1.65

In order to calculate the total for a whole file, you would need to do this across multiple histograms for that {namespace}-histogram and generate an average from that.

Note, importantly, that the histograms are cumulative - i.e. number of transactions since last restart of asd. As such, to make the above calculations possible, for any given namespace-histogram pair, you need to calculate the delta between previous and current histogram and use the delta. This gives the histogram for transactions processed in the last 10 seconds (as histograms print every 10 seconds).

A python script which does just that, including calculating for an existing file and a ‘tail’ like feature

A python script has been created for just that. It comprises of a self-contained class with all the features, and a main code, which allows the script to be run from command line, with arguments. Feel free to either download and run the script (run --help for usage insutructions), or import the AvgHist class into your script to use internally.

avglatency.py

Usage

Usage: avglatency.py [-n|--namespace=] [-o|--histogram=] [-p|--percentile=] [-r|--run] [-t|--tail] [-d|--no-detail] [-s|--no-summary] [-f|--file=] [-h|--help]

	n | namespace	  if set, will only show histograms for that namespace
	o | histogram	  if set, will only show that histogram
	p | percentile	  if set, will discard slowest transactions, only count up to this percent of total
	r | run		  if set, will run through the whole file. If run and tail are not set, run will execute by default
	t | tail	  if set, will tail the file. If run and tail are not set, run will execute by default
	i | ignore-first  if set, ignore each first histogram for each namespace (since they are cumulative)
	d | no-detail	  if set, will not produce per histogram tick detail, only summary
	s | no-summary	  if set, will not produce summary at the end of the run/tail
	f | file	  if set, will process this file. Default: /var/log/aerospike.log
	h | help	  this help screen

Example output from the script when running (run or tail):

TIMESTAMP                      NAMESPACE       HISTOGRAM            TRANSACTIONS   <AVERAGE        MAX TRANSACTION RANGE(ms)
Feb 05 2019 17:06:54 GMT+053   test            read                 62800          1 ms            1 < x < 2           
Feb 05 2019 17:06:54 GMT+053   test            write                62800          1 ms            2 < x < 4           
Feb 05 2019 17:06:54 GMT+053   test            write-start          62800          1 ms            1 < x < 2           
Feb 05 2019 17:06:54 GMT+053   test            write-restart        0              0 ms            0 < x < 0           
Feb 05 2019 17:06:54 GMT+053   test            write-dup-res        0              0 ms            0 < x < 0           
Feb 05 2019 17:06:54 GMT+053   test            write-master         62800          1 ms            1 < x < 2           
Feb 05 2019 17:06:54 GMT+053   test            write-repl-write     62800          1 ms            2 < x < 4           
Feb 05 2019 17:06:54 GMT+053   test            write-response       62800          1 ms            0 < x < 1           
Feb 05 2019 17:06:54 GMT+053   test            query                0              0 ms            0 < x < 0           
Feb 05 2019 17:06:54 GMT+053   test            query-rec-count      0              0 ms            0 < x < 0           
Feb 05 2019 17:07:04 GMT+053   test            read                 64100          1 ms            1 < x < 2           
Feb 05 2019 17:07:04 GMT+053   test            write                64100          1 ms            4 < x < 8           
Feb 05 2019 17:07:04 GMT+053   test            write-start          64100          1 ms            0 < x < 1           
Feb 05 2019 17:07:04 GMT+053   test            write-restart        0              0 ms            0 < x < 0           
Feb 05 2019 17:07:04 GMT+053   test            write-dup-res        0              0 ms            0 < x < 0           
Feb 05 2019 17:07:04 GMT+053   test            write-master         64100          1 ms            0 < x < 1           
Feb 05 2019 17:07:04 GMT+053   test            write-repl-write     64100          1 ms            2 < x < 4           
Feb 05 2019 17:07:04 GMT+053   test            write-response       64100          1 ms            0 < x < 1           
Feb 05 2019 17:07:04 GMT+053   test            query                0              0 ms            0 < x < 0           
Feb 05 2019 17:07:04 GMT+053   test            query-rec-count      0              0 ms            0 < x < 0           

Example output of whole file summary

{test}-write-dup-res: average < 0 ms ; max transaction range: 0 ms < x < 0 ms
{test}-read: average < 1 ms ; max transaction range: 64 ms < x < 128 ms
{test}-query-rec-count: average < 0 ms ; max transaction range: 0 ms < x < 0 ms
{test}-write-restart: average < 1 ms ; max transaction range: 0 ms < x < 1 ms
{test}-write: average < 1 ms ; max transaction range: 512 ms < x < 1024 ms
{test}-write-master: average < 1 ms ; max transaction range: 64 ms < x < 128 ms
{test}-write-repl-write: average < 1 ms ; max transaction range: 512 ms < x < 1024 ms
{test}-query: average < 0 ms ; max transaction range: 0 ms < x < 0 ms
{test}-write-start: average < 1 ms ; max transaction range: 256 ms < x < 512 ms
{test}-write-response: average < 1 ms ; max transaction range: 8 ms < x < 16 ms

Details for code implementation and embedding AvgHist class to own code

Code implementation

In order to use the code as part of own implementation, grab only the self-contained AvgHist class from the avglatency.py file.

Refer to the main code that is outside said class for usage example.

Class init

myAvg = AvgHist("/file", callbackfunc=None, namespace=None, histogram=None, percentile=None)
  • namespace can be None (get all namespaces) or a string denoting the namespace name
  • histogram can be None (get all namespaces) or a string denoting the histogram name
  • percentile can be None (get all transactions) or an int denoting the total percentage of transactions to include (starting from the fastest ones)
  • callbackfunc can be None (do not produce detailed latencies, just summary on exit) or a function that will be called back every time a histogram is processed with the data.
  • ignore_first can be None, True or False. If set to true, it will ignore first occurrance of each namespace-histogram pair it finds (useful since these are cumulative)

Callbackfunc definition

def avg_callback(namespace, histogram, items, average, timestamp, max_ms)

For each histogram processed, the callback function will receive the histogram name, namespace name, number of transactions (items), average latency, timestamp of said histogram and the max transaction time in ms. Note that max transaction time can be anywhere between max_ms and max_ms/2.

Run and Tail

To run the histogram check and produce results, simply run:

myAvg.run()

To tail the log file instead, run:

myAvg.tail()

Whole file and on-exit summary

Both myAvg.run() and myAvg.tail() produce a summary. A run through file produces once end of log file is reached. The tail will produce the overall summary once KeyboardInterrupt is received.

In order to receive latency summaries of either tail or run:

for latency in myAvg.run():
    [...]

for latency in myAvg.tail():
    [...]

The latency dict contains the following:

  • latency[‘namespace’]
  • latency[‘histogram’]
  • latency[‘average’]
  • latency[‘max_ms’]

Keywords

histogram average latency latencies

Timestamp

12/27/18

Will this be shipped as part of the tools package?

After Line 13:

ms = 0

, Add Line 14:

max_ms = 0

. (Uninitialized member, I ran into errors when testing with one of the options due to this.)

Line 213: the default log file is: /var/log/aerospike.log - typically this should be

/var/log/aerospike/aerospike.log

Also, use python ver 2.7. If using ver 2.6, format statement on line 173 will need indexes:

detail_format = "{0: &lt;30} {1: &lt;20} {2: &lt;20} {3: &lt;14} {4: &lt;15} {5: &lt;20}"