I’m gathering Aerospike metrics using Telegraf’s Aerospike input plugin and using these metrics to create a Grafana dashboard. The metrics are inserted into InfluxDB as the datasource.
I’m trying to replicate AMC’s reads/writes per second graphs using the metrics gathered by Telegraf. To do this, I’m looking at client_write_* and client_read_* metrics.
I’m summing up client_write_{success,error,timeout} and doing a DERIVATIVE(1s) to calculate the writes per second. The full InfluxDB query looks something like:
SELECT non_negative_derivative(last("client_write_success"), 1s) AS "success",
non_negative_derivative(last("client_write_error"), 1s) AS "error",
non_negative_derivative(last("client_write_timeout"), 1s) AS "timeout"
FROM "aerospike_namespace"
WHERE ("cluster" = '$DC') AND $timeFilter
GROUP BY time($__interval), "namespace" fill(previous)
(It is graphed as a stacked bar graph)
The results that I get do not match what AMC reports. Grafana shows around 500 writes/sec while AMC reports 9500 writes/sec. Are the client_write_* metrics not the right ones? Using the metrics available in https://www.aerospike.com/docs/reference/metrics/, how would one recreate the AMC graphs?
First, I would check whether you are seeing per-node TPS in one case and cluster-wide TPS in the other. Does the ratio between the two numbers match the number of nodes in the cluster?
Then, on each node, you can calculate reads/sec from the log, /var/log/aerospike/aerospike.log: grep for "{namespace_name}-read ". That line is written every 10 seconds and gives the total count of reads since server start, so the difference between two consecutive counts divided by 10 is the reads/sec for that node.
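The arithmetic behind that log-based check can be sketched in a few lines of Python. This is only an illustration: it assumes you have already pulled two consecutive cumulative totals out of the log yourself (the parsing is left out because the exact line layout is not shown here), and the function name `per_second_rate` is made up for the example.

```python
def per_second_rate(prev_total, curr_total, interval_s=10.0):
    """Average operations per second between two cumulative samples.

    prev_total / curr_total are totals since server start, taken
    interval_s seconds apart (10 s assumed, as in the log ticker).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total - prev_total
    # A negative delta means the counter started over (for example after
    # a node restart), so no meaningful rate can be derived from this pair.
    if delta < 0:
        return None
    return delta / interval_s


# Two totals taken 10 seconds apart on one node.
print(per_second_rate(1_000, 6_000))
```

This is the same idea as non_negative_derivative in the query above: a rate is taken from consecutive samples of a counter, and negative jumps are discarded.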
So by your suggestion, I’m doing the calculation based on the log line (for writes), and the results are around 450-500 writes/sec, which is very similar to the sum of client_write_{success,error,timeout}.
It looks like the client_write_* metrics are per node. I thought they weren’t because the metrics reference said “Cumulative”. So by altering the query to add a further group by node, I get a number much closer to what AMC reports.
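To make the per-node point concrete, here is a small Python sketch of the calculation: compute a rate for each node from its own cumulative counter, then add the per-node rates to get the cluster-wide figure. The sample layout, node names and the function name `cluster_writes_per_second` are invented for illustration and are not something Telegraf or AMC provides.

```python
from collections import defaultdict


def cluster_writes_per_second(samples):
    """Sum of per-node write rates.

    samples: iterable of (node, timestamp_s, cumulative_count) tuples.
    Each node's rate is taken between its earliest and latest sample.
    """
    by_node = defaultdict(list)
    for node, ts, total in samples:
        by_node[node].append((ts, total))

    cluster_rate = 0.0
    for points in by_node.values():
        points.sort()
        (t0, c0), (t1, c1) = points[0], points[-1]
        # Skip nodes with a single sample or a counter that went backwards.
        if t1 > t0 and c1 >= c0:
            cluster_rate += (c1 - c0) / (t1 - t0)
    return cluster_rate


# Hypothetical three-node cluster, each node doing 500 writes/sec.
samples = [
    ("node1", 0, 1_000), ("node1", 10, 6_000),
    ("node2", 0, 2_000), ("node2", 10, 7_000),
    ("node3", 0, 0),     ("node3", 10, 5_000),
]
print(cluster_writes_per_second(samples))
```

Without the per-node grouping, last() picks a single node's counter in each interval, so the derivative reflects roughly one node's rate rather than the cluster total.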
But one interesting thing is that the client_write_* across all my nodes are almost the same number (± 1). The writes per second reported via the log calculation can differ from node to node by up to 50 writes/sec.
In the metrics reference page, “cumulative” refers to metrics that have accumulated since the process started; “cumulative” metrics increase monotonically during the life of the Aerospike process. There are also “instantaneous” metrics, which show the current quantity of what is being measured; these are not monotonic. All metrics describe the particular node they reside on.
Part of that could be due to differences in clock synchronization and server start times between nodes. The start of the 10-second accumulation window on node1 would not be precisely synced with the start of the 10-second window on node2. Such differences should probably not matter in the bigger scheme of things, though, in terms of monitoring.
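As a toy illustration of that window-offset effect (made-up numbers, not real Aerospike data), the Python sketch below counts the same stream of write timestamps in two 10-second windows that start 5 seconds apart. The function name `window_rate` and the traffic pattern are assumptions for the example.

```python
def window_rate(event_times, window_start, window_s=10.0):
    """Events per second inside [window_start, window_start + window_s)."""
    window_end = window_start + window_s
    count = sum(1 for t in event_times if window_start <= t < window_end)
    return count / window_s


# Identical bursty traffic seen by two nodes: 100 writes at t=1s, 300 at t=11s.
events = [1.0] * 100 + [11.0] * 300

print(window_rate(events, window_start=0.0))  # window aligned at t=0
print(window_rate(events, window_start=5.0))  # window shifted by 5 seconds
```

The two windows report different rates for identical traffic, which is the kind of node-to-node difference that evens out once you look at longer periods.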