"EOF" error and server performance degradation


#1

Hi,

We found that write performance on AS server 3.5.9 is not stable. From time to time, writes_master performance degrades (read performance stays normal), and the Go client reports many EOF errors during the degradation.

Here’s the output of the command "asloglatency -h writes_master -n 8":

slice-to (sec)      1      8     64    512   4096  32768  ops/sec
-------------- ------ ------ ------ ------ ------ ------ --------
.......
06:31:03    10   3.02   0.01   0.00   0.00   0.00   0.00    688.6
06:31:13    10   3.41   0.00   0.00   0.00   0.00   0.00    682.4
06:31:23    10   9.93   6.93   6.93   6.93   0.00   0.00    453.0
06:31:33    10   4.91   1.19   1.14   0.70   0.00   0.00    657.6
06:31:43    10   3.27   0.00   0.00   0.00   0.00   0.00    645.3
06:31:53    10   2.67   0.00   0.00   0.00   0.00   0.00    660.2
06:32:03    10   3.49   0.00   0.00   0.00   0.00   0.00    647.8
.......
07:00:14    10   5.31   0.05   0.00   0.00   0.00   0.00   1015.4
07:00:24    10   4.16   0.01   0.00   0.00   0.00   0.00    911.9
07:00:34    10   3.31   0.00   0.00   0.00   0.00   0.00    771.4
07:00:44    10   7.01   3.60   3.46   2.88   0.00   0.00    710.9
07:00:54    10   7.12   3.31   3.11   1.92   0.00   0.00    676.0
07:01:04    10   3.50   0.09   0.00   0.00   0.00   0.00    659.6
07:01:14    10   2.66   0.00   0.00   0.00   0.00   0.00    661.7

Is there anything wrong, and how can I fix this issue?

Thanks.


#2

Thanks for reaching out on our forum. It is difficult to provide much of a hint based on this input alone.

My first suggestion would be to check whether there is any pattern to those degradations. Do they come at regular intervals? Then try to correlate them with other similar patterns, for example storage performance (if this is a namespace that has disk storage) or anything else in the Aerospike logs that may be happening at a similar interval.


#3

Hi meher,

It’s not at a regular interval. The interval could be 3 minutes, 1 hour, 20 hours …

The cluster is composed of 5 nodes and serves only 1 namespace.

The namespace keeps its data in memory, with HDD storage for persistence. Here’s the config:

namespace foobar {
	replication-factor 2
	memory-size 32G
	storage-engine memory
	ldt-enabled true

	storage-engine device {
		file /data/push.dat
		filesize 128G
		data-in-memory true
	}
}

The memory usage is about 40%.

Thanks


#4

I notice that LDT is enabled on this namespace. Is there LDT traffic going on at the same time as the regular write traffic?

Could you record iostat output and capture how it evolves during those latency spikes, to check whether they could be caused by the underlying device?

You could also turn on microbenchmarks and analyze as described in this post about write performance analysis, especially to see if those latency spikes are caused by the network.