Server full error when not full

Hello,

One of the nodes in the cluster started to return a "Server full" exception; the relevant console/log output is included below.

The bad thing is that it has happened multiple times on different nodes. We worked around it by deleting the data file on the affected node and letting it re-replicate from the other nodes, but the problem seems systematic.

What seems strange is that the node reports used 18%, available 0%, so the two do not sum to 100%. They do not sum to 100% on the other nodes either, but there the gap is only around 10%. Otherwise, all of the node's statistics look comparable to the other nodes.

I am attaching part of the log file and part of the gdb output. I also have a backup of the “full” data file (which seems almost empty), as well as the output of the asmonitor collectinfo tool.

Any ideas on how to investigate the problem?

Jun 09 2016 10:17:25 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /www/aerospike/data/er.dat: used 1013766400, contig-free 15169M (15169 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s) defrag-w-tot 0 (0.0/s)
Jun 09 2016 10:17:37 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /www/aerospike/data/abox.dat: used 0, contig-free 1021M (1021 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 1 (0.0/s) defrag-w-tot 0 (0.0/s)
Jun 09 2016 10:17:45 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /www/aerospike/data/er.dat: used 1013766400, contig-free 15169M (15169 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 0 (0.0/s) defrag-w-tot 0

Jun 09 2016 07:45:43 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /opt/aerospike/data/aero_scan.dat: used 22505230592, contig-free 768M (768 wblocks), swb-free 15, w-q 0 w-tot 246486275 (219.3/s), defrag-q 95585 defrag-tot 246493114 (220.3/s) defrag-w-tot 103630785 (3.5/s)

Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::2453) {scan}: write_local_pickled: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::2453) {scan}: write_local_pickled: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::3418) {scan}: write_local: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::2453) {scan}: write_local_pickled: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::3418) {scan}: write_local: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::3418) {scan}: write_local: drives full
Jun 09 2016 07:45:43 GMT: WARNING (rw): (thr_rw.c::2453) {scan}: write_local_pickled: drives full
Jun 09 2016 07:45:43 GMT: INFO (rw): (thr_rw.c::2861) [NOTICE] writing pickled failed(-1):<Digest>:0x0d6f6a70b1a4eead46240641749ccbb0f3b4e30c
Jun 09 2016 07:45:43 GMT: INFO (rw): (thr_rw.c::2861) [NOTICE] writing pickled failed(-1):<Digest>:0x89ad25997e10640b27ee4bc2d36c1a35497cd655
Jun 09 2016 07:45:43 GMT: INFO (rw): (thr_rw.c::2861) [NOTICE] writing pickled failed(-1):<Digest>:0x5af448122da58488bd4ac1b07371346bb7f15afd

Defrag appears to be unable to keep up with your write load on the underlying storage devices.

You are writing at 219.3 wblocks per second (see write-block-size in the config), while defrag is processing 220.3 wblocks per second and, after compacting them, writing the surviving records back at 3.5 wblocks per second. The write rate is effectively outpacing the rate at which defrag hands free wblocks back — note the defrag-q backlog of 95,585 wblocks and only 768 contiguous free wblocks left — resulting in the runaway situation here. This also explains the odd percentages: "available" measures contiguous free wblocks, so it can reach 0% even while raw used space is only 18%.
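If it helps to confirm this on the affected node, something like the following should show the contiguous-free percentage directly (a sketch — it assumes the namespace is named scan, matching the data-file name in your log, and a 3.x server where the statistic is called available_pct):

asinfo -v 'namespace/scan' | tr ';' '\n' | grep available_pct

When available_pct falls below min-avail-pct (5% by default), the node stops accepting writes, which is what the client should be seeing as the "Server full" error even though used-bytes is low.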

You can increase the defrag rate by reducing defrag-sleep, which by default is 1 ms (1000 µs) of sleep per wblock read.
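For reference, a hedged sketch of where that knob lives — the namespace name scan and the value 500 are only examples, not a recommendation:

# statically, in aerospike.conf, inside the namespace's storage-engine block
namespace scan {
    ...
    storage-engine device {
        file /opt/aerospike/data/aero_scan.dat
        ...
        defrag-sleep 500   # microseconds to sleep per defragged wblock (default 1000 = 1 ms)
    }
}

# or dynamically, without a restart
asinfo -v 'set-config:context=namespace;id=scan;defrag-sleep=500'

Lowering defrag-sleep lets the defrag thread read wblocks more often, at the cost of some extra device I/O; keep an eye on the defrag-q and contig-free figures in the drv_ssd log lines after changing it to confirm defrag is catching up.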

Thanks a lot, reducing defrag-sleep helped.