Clarify a case of "Server memory error"

This time, my Aerospike instance runs on a single workstation PC, with a single client thread that continuously reads, writes, and deletes records. The program reads data and updates existing records; it does not write new records. Disk write throughput stays at 200-300 MBps. Before it fails, defrag-q keeps increasing (129627 -> 138439 -> 149446) and contig-free drops sharply (5G -> 3G -> 1.5G -> 75M), then the Java client throws:

ERROR com.aerospike.client.AerospikeException:
Error Code 8: Server memory error
        at com.aerospike.client.command.WriteCommand.parseResult (WriteCommand.java:59)

This is a development system; the hardware is not server-grade, and I mainly view it as an opportunity to learn about Aerospike’s behavior and limitations. My questions are:

  1. Apparently, the system can’t defrag the write blocks fast enough, so there are not enough free write blocks to write data. Is that a correct description? Is the solution to configure defragmentation to run more frequently? Another forum thread discusses this in detail.
  2. Does "Server memory error" refer to contig-free, to defrag-q, or to both? defrag-q is held in memory. Obviously, if contig-free drops to near 0 it makes sense to get an error, but then it shouldn’t be called a memory error, should it?
  3. Unlike w-q, there seems to be no option to configure a maximum size for defrag-q. Since in my case the data is written to existing blocks, does a growing defrag-q do any harm other than the risk of data loss when the server crashes? If there is no configurable or hard limit, will it just consume all the system RAM? (There is a lot of free memory in this case.)

Thanks.

Reference - doc

contig-free: Amount of space occupied by free wblocks, and the number of wblocks free in parenthesis 
defrag-q: Number of wblocks pending defrag

Reference - log

Nov 10 2014 10:17:37 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5230367744, contig-free 4955M (19823 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2550474 (354.2/s), defrag-q 129627 defrag-tot 2535305 (357.5/s)
Nov 10 2014 10:17:57 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5225796224, contig-free 3907M (15630 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2561988 (575.7/s), defrag-q 133842 defrag-tot 2546832 (576.3/s)
Nov 10 2014 10:18:17 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5220899328, contig-free 2778M (11115 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2574351 (618.2/s), defrag-q 138439 defrag-tot 2559266 (621.7/s)
Nov 10 2014 10:18:37 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5217150592, contig-free 2008M (8034 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2583777 (471.3/s), defrag-q 141543 defrag-tot 2568708 (472.1/s)
Nov 10 2014 10:18:57 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5214418560, contig-free 1438M (5754 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2590680 (345.1/s), defrag-q 143879 defrag-tot 2575658 (347.5/s)
Nov 10 2014 10:19:17 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /ssd/aerospike/aibo.db: used 5208543488, contig-free 75M (303 wblocks), swb-free 9, n-w 0, w-q 0 w-tot 2605512 (741.6/s), defrag-q 149446 defrag-tot 2590599 (747.0/s)
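
The trend in these lines can be checked with a short script. This is only a rough sketch: the regular expression is inferred from the log lines above and may not match other server versions, and the names `parse_line` and `deltas` are made up for illustration.

```python
import re

# Pattern inferred from the drv_ssd lines above (an assumption, not an official format).
LINE_RE = re.compile(
    r"contig-free (?P<free_mb>\d+)M \((?P<free_wblocks>\d+) wblocks\)"
    r".*?w-tot (?P<w_tot>\d+)"
    r".*?defrag-q (?P<defrag_q>\d+) defrag-tot (?P<defrag_tot>\d+)"
)

def parse_line(line):
    """Return the counters from one log line as ints, or None if it does not match."""
    m = LINE_RE.search(line)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None

def deltas(samples):
    """Change in (free wblocks, defrag-q) between consecutive samples."""
    return [
        (b["free_wblocks"] - a["free_wblocks"], b["defrag_q"] - a["defrag_q"])
        for a, b in zip(samples, samples[1:])
    ]

# Relevant tail of the first three log lines above.
log = [
    "contig-free 4955M (19823 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2550474 (354.2/s), defrag-q 129627 defrag-tot 2535305 (357.5/s)",
    "contig-free 3907M (15630 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2561988 (575.7/s), defrag-q 133842 defrag-tot 2546832 (576.3/s)",
    "contig-free 2778M (11115 wblocks), swb-free 8, n-w 0, w-q 0 w-tot 2574351 (618.2/s), defrag-q 138439 defrag-tot 2559266 (621.7/s)",
]

samples = [parse_line(line) for line in log]
for free_delta, queue_delta in deltas(samples):
    print(f"free wblocks {free_delta:+d}, defrag-q {queue_delta:+d}")
```

On these three samples, each 20-second interval loses roughly as many free wblocks as defrag-q gains (-4193 vs +4215, then -4515 vs +4597).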

Coming to your questions…

  1. Yes, in this case it seems that defrag is not able to keep up with the write/update rate. How much update throughput do you have? In general, you should size the system so that defrag can keep up with the update throughput. That may also mean allocating more disk space. Aggressive defrag is not always the solution.

  2. Error code 8 means that capacity on the server side is full. Actually, it's not accurate to print the string as “server memory error”, as this can happen for multiple reasons:

  • memory reached the stop-writes threshold (default 90%)
  • disk reached the stop-writes threshold (default 90%)
  • contiguous free space fell below the stop-writes threshold (default 5%)
  3. There is no limit imposed on defrag-q, as all of its elements need to be defragged. We cannot decide not to defrag a block because the queue is full. The queue is nothing but a series of integers (block IDs), so its memory footprint should be small. Nevertheless, it's not a healthy situation to have a very large queue. The key is to size the system properly for the use case. (Ref: http://www.aerospike.com/docs/operations/plan/capacity/)
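
To make points 2 and 3 concrete, here is a minimal sketch of the three conditions listed above and a back-of-the-envelope estimate of the queue size. The function names, the percentage inputs, and the 4 bytes per block ID are all assumptions for illustration; they are not server code or configuration names.

```python
# Default thresholds as quoted above; a real deployment may override them.
MEMORY_STOP_WRITES_PCT = 90
DISK_STOP_WRITES_PCT = 90
MIN_CONTIG_FREE_PCT = 5

def stop_write_reasons(memory_used_pct, disk_used_pct, contig_free_pct):
    """Return which of the three capacity conditions would reject writes."""
    reasons = []
    if memory_used_pct >= MEMORY_STOP_WRITES_PCT:
        reasons.append("memory")
    if disk_used_pct >= DISK_STOP_WRITES_PCT:
        reasons.append("disk")
    if contig_free_pct < MIN_CONTIG_FREE_PCT:
        reasons.append("contig-free")
    return reasons

def defrag_queue_bytes(entries, bytes_per_id=4):
    """Rough payload size of the queue, assuming one fixed-width integer ID
    per entry (4 bytes is a guess; queue overhead is ignored)."""
    return entries * bytes_per_id

# Hypothetical percentages: plenty of memory and disk, almost no free wblocks.
print(stop_write_reasons(memory_used_pct=40, disk_used_pct=60, contig_free_pct=1))
# Last defrag-q value from the log in the question.
print(defrag_queue_bytes(149446))
```

Under that 4-byte assumption, the 149446 queued block IDs from the last log line take about 0.6 MB, which is why the queue by itself is unlikely to exhaust RAM even though its growth signals a sizing problem.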