Cluster in stop-writes but UDFs continue writing


#1

Cluster in stop-writes but UDFs continue writing

Problem Description

In Aerospike releases prior to Aerospike 3.8.0 a node has hit stop-writes however avail-pct continues to decrease. Analysis of the logs shows that UDFs are still executing and that the writes are being accepted by the database.

Explanation

There is a bug in Aerospike releases prior to 3.8.0 whereby UDF writes, will go through even when a database is in stop-writes. The aerospike.log will show:

aerospike.log-20160405:Apr 05 2016 06:10:47 GMT: WARNING (namespace): (namespace.c::457) {problem_namespace} hwm_breached false, stop_writes true (disk avail pct), memory sz:20890709632 (20890709632 + 0) hwm:34789236736 sw:52183851008, disk sz:777470015488 hwm:896183828480

Using either asadm in analysis mode or asloglatency we can see that UDFs are still executing.

Admin> loglatency -h udf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~cluster~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                :   s1       .        .         .         |   
.                   :   % >1ms   % >8ms   % >64ms   ops/sec   |   
Apr 05 2016 06:10:17:   4.38     0.06     0.00      360.8     |   
Apr 05 2016 06:10:27:   5.92     0.06     0.00      319.0     |   
Apr 05 2016 06:10:37:   5.53     0.12     0.00      403.6     |   
Apr 05 2016 06:10:47:   3.48     0.02     0.00      448.6     |   
Apr 05 2016 06:10:57:   5.93     0.06     0.00      330.7     |   
Apr 05 2016 06:11:07:   6.44     0.20     0.00      295.1     |   
Apr 05 2016 06:11:17:   5.06     0.00     0.00      385.6     |   
Apr 05 2016 06:11:27:   8.98     0.07     0.00      405.4     |   
Apr 05 2016 06:11:37:   6.66     0.14     0.00      437.2     |   
Apr 05 2016 06:11:47:   5.11     0.05     0.03      381.7     |   
Apr 05 2016 06:11:57:   6.67     0.00     0.00      316.2     |   
Apr 05 2016 06:12:07:   5.71     0.04     0.00      276.8     |   

Log messages indicate that these UDFs are writing data as the w-tot count for the devices on the namespace is increasing.

aerospike.log-20160405:Apr 05 2016 06:16:50 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /dev/xvdb: used 388800305152, contig-free 38406M (76813 wblocks), swb-free 16, w-q 0 w-tot 124912121 (17.0/s), defrag-q 0 defrag-tot 124321385 (16.6/s) defrag-w-tot 57987003 (7.9/s)
aerospike.log-20160405:Apr 05 2016 06:17:00 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /dev/xvdc: used 388722930048, contig-free 37535M (75071 wblocks), swb-free 16, w-q 0 w-tot 126356071 (15.6/s), defrag-q 0 defrag-tot 125706130 (14.6/s) defrag-w-tot 58184806 (6.9/s)
aerospike.log-20160405:Apr 05 2016 06:17:10 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /dev/xvdb: used 388801634048, contig-free 38376M (76752 wblocks), swb-free 16, w-q 0 w-tot 124912419 (14.9/s), defrag-q 0 defrag-tot 124321650 (13.2/s) defrag-w-tot 57987129 (6.3/s)
aerospike.log-20160405:Apr 05 2016 06:17:20 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /dev/xvdc: used 388724107264, contig-free 37511M (75023 wbl

Solution

This is a bug within Aerospike and has been resolved in Aerospike releases 3.8.x and higher. To resolve this issue the product should be upgraded.

As a temporary measure, shutting down all UDFs should be included in operational process if a stop-write condition occurs.

Notes

  • This is tracked under JIRA AER-4907
  • The asadm tool includes rich functionality for log and latency analysis in addition to a comprehensive command line administrative interface. It can be cloned from the following Git repository. https://github.com/aerospike/aerospike-admin

Keywords

UDF STOP-WRITES

Timestamp

5/6/16