Aerospike 3.6.0 getting stopped after some time

Hi Aerospike Team,

We have been using Aerospike for quiet some time and it was working fine till some time back. for past few weeks, we are noticing a strange behavior. the aerospike service gets started. but it gets stopped after some time. Checked the logs. They look clean. Can somebody help? /etc/aerospike/aerospike.conf

# Aerospike database configuration file.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
	#file /var/log/aerospike/aerospike_debug.log {
        #context any debug
    	#}
	 file /var/log/aerospike/aerospike_critical.log {
        context any critical
    }
	 #file /var/log/aerospike/aerospike_warning.log {
        #context any warning
    #} 
	#file /var/log/aerospike/aerospike_detail.log {
        #context any detail
    #}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode mesh
		#address 239.1.99.222
		port 3002

		# To use unicast-mesh heartbeats, remove the 3 lines above, and see
		# aerospike_mesh.conf for alternative.
		mesh-seed-address-port 172.31.7.207  3002
		interval 300
		timeout 10
	}

	fabric {
		address any
		port 3001
	}

	info {
		port 3003
	}
}

# In the Test setup the instance type assummed is t2.medium
namespace cdbapplication {
    replication-factor 1
    memory-size 20G
    default-ttl 0  # 30 days, use 0 to never expire/evict.

    storage-engine device {
        #device /dev/xvdi
        file /asdb/aerospike/data/cdbapplication.dat
        filesize 20G
        data-in-memory true # Store data in memory in addition to file.
    }
}

namespace cdbcontent {
    replication-factor 1
    memory-size 8G
    default-ttl 0

    storage-engine device {
        #device /dev/xvdj
        file /asdb/aerospike/data/cdbcontent.dat
        filesize 40G
        data-in-memory false
    }
}

Critical Logs:

Warning Logs:

Oct 28 2020 18:09:59 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out

Oct 28 2020 18:10:30 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out
Oct 28 2020 18:11:02 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out
Oct 28 2020 18:11:33 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out
Oct 28 2020 18:12:05 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out
Oct 28 2020 18:12:36 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 172.31.14.210:3002: timed out
Oct 28 2020 18:13:02 GMT: WARNING (as): (signal.c::171) SIGTERM received, shutting down
Oct 28 2020 18:14:49 GMT: WARNING (hb): (hb.c::1331) rejected tip for self: 172.31.7.207:3002
Oct 28 2020 18:14:49 GMT: WARNING (hb): (hb.c::2127) couldn't add remote heartbeat service 172.31.7.207:3002 to mesh host list
Oct 28 2020 18:21:25 GMT: WARNING (as): (signal.c::171) SIGTERM received, shutting down

Info Logs:

Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::137) histogram dump: udf (523 total) msec
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (10: 0000000005) (11: 0000000009) (17: 0000000048) (18: 0000000075)
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (19: 0000000137) (20: 0000000174) (21: 0000000071) (22: 0000000004)
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::137) histogram dump: query (150 total) msec
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (00: 0000000109) (01: 0000000010) (02: 0000000021) (03: 0000000003)
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (04: 0000000003) (06: 0000000001) (07: 0000000002) (17: 0000000001)
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (149 total) count
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (01: 0000000087) (03: 0000000001) (04: 0000000009) (05: 0000000015)
Oct 29 2020 15:54:43 GMT: INFO (info): (hist.c::154) (06: 0000000015) (07: 0000000017) (11: 0000000003) (13: 0000000002)
Oct 29 2020 15:54:46 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /asdb/aerospike/data/cdbcontent.dat: used 727486848, contig-free 40261M (40261 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 9 (0.0/s) defrag-w-tot 0 (0.0/s)
^[[AOct 29 2020 15:54:47 GMT: INFO (drv_ssd): (drv_ssd.c::2088) device /asdb/aerospike/data/cdbapplication.dat: used 122577536, contig-free 20359M (20359 wblocks), swb-free 0, w-q 0 w-tot 0 (0.0/s), defrag-q 0 defrag-tot 8 (0.0/s) defrag-w-tot 0 (0.0/s)
Oct 29 2020 15:54:51 GMT: INFO (nsup): (thr_nsup.c::1206) {cdbapplication} nsup start
Oct 29 2020 15:54:51 GMT: INFO (nsup): (thr_nsup.c::1130) {cdbapplication} Records: 247394, 247394 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 52 ms
Oct 29 2020 15:54:51 GMT: INFO (nsup): (thr_nsup.c::1206) {cdbcontent} nsup start
Oct 29 2020 15:54:51 GMT: INFO (nsup): (thr_nsup.c::1130) {cdbcontent} Records: 619669, 619669 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 138 ms
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::4979) system memory: free 14987136kb ( 95 percent free ) 
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::4985) ClusterSize 1 ::: objects 867063 ::: sub_objects 0
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::4994) rec refs 867063 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::4999) replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0 
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5009) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (9, 10043, 10034) : hb (0, 0, 0) : fab (16, 16, 0)
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5011) heartbeat_received: self 0 : foreign 0
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5012) heartbeat_stats: bt 0 bf 0 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0 
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5024) tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5041) {cdbapplication} disk bytes used 122577536 : avail pct 99
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5043) {cdbapplication} memory bytes used 82020200 (index 15833216 : sindex 6380887 : data 59806097) : used pct 0.38
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5091) {cdbapplication} migrations - complete
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5053) {cdbcontent} disk bytes used 727486848 : avail pct 98 : cache-read pct 0.00
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5055) {cdbcontent} memory bytes used 76232989 (index 39658816 : sindex 36574173) : used pct 0.89
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5091) {cdbcontent} migrations - complete
Oct 29 2020 15:54:53 GMT: INFO (info): (thr_info.c::5098) partitions: actual 8192 sync 0 desync 0 zombie 0 absent 0
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: reads (52440 total) msec
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (00: 0000021699) (01: 0000009918) (02: 0000013232) (03: 0000006376)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (06: 0000001089) (07: 0000000012) (17: 0000000001) (18: 0000000002)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::163) (19: 0000000109) (20: 0000000001) (21: 0000000001)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (34 total) msec
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::163) (00: 0000000034)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: udf (523 total) msec
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (10: 0000000005) (11: 0000000009) (17: 0000000048) (18: 0000000075)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (19: 0000000137) (20: 0000000174) (21: 0000000071) (22: 0000000004)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: query (150 total) msec
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (00: 0000000109) (01: 0000000010) (02: 0000000021) (03: 0000000003)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (04: 0000000003) (06: 0000000001) (07: 0000000002) (17: 0000000001)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (149 total) count
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (01: 0000000087) (03: 0000000001) (04: 0000000009) (05: 0000000015)
Oct 29 2020 15:54:53 GMT: INFO (info): (hist.c::154) (06: 0000000015) (07: 0000000017) (11: 0000000003) (13: 0000000002)

The logs indicate SIGTERM seemingly due to heartbeat related connect failures. Suggest you upgrade to 3.13 or later version where the cluster protocol has changed and is more robust. The server version 3.6.0 is really old. Please let us know if the problem persists.

Hi neelp,

Thankyou for a quick response.

We are running it with NodeJs v0.12 with Aerospike client v1.0.31

If we change DB version, would it have impact on the nodejs side, w.r.t. Nodejs version, nodejs client version and UDFs used?

Can this be worked out w/o the aerospike upgrade? This has been running for a while now. All of a sudden the issues started popping up.

Thanks in advance, Manik Mittal

A suggestion from engineering is to check if you may have scripts that are sending the SIGTERM. If so, that may allow you more time on this (really old) version.

[If you upgrade, note you cannot upgrade past 3.13 without first doing the protocol changes in 3.13. Upgrading past 3.13 (without stopping at 3.13) requires a full cluster shutdown and then start. If you do that, suggest you upgrade to a 4.x version before the device format change.]

Depending on which version you upgrade to, please check the doc for compatibility of nodejs client and UDFs with the server version.

This is a good reference for what to watch out for. Probably not 100% complete but something to start with. https://www.aerospike.com/docs/client/java/usage/incompatible.html

A sigterm could be an OOM-killer couldnt it? Anything in dmesg/syslogs that indicates that?

Nah, OOM-killer is similar to SIGKILL, the process isn’t consulted in the matter. So it wouldn’t be able to log anything about it.

The memory and disk free space happens to be sufficient. Did not notice anything like that.

Also, the SIGTERM is shown as I had restarted the aerospike service after I could not log in to aql.

Probably worthwhile upgrading first prior to troubleshooting further given all the enhancements in the clustering protocol… even if a simple config issue, would be much easier to see on the latest versions.

None of the shared logs seem to show a problem… logs only seem to show once a node is alone and then being explicitly shut down…

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.