Aerospike Stops Abruptly with no error in logs


#1

Aerospike stops abruptly with no error message in logs

Description

Aerospike server in normal operation shuts down. There is no information in the log to indicate why this might be. The log shows normal messages and then nothing. Startup messages may be shown if the operations team has restarted the node.

Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::137) histogram dump: proxy (47166 total) msec
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (00: 0000042346) (01: 0000003405) (02: 0000000180) (03: 0000000174)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (04: 0000000215) (05: 0000000234) (06: 0000000278) (07: 0000000142)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (08: 0000000068) (09: 0000000064) (10: 0000000052) (11: 0000000008)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (1345946360 total) msec
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (00: 1252824283) (01: 0058777235) (02: 0021017740) (03: 0010035103)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (04: 0002737668) (05: 0000426331) (06: 0000042140) (07: 0000023558)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (08: 0000061791) (09: 0000000339) (10: 0000000084) (11: 0000000012)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::154)  (12: 0000000030) (13: 0000000016) (14: 0000000021) (15: 0000000006)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::163)  (17: 0000000003)
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 23 2015 15:45:04 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count

Then next in the log, a restart.

Dec 23 2015 16:45:12 GMT: INFO (as): (as.c::373) <><><><><><><><><><>  Aerospike Enterprise Edition build 3.5.14  <><><><><><><><><><>
Dec 23 2015 16:45:12 GMT: INFO (config): (cfg.c::3104) service {
Dec 23 2015 16:45:12 GMT: INFO (config): (cfg.c::3104) 
Dec 23 2015 16:45:12 GMT: INFO (config): (cfg.c::3104)  user aerospike
Dec 23 2015 16:45:12 GMT: INFO (config): (cfg.c::3104)  group aerospike

Root Cause

Usually, when the Aerospike server stops logging abruptly the issue is the kernel on the machine running out of memory. In all other instances some kind of log message or signal is written to give an indication as to what might have happened. When a stack trace is generated this is normally written to the logs. To check to see if there is an out of memory error, the kernel logs should be examined.

$ grep -i oom kern.log

Dec 23 15:45:14 as08 kernel: [12702489.912374] asd invoked oom-killer: gfp_mask=0×201da, order=0, oom_adj=0, oom_score_adj=0
Dec 23 15:45:14 as08 kernel: [12702489.951674] [<ffffffff810b7c4f>] ? oom_kill_process+0x49/0×271
Dec 23 15:45:14 as08 kernel: [12702490.429034] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name

In the output above we see that this was indeed an out of memory issue.

The next step is to find out why memory ran out. The way to do this is to consider what was happening on the machine at the time. Usually, Aerospike will protect against running out of its own memory using features such as memory-high-watermarks and stop_writes mode so the cause is unlikely to be related to Aerospike.

One scenario where Aerospike activity could have an unexpected impact on memory usage is when a node migrates partitions out. The node will load a partition into memory for each configured migrate-thread and could exhaust all the memory available. More efficient memory allocation for migration handling are on the road map.