Aerospike service aborted (AER-3639)

crash

#1

Hi,

I noticed that one of my nodes was not showing up in asmonitor so I connected to it and found something odd. The root partition local disk was full and asd was down, however, very shortly after the root partition was back to normal. I would usually attribute this to log rotation but it didn’t look like the logs were particularly big or recently rotate. Is there anything that aerospike writes to the root disk (other than logs)? The only things running on that instance are asd, amc (without any clients attached) and diamond (a lightweight service that gathers system stats and sends them to graphite).

Here’s the aerospike log at the time:

Oct 08 2015 21:45:24 GMT: INFO (partition): (partition.c::2121) blocking until pending writes are done: {biddingdb:2609}, pstate 1, 1 to go 159 centisecs
Oct 08 2015 21:45:25 GMT: INFO (partition): (partition.c::2121) blocking until pending writes are done: {biddingdb:2609}, pstate 1, 1 to go 169 centisecs
Oct 08 2015 21:45:26 GMT: INFO (partition): (partition.c::2121) blocking until pending writes are done: {biddingdb:2609}, pstate 1, 1 to go 179 centisecs
Oct 08 2015 21:45:26 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/xvdb: used 141523437184, contig-free 401483M (802966 wblocks), swb-free 2, n-w 0, w-    q 0 w-tot 4950739 (11.6/s), defrag-q 0 defrag-tot 4546759 (11.4/s)
Oct 08 2015 21:45:27 GMT: INFO (partition): (partition.c::2121) blocking until pending writes are done: {biddingdb:2609}, pstate 1, 1 to go 189 centisecs
Oct 08 2015 21:45:27 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/xvdc: used 141387127168, contig-free 402152M (804305 wblocks), swb-free 1, n-w 0, w-    q 0 w-tot 4940578 (11.8/s), defrag-q 0 defrag-tot 4537311 (11.1/s)
Oct 08 2015 21:45:28 GMT: INFO (partition): (partition.c::2121) blocking until pending writes are done: {biddingdb:2609}, pstate 1, 1 to go 199 centisecs
Oct 08 2015 21:45:28 GMT: CRITICAL (partition): (partition.c:as_partition_migrate_tx:2127) in migrate_tx, pending writes never drained: {biddingdb:2609}
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::93) SIGABRT received, aborting Aerospike Community Edition build 3.5.15
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: found 9 frames
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x54) [0x46ea38]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7fa77dce3c30]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 2: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fa77dce3bb9]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 3: /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fa77dce6fc8]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x2a5) [0x4fd0ce]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 5: /usr/bin/asd(as_partition_migrate_tx+0x54d) [0x4db0f1]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 6: /usr/bin/asd(migrate_xmit_fn+0x719) [0x4d7f7b]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fa77ebbc182]
Oct 08 2015 21:45:28 GMT: WARNING (as): (signal.c::95) stacktrace: frame 8: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa77dda7fbd]

And the logs dir which seems quite reasonable:

$ date                                                                                                                            
Thu Oct  8 22:30:30 UTC 2015     
$ ls -ltrah /var/log/aerospike                                                                                                    
total 362M                                                                                                                                                  
-rw-r--r--  1 root      root      1.4M Sep 25 06:25 aerospike.log-20150925.gz                                                                               
-rw-r--r--  1 root      root      2.2M Sep 26 06:49 aerospike.log-20150926.gz                                                                               
-rw-r--r--  1 root      root      2.1M Sep 27 06:45 aerospike.log-20150927.gz                                                                               
-rw-r--r--  1 root      root      2.1M Sep 28 06:34 aerospike.log-20150928.gz                                                                               
-rw-r--r--  1 root      root      2.1M Sep 29 06:30 aerospike.log-20150929.gz                                                                               
-rw-r--r--  1 root      root      3.1M Sep 30 06:43 aerospike.log-20150930.gz                                                                               
-rw-r--r--  1 root      root      4.7M Oct  1 06:42 aerospike.log-20151001.gz                                                                               
-rw-r--r--  1 root      root      7.8M Oct  2 06:35 aerospike.log-20151002.gz                                                                               
-rw-r--r--  1 root      root      8.5M Oct  3 06:39 aerospike.log-20151003.gz                                                                               
-rw-r--r--  1 root      root       29M Oct  4 06:30 aerospike.log-20151004.gz                                                                               
-rw-r--r--  1 root      root      142M Oct  5 06:41 aerospike.log-20151005.gz                                                                               
-rw-r--r--  1 root      root      130M Oct  6 06:52 aerospike.log-20151006.gz                                                                               
-rw-r--r--  1 root      root      2.7M Oct  7 06:43 aerospike.log-20151007.gz                                                                               
-rw-r--r--  1 root      root      2.7M Oct  8 06:30 aerospike.log-20151008.gz                                                                               
drwxr-xr-x  2 aerospike aerospike 4.0K Oct  8 06:30 .                                                                                                       
-rw-r--r--  1 root      root       25M Oct  8 21:45 aerospike.log                                                                                           
drwxrwxr-x 15 root      syslog    4.0K Oct  8 21:45 ..                                               

Any thoughts about why aerospike went down?

Thanks!


#2

This is the result of a race condition that we recently discovered that can be triggered when a new migration round is triggered before the prior one ends. This should be patch in a near future release and look out for AER-3539 in the server release notes.

The race condition was discovered while implementing new migrations stats that will tell you precisely how many partition migrations are remaining, this item is being tracked as JIRA ticket AER-3639.


#3

Cool - will keep an eye out for that update.


#4

@naoum,

We recently released Aerospike Tools 3.6.2, in which we did some of the prep work to fix AER-3639. You can read its release notes and download it here. The improvement is code complete pending review, expected to go out within the next couple releases. Stay tuned.

Regards,

Maud


#5

@Mnemaudsyne,

Great - will give it a try!

Thanks!