Aerospike cluster crashed after index creation

aws

#1

We have a cluster on AWS of 4 t2.micro machines (1 CPU, 1 GB RAM, 15 GB SSD) that we were using to test Aerospike. We used the AWS Marketplace AMI to install Aerospike v3 Community Edition, and changed only the aerospike.conf file to put a namespace on disk.

We had one namespace with two sets, totaling 18M records, occupying 2 GB of RAM and approximately 40 GB of disk space. After creating an index on a 12M-record set, the system crashed.
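In hindsight, a rough sizing check (assuming roughly 64 bytes of RAM per record for the primary index, with secondary indexes needing additional RAM on top) shows how little headroom 1 GB nodes leave for an index build:

```shell
# Back-of-the-envelope primary-index RAM, assuming ~64 bytes per record.
# Secondary indexes (like the one being built) consume additional RAM.
records=18000000
nodes=4
echo "cluster-wide: $((records * 64 / 1024 / 1024)) MiB"
echo "per node:     $((records * 64 / nodes / 1024 / 1024)) MiB"
```

That is roughly 1.1 GiB of primary index cluster-wide, about 275 MiB per node, before the secondary index on 12M string bins is added on top.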

Some info:

aql on the instance:

[ec2-user@ip-172-XX-XX-XXX ~]$ aql
2015-09-16 18:44:37 WARN AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to seed cluster

Tail of the log (it keeps appending only repetitions of these lines):

Sep 16 2015 19:08:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:08:46 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:06 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)

asmonitor:

$ asmonitor -h 54.XX.XXX.XX
request to 54.XX.XXX.XX : 3000 returned error
skipping 54.XX.XXX.XX:3000
***failed to connect to any hosts

asadm:

$ asadm -h 54.XXX.XXX.XX -p 3000
Aerospike Interactive Shell, version 0.0.10-6-gdd6fb61
Found 1 nodes
Offline: 54.XXX.XXX.XX:3000

We tried restarting the instances; one of them came back but is working as a standalone node, while the rest are in the state described above. The instances themselves are running, but the Aerospike service is not.


#2

The log lines you provided are periodic (every 20 seconds) messages from the defrag workers and are normal. It would not be normal for those to be the only lines printed to the log; can you confirm that this is the case?

Also, what version of Aerospike is running on that node?

yum list installed | grep aerospike

#3

Thanks for your attention,

If I understood your question correctly, then those lines are indeed not normal in this case: at the moment the log ( /var/log/aerospike/aerospike.log ) contains only those lines, repeated over and over.

Another instance of the same cluster, which came back to life as a standalone node, shows the following lines at the end of its log:

Sep 16 2015 20:24:14 GMT: INFO (info): (thr_info.c::4870)    heartbeat_stats: bt 0 bf 0 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 1 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 12939 rc 12939 
Sep 16 2015 20:24:14 GMT: INFO (info): (thr_info.c::4882)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Sep 16 2015 20:24:14 GMT: INFO (info): (thr_info.c::4923) namespace teste: disk inuse: 6979354496 memory inuse: 267396864 (bytes) sindex memory inuse: 0 (bytes) avail pct 45 cache-read pct 0.00
Sep 16 2015 20:24:14 GMT: INFO (info): (thr_info.c::4943)    partitions: actual 4096 sync 0 desync 0 zombie 0 wait 0 absent 0
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: reads (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Sep 16 2015 20:24:14 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count

Also:

[ec2-user@ip-172-XX-XX-XXX ~]$ yum list installed | grep aerospike
aerospike-amc-community.x86_64        3.6.2-el5                    installed    
aerospike-server-community.x86_64     3.6.0-1.el6                  installed    
aerospike-tools.x86_64                3.6.0-1.el6                  installed 

If there is anything else I can do to get more information, please let me know.

Thank you.


#4

Those log lines are also normal tickers.

Could you send the logs from a node that crashed, covering the time around the crash? You can filter out the thr_info.c, hist.c, and drv_ssd.c lines if you like.
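For example (assuming the log path mentioned earlier in the thread; adjust if yours differs):

```shell
# Strip the periodic ticker lines (defrag, info, histogram dumps)
# before sending the log.
grep -vE 'thr_info\.c|hist\.c|drv_ssd\.c' /var/log/aerospike/aerospike.log > aerospike_filtered.log
```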


#5

Hello again. I looked at the log and found some strange lines:

Sep 16 2015 15:24:09 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.11:3002: timed out
Sep 16 2015 15:24:18 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140507878 last 140505346
Sep 16 2015 15:24:18 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb94d8d470e4a02
Sep 16 2015 15:24:18 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb94d8d470e4a02
Sep 16 2015 15:24:18 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb94d8d470e4a02 departed
Sep 16 2015 15:24:19 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb94d8d470e4a02
Sep 16 2015 15:24:19 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [4.0]@bb98ff734d76502 ClSz = 4 ; # change = 1 : DEL bb94d8d470e4a02; 
Sep 16 2015 15:24:19 GMT: INFO (partition): (partition.c::392) DISALLOW MIGRATIONS
Sep 16 2015 15:24:19 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb97825e9569902
Sep 16 2015 15:24:19 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9284d86eee202
Sep 16 2015 15:24:19 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [4.0]@bb98ff734d76502: bb98ff734d76502 bb97825e9569902 bb9284d86eee202 
Sep 16 2015 15:24:20 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb97825e9569902
Sep 16 2015 15:24:20 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb9284d86eee202
Sep 16 2015 15:24:20 GMT: INFO (paxos): (paxos.c::2929) All partition data has been received by principal
Sep 16 2015 15:24:20 GMT: INFO (paxos): (paxos.c::2932) Sending paxos partition sync message to bb97825e9569902
Sep 16 2015 15:24:20 GMT: INFO (paxos): (paxos.c::2932) Sending paxos partition sync message to bb9284d86eee202
Sep 16 2015 15:24:20 GMT: INFO (config): (cluster_config.c::406) Rack Aware is disabled.
Sep 16 2015 15:24:20 GMT: INFO (partition): (cluster_config.c::368) Rack Aware is disabled.
Sep 16 2015 15:24:20 GMT: INFO (partition): (partition.c::2906) CLUSTER SIZE = 3
Sep 16 2015 15:24:20 GMT: INFO (partition): (partition.c::2945) Global state is well formed
Sep 16 2015 15:24:20 GMT: INFO (paxos): (partition.c::2607) setting replication factors: cluster size 3, paxos single replica limit 1
Sep 16 2015 15:24:20 GMT: INFO (paxos): (partition.c::2614) {teste} replication factor is 1
Sep 16 2015 15:24:20 GMT: INFO (partition): (partition.c::3837) {teste} re-balanced, expect migrations: out 0, in 0, out-later 0
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:20 GMT: INFO (paxos): (partition.c::3861) global partition state: total 4096 lost 1054 unique 3042 duplicate 0
Sep 16 2015 15:24:20 GMT: INFO (paxos): (partition.c::3862) partition state after fixing lost partitions (master): total 4096 lost 0 unique 4096 duplicate 0
Sep 16 2015 15:24:20 GMT: INFO (paxos): (partition.c::3863) 706 new partition version tree paths generated
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:20 GMT: INFO (partition): (partition.c::371) ALLOW MIGRATIONS
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:20 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB9284D86EEE202 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testewebsiteidindextsk" value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testewebsiteidindextsk;numbins=1;indexdata=ClientId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testedateindextsk" value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testedateindextsk;numbins=1;indexdata=EntryDateTime,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:idx_testes_myset_DomainUserId" value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=myset;indexname=idx_testes_myset_DomainUserId;numbins=1;indexdata=DomainUserId,STRING;priority=normal"!
Sep 16 2015 15:24:21 GMT: WARNING (smd): (system_metadata.c::2583) found existing metadata item: node: 0BB97825E9569902 module: "sindex_module" key: "teste:testesessionindextsk" value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal" ~~ Replacing with value: "ns=teste;set=danielset;indexname=testesessionindextsk;numbins=1;indexdata=SessionId,STRING;priority=normal"!
Sep 16 2015 15:24:29 GMT: INFO (nsup): (thr_nsup.c::1237) {teste} Records: 4550268, 4550268 0-vt, 0(0) expired, 0(0) evicted, 0(2728322) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 243716 ms
Sep 16 2015 15:24:30 GMT: INFO (nsup): (thr_nsup.c::1319) {teste} nsup start
Sep 16 2015 15:24:30 GMT: INFO (namespace): (namespace.c::448) {teste} hwm_breached true (disk), stop_writes false, memory sz:465959252 (291217152 + 0) hwm:566231040 sw:849346560, disk sz:7635163136 hwm:6979321856
Sep 16 2015 15:24:34 GMT: INFO (nsup): (thr_nsup.c::1171) {teste} evicting using ttl histogram
Sep 16 2015 15:24:34 GMT: WARNING (nsup): (thr_nsup.c::1177) {teste} can't evict - no records eligible
Sep 16 2015 15:24:34 GMT: INFO (nsup): (thr_nsup.c::1504) {teste} evict ttls 0,0,0.000
Sep 16 2015 15:24:36 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.11:3002: timed out
Sep 16 2015 15:24:44 GMT: INFO (nsup): (thr_nsup.c::1237) {teste} Records: 4550268, 4550268 0-vt, 0(0) expired, 0(0) evicted, 0(2728322) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 14065 ms

And some time after that:

Sep 16 2015 15:25:52 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140602129 last 140597393
Sep 16 2015 15:25:54 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb97825e9569902
Sep 16 2015 15:25:54 GMT: INFO (hb): (hb.c::1501) as_hb_tcp_send cf_socket_sendto() fd 183 failed
Sep 16 2015 15:25:54 GMT: INFO (hb): (hb.c::1501) as_hb_tcp_send cf_socket_sendto() fd 215 failed
Sep 16 2015 15:25:55 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb97825e9569902
Sep 16 2015 15:25:55 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.11:3002: timed out
Sep 16 2015 15:25:57 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb97825e9569902 departed
Sep 16 2015 15:25:57 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb97825e9569902
Sep 16 2015 15:25:58 GMT: INFO (hb): (hb.c::2332) HB node bb9284d86eee202 in different cluster - succession lists don't match
Sep 16 2015 15:25:59 GMT: INFO (hb): (hb.c::2401) new heartbeat received: bb97825e9569902 principal node is bb97825e9569902
Sep 16 2015 15:25:58 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [5.0]@bb98ff734d76502 ClSz = 3 ; # change = 1 : DEL bb97825e9569902; 
Sep 16 2015 15:25:59 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9284d86eee202 and self bb98ff734d76502
Sep 16 2015 15:25:59 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb97825e9569902 and self bb98ff734d76502
Sep 16 2015 15:25:59 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:25:59 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb97825e9569902 arrived
Sep 16 2015 15:25:59 GMT: INFO (paxos): (paxos.c::1093) Node bb97825e9569902 revived

And:

Sep 16 2015 15:26:03 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9284d86eee202 and self bb98ff734d76502
Sep 16 2015 15:26:03 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb97825e9569902 and self bb98ff734d76502
Sep 16 2015 15:26:03 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:09 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140619415 last 140616598
Sep 16 2015 15:26:16 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb9284d86eee202
Sep 16 2015 15:26:17 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140626984 last 140615403
Sep 16 2015 15:26:17 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb97825e9569902
Sep 16 2015 15:26:18 GMT: INFO (hb): (hb.c::1501) as_hb_tcp_send cf_socket_sendto() fd 73 failed
Sep 16 2015 15:26:18 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb9284d86eee202
Sep 16 2015 15:26:18 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb97825e9569902
Sep 16 2015 15:26:18 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb9284d86eee202 departed
Sep 16 2015 15:26:18 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb9284d86eee202
Sep 16 2015 15:26:18 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb97825e9569902 departed
Sep 16 2015 15:26:18 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb97825e9569902
Sep 16 2015 15:26:19 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [6.0]@bb98ff734d76502 ClSz = 3 ; # change = 2 : DEL bb9284d86eee202; DEL bb97825e9569902; 
Sep 16 2015 15:26:19 GMT: INFO (hb): (hb.c::2401) new heartbeat received: bb97825e9569902 principal node is bb97825e9569902
Sep 16 2015 15:26:20 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb97825e9569902 arrived
Sep 16 2015 15:26:20 GMT: INFO (hb): (hb.c::1501) as_hb_tcp_send cf_socket_sendto() fd 77 failed
Sep 16 2015 15:26:20 GMT: INFO (paxos): (paxos.c::894) Departed node bb97825e9569902 is not found in paxos succession list
Sep 16 2015 15:26:20 GMT: INFO (partition): (partition.c::392) DISALLOW MIGRATIONS
Sep 16 2015 15:26:20 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [6.0]@bb98ff734d76502: bb98ff734d76502 
Sep 16 2015 15:26:20 GMT: INFO (paxos): (paxos.c::2775) SINGLE NODE CLUSTER!!!
Sep 16 2015 15:26:20 GMT: INFO (hb): (hb.c::2401) new heartbeat received: bb9284d86eee202 principal node is bb97825e9569902
Sep 16 2015 15:26:21 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb9284d86eee202 arrived
Sep 16 2015 15:26:21 GMT: INFO (config): (cluster_config.c::406) Rack Aware is disabled.
Sep 16 2015 15:26:21 GMT: INFO (partition): (cluster_config.c::368) Rack Aware is disabled.
Sep 16 2015 15:26:21 GMT: INFO (partition): (partition.c::2906) CLUSTER SIZE = 1
Sep 16 2015 15:26:21 GMT: INFO (partition): (partition.c::2945) Global state is well formed
Sep 16 2015 15:26:21 GMT: INFO (paxos): (partition.c::2607) setting replication factors: cluster size 1, paxos single replica limit 1
Sep 16 2015 15:26:21 GMT: INFO (paxos): (partition.c::2614) {teste} replication factor is 1
Sep 16 2015 15:26:23 GMT: INFO (partition): (partition.c::3837) {teste} re-balanced, expect migrations: out 0, in 0, out-later 0
Sep 16 2015 15:26:25 GMT: INFO (paxos): (partition.c::3861) global partition state: total 4096 lost 2726 unique 1370 duplicate 0
Sep 16 2015 15:26:25 GMT: INFO (paxos): (partition.c::3862) partition state after fixing lost partitions (master): total 4096 lost 0 unique 4096 duplicate 0
Sep 16 2015 15:26:25 GMT: INFO (paxos): (partition.c::3863) 0 new partition version tree paths generated
Sep 16 2015 15:26:24 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.11:3002: timed out
Sep 16 2015 15:26:25 GMT: INFO (partition): (partition.c::371) ALLOW MIGRATIONS
Sep 16 2015 15:26:28 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140638499 last 140635916
Sep 16 2015 15:26:30 GMT: INFO (nsup): (thr_nsup.c::1319) {teste} nsup start
Sep 16 2015 15:26:31 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [7.0]@bb98ff734d76502 ClSz = 1 ; # change = 1 : ADD bb97825e9569902; 
Sep 16 2015 15:26:33 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb97825e9569902
Sep 16 2015 15:26:34 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb97825e9569902
Sep 16 2015 15:26:34 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb97825e9569902 departed
Sep 16 2015 15:26:34 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb97825e9569902
Sep 16 2015 15:26:35 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9284d86eee202 and self bb98ff734d76502
Sep 16 2015 15:26:35 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:35 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140644686 last 140636166
Sep 16 2015 15:26:35 GMT: INFO (hb): (hb.c::2215) hb expires but fabric says DEAD: node bb9284d86eee202
Sep 16 2015 15:26:35 GMT: INFO (hb): (hb.c::2395) removing node on heartbeat failure: bb9284d86eee202
Sep 16 2015 15:26:35 GMT: INFO (fabric): (fabric.c::1824) fabric: node bb9284d86eee202 departed
Sep 16 2015 15:26:35 GMT: INFO (fabric): (fabric.c::1734) fabric disconnecting node: bb9284d86eee202
Sep 16 2015 15:26:35 GMT: INFO (namespace): (namespace.c::448) {teste} hwm_breached true (disk), stop_writes false, memory sz:470532071 (291217152 + 0) hwm:566231040 sw:849346560, disk sz:7635163136 hwm:6979321856
Sep 16 2015 15:26:35 GMT: WARNING (fabric): (fabric.c::2443) No fabric transmit structure in global hash for fabric transaction-id 49
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [8.0]@bb98ff734d76502 ClSz = 1 ; # change = 1 : ADD bb9284d86eee202; 
Sep 16 2015 15:26:36 GMT: INFO (hb): (hb.c::2401) new heartbeat received: bb97825e9569902 principal node is bb97825e9569902
Sep 16 2015 15:26:36 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb97825e9569902 arrived
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb97825e9569902 and self bb98ff734d76502
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb97825e9569902 and self bb98ff734d76502
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb97825e9569902 and self bb98ff734d76502
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:36 GMT: INFO (paxos): (paxos.c::1931) as_paxos_spark establishing transaction [9.0]@bb98ff734d76502 ClSz = 1 ; # change = 1 : ADD bb97825e9569902; 
Sep 16 2015 15:26:37 GMT: INFO (partition): (partition.c::392) DISALLOW MIGRATIONS
Sep 16 2015 15:26:36 GMT: INFO (hb): (hb.c::1501) as_hb_tcp_send cf_socket_sendto() fd 73 failed
Sep 16 2015 15:26:36 GMT: INFO (hb): (hb.c::2202) hb considers expiring: now 140646412 last 140638777
Sep 16 2015 15:26:37 GMT: INFO (hb): (hb.c::2224) hb expires but fabric says ALIVE: lasttime 1084 node bb97825e9569902
Sep 16 2015 15:26:37 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb97825e9569902
Sep 16 2015 15:26:37 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [7.0]@bb98ff734d76502: bb98ff734d76502 bb97825e9569902 
Sep 16 2015 15:26:37 GMT: WARNING (fabric): (fabric.c::2443) No fabric transmit structure in global hash for fabric transaction-id 53
Sep 16 2015 15:26:37 GMT: WARNING (fabric): (fabric.c::2443) No fabric transmit structure in global hash for fabric transaction-id 51
Sep 16 2015 15:26:37 GMT: WARNING (fabric): (fabric.c::2443) No fabric transmit structure in global hash for fabric transaction-id 53
Sep 16 2015 15:26:38 GMT: INFO (partition): (partition.c::392) DISALLOW MIGRATIONS
Sep 16 2015 15:26:38 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb97825e9569902
Sep 16 2015 15:26:38 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9284d86eee202
Sep 16 2015 15:26:38 GMT: WARNING (paxos): (paxos.c::1441) sync message lost in fabric
Sep 16 2015 15:26:38 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [8.0]@bb98ff734d76502: bb98ff734d76502 bb97825e9569902 bb9284d86eee202 
Sep 16 2015 15:26:38 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb97825e9569902
Sep 16 2015 15:26:38 GMT: INFO (hb): (hb.c::2401) new heartbeat received: bb9284d86eee202 principal node is bb97825e9569902
Sep 16 2015 15:26:38 GMT: INFO (fabric): (fabric.c::1818) fabric: node bb9284d86eee202 arrived
Sep 16 2015 15:26:38 GMT: WARNING (cf:msg): (msg.c:msg_parse:332) malloc
Sep 16 2015 15:26:39 GMT: WARNING (as): (signal.c::161) SIGSEGV received, aborting Aerospike Community Edition build 3.6.0 os el6
Sep 16 2015 15:26:41 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb9284d86eee202 and self bb98ff734d76502
Sep 16 2015 15:26:41 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb97825e9569902,bb9284d86eee202
Sep 16 2015 15:26:41 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb98ff734d76502 retransmitting sync messages to nodes that have not responded yet ... 
Sep 16 2015 15:26:42 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb9284d86eee202
Sep 16 2015 15:26:44 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [8.0]@bb98ff734d76502: bb98ff734d76502 bb97825e9569902 bb9284d86eee202 
Sep 16 2015 15:26:45 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb9284d86eee202
Sep 16 2015 15:26:47 GMT: INFO (paxos): (paxos.c::2929) All partition data has been received by principal
Sep 16 2015 15:26:47 GMT: INFO (paxos): (paxos.c::2932) Sending paxos partition sync message to bb97825e9569902
Sep 16 2015 15:26:51 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.11:3002: timed out
Sep 16 2015 15:26:53 GMT: INFO (paxos): (paxos.c::2932) Sending paxos partition sync message to bb9284d86eee202
Sep 16 2015 15:26:58 GMT: INFO (config): (cluster_config.c::406) Rack Aware is disabled.
Sep 16 2015 15:26:59 GMT: INFO (partition): (cluster_config.c::368) Rack Aware is disabled.
Sep 16 2015 15:26:59 GMT: INFO (partition): (partition.c::2906) CLUSTER SIZE = 3
Sep 16 2015 15:26:59 GMT: INFO (partition): (partition.c::2945) Global state is well formed
Sep 16 2015 15:26:59 GMT: INFO (paxos): (partition.c::2607) setting replication factors: cluster size 3, paxos single replica limit 1
Sep 16 2015 15:26:59 GMT: INFO (paxos): (partition.c::2614) {teste} replication factor is 1
Sep 16 2015 15:27:00 GMT: INFO (partition): (partition.c::3837) {teste} re-balanced, expect migrations: out 2726, in 1370, out-later 0
Sep 16 2015 15:27:01 GMT: INFO (paxos): (partition.c::3861) global partition state: total 4096 lost 0 unique 0 duplicate 4096
Sep 16 2015 15:27:01 GMT: INFO (paxos): (partition.c::3862) partition state after fixing lost partitions (master): total 4096 lost 0 unique 0 duplicate 4096
Sep 16 2015 15:27:01 GMT: INFO (paxos): (partition.c::3863) 0 new partition version tree paths generated
Sep 16 2015 15:27:01 GMT: INFO (partition): (partition.c::371) ALLOW MIGRATIONS
Sep 16 2015 15:27:05 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb97825e9569902
Sep 16 2015 15:27:05 GMT: CRITICAL (migrate): (migrate.c:migrate_msg_fn:1476) malloc
Sep 16 2015 15:27:06 GMT: CRITICAL (migrate): (migrate.c:migrate_msg_fn:1476) malloc
Sep 16 2015 15:27:08 GMT: INFO (paxos): (paxos.c::2920) Re-sending paxos partition sync message to bb97825e9569902
Sep 16 2015 15:27:11 GMT: WARNING (as): (signal.c::163) stacktrace: found 0 frames
Sep 16 2015 15:27:15 GMT: INFO (paxos): (paxos.c::2900) Received paxos partition sync request from bb9284d86eee202
Sep 16 2015 15:27:20 GMT: INFO (paxos): (paxos.c::2920) Re-sending paxos partition sync message to bb9284d86eee202
Sep 16 2015 15:27:19 GMT: WARNING (as): (signal.c::94) SIGABRT received, aborting Aerospike Community Edition build 3.6.0 os el6
Sep 16 2015 15:27:21 GMT: WARNING (as): (signal.c::96) stacktrace: found 0 frames
Sep 16 2015 15:27:19 GMT: WARNING (as): (signal.c::94) SIGABRT received, aborting Aerospike Community Edition build 3.6.0 os el6
Sep 16 2015 15:27:21 GMT: WARNING (as): (signal.c::96) stacktrace: found 0 frames
Sep 16 2015 15:27:21 GMT: WARNING (as): (signal.c::77) could not register default signal handler for 6
Sep 16 2015 15:27:21 GMT: WARNING (as): (signal.c::163) stacktrace: found no symbols

#6

It looks like the heartbeats were failing to reach their destination in a timely manner. The networking on t2.micro instances hasn't been very reliable in my experience. I would recommend using an instance type with better networking (t2.large instances have worked well for me). Also, are you using our recommended heartbeat settings for AWS?
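For reference, multicast doesn't work on EC2, so on AWS the heartbeat should run in mesh mode. A sketch of the relevant aerospike.conf stanza follows; the seed IP is taken from your logs, and the interval/timeout values are illustrative (check the Aerospike AWS deployment docs for the currently recommended numbers):

```
heartbeat {
    mode mesh
    port 3002
    mesh-seed-address-port 10.10.10.11 3002  # repeat for each node's private IP
    interval 150   # ms between heartbeats
    timeout 20     # missed heartbeats before a node is considered dead
}
```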