Hi,
At my company we are evaluating Aerospike. We have a 2-node setup in AWS using the Aerospike AMI backed with bcache. One of these nodes becomes unresponsive every week. The cluster has a very low load, about 10 TPS. When the node crashes I cannot even SSH into it. I get several alerts from Nagios, such as CPU load too high, too many open files, etc. My only solution is to reboot the node; after that everything works fine.
- Instances are r3.large
- Not in a VPC, but both in the same region
- The EBS volume is bigger than the ephemeral SSD RAID
- Very small data set: 3 GB on disk, 700 MB in RAM
Any recommendations for finding the issue? I did not find any critical messages in the logs, but if I grep for warnings this is what I get:
aerospike.log-20150506:94:Apr 15 2015 05:52:39 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:145:Apr 15 2015 05:53:05 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:182:Apr 15 2015 05:53:32 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:237:Apr 15 2015 05:53:57 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150609.gz:55655:Jun 08 2015 09:40:05 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:55656:Jun 08 2015 09:40:05 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:55696:Jun 08 2015 09:40:09 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150609.gz:55722:Jun 08 2015 09:40:24 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150609.gz:70858:Jun 08 2015 13:12:12 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:70859:Jun 08 2015 13:12:12 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:75428:Jun 08 2015 13:45:23 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:75429:Jun 08 2015 13:45:23 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199688:Jun 09 2015 02:54:31 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199689:Jun 09 2015 02:54:31 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199770:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199773:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199776:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199777:Jun 09 2015 02:55:01 GMT: WARNING (smd): (system_metadata.c::1914) Null response message passed in transaction complete!
aerospike.log-20150609.gz:199780:Jun 09 2015 02:55:02 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199783:Jun 09 2015 02:55:02 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199784:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199785:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199786:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199787:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199788:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199789:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199790:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199791:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150619.gz:113415:Jun 18 2015 13:52:13 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150619.gz:113430:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150619.gz:113431:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150619.gz:113453:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::1591) No changes applied on paxos confirmation message, principal is bb93bcc0f0a0022. No sync messages will be sent
aerospike.log-20150622:228534:Jun 22 2015 09:35:47 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150622:228535:Jun 22 2015 09:35:47 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150622:228597:Jun 22 2015 09:36:09 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150623.gz:1380:Jun 22 2015 13:53:48 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150623.gz:3117:Jun 22 2015 14:09:22 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150623.gz:3118:Jun 22 2015 14:09:22 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150623.gz:3158:Jun 22 2015 14:09:28 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150623.gz:3207:Jun 22 2015 14:09:55 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150623.gz:3280:Jun 22 2015 14:10:21 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150627.gz:131929:Jun 26 2015 16:02:15 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150627.gz:131930:Jun 26 2015 16:02:15 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150628.gz:86335:Jun 27 2015 14:36:11 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150628.gz:86336:Jun 27 2015 14:36:11 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
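
In case it makes the pattern easier to see, here is a rough sketch of how the grep output above could be grouped by warning type, with the SIGTERM timestamps pulled out separately. The function name `summarize` and the regular expression are my own assumptions based on the line format shown above; they are not part of Aerospike or its tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the grep output above:
#   <file>:<lineno>:<Mon DD YYYY HH:MM:SS> GMT: WARNING (<context>): (<source>) <message>
LINE_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<lineno>\d+):"
    r"(?P<ts>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}) GMT: "
    r"WARNING \((?P<ctx>[^)]+)\): \((?P<src>[^)]+)\) (?P<msg>.*)$"
)

# Matches an IPv4 address followed by a port, e.g. 10.11.4.227:3002
PEER_RE = re.compile(r"\d+\.\d+\.\d+\.\d+:\d+")


def summarize(lines):
    """Count warnings per (context, message) and collect SIGTERM timestamps."""
    counts = Counter()
    sigterms = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a warning line
        # Replace peer addresses so the same warning groups together
        msg = PEER_RE.sub("<peer>", m["msg"])
        counts[(m["ctx"], msg)] += 1
        if "SIGTERM" in msg:
            sigterms.append(m["ts"])
    return counts, sigterms
```

Passing the lines of the grep output to `summarize` (for example from `sys.stdin`) returns the per-warning counts and the list of times a SIGTERM was logged, which can then be compared against the times of the Nagios alerts.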