Node crash in AWS using the Aerospike AMI

ami
aws

#1

Hi,

At my company we are evaluating Aerospike. We have a 2-node setup in AWS using the Aerospike AMI backed with bcache. One of these nodes becomes unresponsive every week. The cluster has a very low load, about 10 TPS more or less. When the node crashes I cannot even SSH into it. I get several alerts from Nagios like CPU load too high, too many open files, etc. My only solution is to reboot the node. After that, everything runs fine.

  • Instances are r3.large
  • Not in a VPC, but in the same region
  • The EBS volume is larger than the ephemeral SSD RAID
  • Very small data set: 3 GB on disk, 700 MB in RAM

Any recommendations for finding the issue? I did not find any critical messages in the logs, but if I grep for warnings this is what I get:

aerospike.log-20150506:94:Apr 15 2015 05:52:39 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:145:Apr 15 2015 05:53:05 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:182:Apr 15 2015 05:53:32 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.10.10.10:3002: timed out
aerospike.log-20150506:237:Apr 15 2015 05:53:57 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150609.gz:55655:Jun 08 2015 09:40:05 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:55656:Jun 08 2015 09:40:05 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:55696:Jun 08 2015 09:40:09 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150609.gz:55722:Jun 08 2015 09:40:24 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150609.gz:70858:Jun 08 2015 13:12:12 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:70859:Jun 08 2015 13:12:12 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:75428:Jun 08 2015 13:45:23 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:75429:Jun 08 2015 13:45:23 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199688:Jun 09 2015 02:54:31 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199689:Jun 09 2015 02:54:31 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150609.gz:199770:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199773:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199776:Jun 09 2015 02:55:00 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199777:Jun 09 2015 02:55:01 GMT: WARNING (smd): (system_metadata.c::1914) Null response message passed in transaction complete!
aerospike.log-20150609.gz:199780:Jun 09 2015 02:55:02 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199783:Jun 09 2015 02:55:02 GMT: WARNING (paxos): (paxos.c::2972) unable to apply partition sync message state
aerospike.log-20150609.gz:199784:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199785:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199786:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199787:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199788:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199789:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199790:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150609.gz:199791:Jun 09 2015 02:55:02 GMT: WARNING (fabric): (fabric.c::2405) No fabric transmit structure in global hash for fabric transaction-id 10
aerospike.log-20150619.gz:113415:Jun 18 2015 13:52:13 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150619.gz:113430:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150619.gz:113431:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150619.gz:113453:Jun 18 2015 13:52:17 GMT: WARNING (paxos): (paxos.c::1591) No changes applied on paxos confirmation message, principal is bb93bcc0f0a0022. No sync messages will be sent
aerospike.log-20150622:228534:Jun 22 2015 09:35:47 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150622:228535:Jun 22 2015 09:35:47 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150622:228597:Jun 22 2015 09:36:09 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150623.gz:1380:Jun 22 2015 13:53:48 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
aerospike.log-20150623.gz:3117:Jun 22 2015 14:09:22 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150623.gz:3118:Jun 22 2015 14:09:22 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150623.gz:3158:Jun 22 2015 14:09:28 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150623.gz:3207:Jun 22 2015 14:09:55 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150623.gz:3280:Jun 22 2015 14:10:21 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.11.4.227:3002: timed out
aerospike.log-20150627.gz:131929:Jun 26 2015 16:02:15 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150627.gz:131930:Jun 26 2015 16:02:15 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...
aerospike.log-20150628.gz:86335:Jun 27 2015 14:36:11 GMT: WARNING (paxos): (paxos.c::2116) quorum visibility lost! Continuing anyway ...
aerospike.log-20150628.gz:86336:Jun 27 2015 14:36:11 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway ...

#2

Hi Revington,

Please check out https://www.aerospike.com/docs/deploy_guides/aws/recommendations/ , the recommendations for running Aerospike on AWS.

We recommend not using EBS, since we found a few issues with the bcache implementation on AWS: https://www.aerospike.com/docs/operations/plan/ssd/bcache/

-samir


#3

Hello Samir,

Thank you very much for your answer. As I said before, we have a very low load. The bcache warning on the Aerospike website says:

Due to a bug in the bcache kernel module, we have observed processes locking up when writing heavily to the bcache device. We are actively looking at solutions around this issue.

10 TPS cannot be considered heavy writing.

Do you think this is related to the bcache bug?


#4

What are your heartbeat interval and timeout settings in your config? Setting up a cluster across regions is not a configuration we recommend, nor one we actively test. If you want to experiment with such a configuration, I would recommend increasing the heartbeat timeout value.

But I wouldn’t expect this to happen :confounded:.

How high is the CPU load when it spikes?

Do you track the number of open fds?

Does the number of open fds approach your ulimit -n?

Similarly, does the number of open fds approach Aerospike’s proto-fd-max? There is a log line printed periodically showing “fds - proto: (number of connections currently open between this node and clients, number of connections ever opened, number of connections ever closed)”. You should be able to grep for “fds - proto:”.
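A quick sketch of both checks, assuming a Linux host and the default log location /var/log/aerospike/aerospike.log (adjust the path to your install; if the asd daemon is not running, the script falls back to the current shell just to demonstrate the idea):

```shell
#!/bin/sh
# Compare the daemon's open-fd count against the per-process limit.
pid=$(pgrep -x asd | head -n1)
[ -z "$pid" ] && pid=$$   # fallback so the commands still run anywhere

echo "fd limit : $(ulimit -n)"
echo "open fds : $(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"

# Pull the most recent periodic client-connection counters from the log.
grep 'fds - proto' /var/log/aerospike/aerospike.log 2>/dev/null | tail -n 3
```

If the open-fd count trends toward the ulimit (or proto-fd-max) between restarts, that points at a connection leak rather than a bcache problem.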


#5

Thank you kporter. Sorry if my comment was not clear: the servers are in the same region. They are not in a VPC. The “fds - proto” values are very low compared with the value of ulimit -n.

The heartbeat was 150/10. I have changed it to 250/20 to see if that prevents this from happening. My main issue is that I have no clue what is going on.
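For reference, those two numbers correspond to the interval and timeout parameters of the heartbeat stanza in /etc/aerospike/aerospike.conf. A sketch of what the changed stanza might look like (mesh mode and the seed address are assumptions; substitute your own peer IP):

```
heartbeat {
    mode mesh
    port 3002                              # matches the :3002 connects in the logs above
    mesh-seed-address-port 10.10.10.10 3002  # placeholder peer address
    interval 250    # ms between heartbeats (was 150)
    timeout 20      # missed heartbeats before a node is declared dead (was 10)
}
```

With 250/20 a node is only declared dead after roughly 5 seconds of silence, which is more forgiving of EC2 network jitter than the 1.5 seconds implied by 150/10.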


#6

Hi revington,

I’d recommend we start by observing the behavior of the node from the time it is freshly restarted to the time it goes down:

What is the timestamp when the node was freshly started? And when the node goes down, what were the last lines at the end of the log?

What is the cpu utilization pattern during this period of time?

What is the memory usage pattern during this period of time?

When running AMC, do you see any abnormal pattern?

You said “too many open files” — how many files are actually open?


#7

Hi, has Aerospike released any fix for the timeout issue? Today I saw the same issue with an r3.xlarge instance without a VPC. Per the previous reply I increased the heartbeat settings to 200/20, but without success.

Sep 14 2015 16:31:24 GMT: INFO (as): (as.c::452) service ready: soon there will be cake!
Sep 14 2015 16:31:24 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.145.10.97:3002: timed out
Sep 14 2015 16:31:24 GMT: WARNING (cf:socket): (socket.c::371) Error in delayed connect() to 10.145.10.97:3002: timed out

Any help much appreciated.


Related topic: Error in delayed connect after “service ready” message
#8

I opened all ports for testing and it seems to be working fine… Previously I had opened only the ports mentioned on “http://www.aerospike.com/docs/deploy_guides/aws/install/”, but that did not work for me.

TCP 22 SSH Port for accessing the instance.

TCP 3000-3004 Aerospike ports, for clients and other servers to communicate with this instance.

TCP 8081 HTTP Port for accessing AMC via a web browser.
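As a sanity check from one node, you can probe whether the Aerospike ports on the peer are actually reachable. A rough sketch using bash’s /dev/tcp pseudo-device (HOST is a placeholder; substitute the other node’s private IP):

```shell
#!/bin/bash
# Probe the Aerospike service/fabric/heartbeat/info ports on a peer node.
HOST=${HOST:-127.0.0.1}   # placeholder -- set to the peer's private IP
for port in 3000 3001 3002 3003; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$HOST/$port" 2>/dev/null; then
    echo "port $port: open"
  else
    echo "port $port: closed or filtered"
  fi
done
```

If a port shows closed or filtered from the peer but the daemon is listening locally, the security group (or an OS firewall) is the likely culprit rather than Aerospike itself.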


#9

@laxman_Singh_Rathore,

In order to properly respond to your particular timeout issue, we have opened a new topic about it here. Please continue the discussion in that new topic.