Seeing errors during connect


#1

Hi,

We are using Aerospike server version 3.7.5.1 in production on AWS, front-ended by AWS Lambda APIs in Python.

We use the Aerospike Python client in AWS Lambda to connect to the server. What we are seeing is that a decent majority of requests fail with the error "ClientError: (-1L, 'Failed to seed cluster', 'src/main/aerospike/as_cluster.c', 417)". The server itself is working fine; I see these errors after about one second of trying to connect. We receive a good number of requests, and we open a new connection to the server for every request.

Questions in this regard:

  1. The server health is fine. Are there any specific reasons for this error, and is it possible to avoid it? Our timeout is 5 seconds, yet we get this error after just one second. Are we crossing a limit somewhere?
  2. We are seeing significant CPU spikes on the server at around 500 TPS. Is this because a connection is created for every API call?
  3. If the answer to 2) is yes, can we create a connection upfront and reuse it for subsequent requests in AWS Lambda? Also, how long would such a connection stay alive in that case?

Please let me know if you need anything else from our end.

Thanks, Karthik


#2

Does AQL work ok? How many hops between your client and your cluster?


#3

Yes, it is working fine. One hop only. Client directly connects to the cluster.


#4

You’re saying that you connect to the server for every request. Why are you doing that? Why not keep the connection open and continue to use the same client object? There is no need to reconnect for every request.
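In a Lambda, one common way to do this is to create the client at module scope so it survives across invocations of the same execution environment. A minimal sketch, with a stub `connect_client` standing in for the real `aerospike.client(config).connect()` call (the host config and helper name are assumptions, not from this thread):

```python
# Hypothetical sketch: reuse one client across Lambda invocations.
# connect_client() is a stand-in for aerospike.client(config).connect().

def connect_client():
    """Stub for the real Aerospike connect call (assumed, not from the thread)."""
    return object()  # a real aerospike client object in practice

# Runs once per Lambda container, not once per request.
CLIENT = connect_client()

def handler(event, context):
    # Reuse the module-level client instead of reconnecting per request.
    return {"connected": CLIENT is not None}
```

Module-level code runs only on a cold start, so warm invocations of the same container reuse the existing connection.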


#5

Any idea how long the client object remains valid before a timeout, if there is one? How do we handle situations like the server crashing or being replaced? How do we invalidate stale client objects? Please keep in mind that I am using AWS Lambda. Is there any other way to refresh them?


#6

Assuming you have a cluster (more than one node), the cluster keeps track of which nodes are active and sends updates to the client. If a node goes down or comes online, this detail is sent to the client (typically in about one second or less; the interval is configurable). This way the client stays up to date on which servers to call. The client object should stay connected unless you close it; there is no timeout for idle activity. You can check in your code that the client object is still valid by checking the is-connected boolean, e.g. `if (client.isConnected()) { // we're ok }`.

Not sure what you mean by “Any other way to refresh them?”. Like I said, you should be able to keep using the same client object 🙂
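For the refresh question, one approach is a lazy-reconnect guard: keep one cached client and only rebuild it if the connectivity check fails. A sketch, where `make_client` is a stand-in for `aerospike.client(config).connect()` (note that the Python client spells the check `is_connected()`, while the Java client uses `isConnected()`):

```python
# Sketch of invalidating a stale client (make_client() is a hypothetical
# stand-in for aerospike.client(config).connect()).

class _StubClient:
    """Minimal stand-in exposing the is_connected() check."""
    def is_connected(self):
        return True

def make_client():
    return _StubClient()

_client = None

def get_client():
    """Return the cached client, reconnecting only if it has gone stale."""
    global _client
    if _client is None or not _client.is_connected():
        _client = make_client()
    return _client
```

Repeated calls return the same object, so a new connection is made only when the old one reports itself disconnected.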


#7

Yep, I have the isConnected check.

Also, Albot, I see the error “AEROSPIKE_ERR_DEVICE_OVERLOAD” in my logs quite often, especially while doing a write. My write load is not high, around 250–300 TPS. Not sure what I could be missing here.


#8

Are you connecting over and over from the same app (destroying and re-creating the client object)? That’s what it sounded like.


#9

I think the error you mentioned in your last post means your disk is too slow. What does `iostat -x` look like during your writes?


#10

Yes, that is correct. I will try creating a client upfront and reusing it in the application for every read and write. I will get back if I still see the connect errors.

Here is the iostat output:

```
Linux 3.13.0-92-generic ()  07/09/2017  x86_64  (2 CPU)

avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           1.91   0.00    1.05    0.51    0.06  96.47

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda      0.00   0.62  0.03  9.40   3.04  1137.43   241.70     0.05  5.12    2.90    5.13  1.25  1.18
```

The server is an m4.large instance with a root volume of type “Provisioned IOPS SSD” provisioned at 3000 IOPS.


#11

3000 IOPS does not mean you’ll get 3000 TPS; one write operation can take several IOPS. Maybe try running ACT against one of the instances to see how well it performs. Are you using “file” as your device storage? You should get better performance with a raw partition assigned instead of going through the kernel filesystem.
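For illustration, a hypothetical `aerospike.conf` namespace stanza pointing `storage-engine` at a raw partition rather than a file (the device path, namespace name, and sizes below are placeholders, not taken from this thread):

```
namespace test {
    replication-factor 2
    memory-size 4G

    storage-engine device {
        device /dev/xvdb        # raw partition, bypasses the kernel filesystem
        write-block-size 128K
    }
}
```

The file-based alternative would use a `file /path/to/data.dat` line plus a `filesize` setting inside the same `storage-engine device` block; the raw-device form avoids the filesystem layer on the write path.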