Seeing errors during connect


#1

Hi,

We are using Aerospike server version 3.7.5.1 in production on AWS, front-ended by AWS Lambda APIs in Python.

We use the Aerospike Python client in AWS Lambda to connect to the server. What we are seeing is that a decent majority of requests fail with the error "ClientError: (-1L, 'Failed to seed cluster', 'src/main/aerospike/as_cluster.c', 417)". The server itself is working fine; I see these errors after about one second of trying to connect. We receive a good number of requests, and we open a new connection to the server for every request.

Questions in this regard:

  1. The server health is fine. Are there any specific reasons for this error, and is it possible to avoid it? Our timeout is 5 seconds, yet we get this error after just one second. Are we crossing a limit somewhere?
  2. We are seeing significant CPU spikes on the server at around 500 TPS. Is this because a connection is created for every API call?
  3. If the answer to 2) is yes, can we create a connection upfront and reuse it for subsequent requests in AWS Lambda? Also, how long would such a connection stay alive in that case?

Please let me know if you need anything else from our end.

Thanks, Karthik


#2

Does AQL work ok? How many hops between your client and your cluster?


#3

Yes, it is working fine. One hop only. Client directly connects to the cluster.


#4

You’re saying that you connect to the server for every request. Why are you doing that? Why not keep the connection open and continue to use the same client object? There is no need to reconnect for every request.
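In a Lambda, one common way to do this is to create the client at module scope so it survives across invocations of the same execution environment. A minimal sketch, with a stub `connect_client` standing in for the real `aerospike.client(config).connect()` call (the host config and helper name are assumptions, not from this thread):

```python
# Hypothetical sketch: reuse one client across Lambda invocations.
# connect_client() is a stand-in for aerospike.client(config).connect().

def connect_client():
    """Stub for the real Aerospike connect call (assumed, not from the thread)."""
    return object()  # a real aerospike client object in practice

# Runs once per Lambda container, not once per request.
CLIENT = connect_client()

def handler(event, context):
    # Reuse the module-level client instead of reconnecting per request.
    return {"connected": CLIENT is not None}
```

Module-level code runs only on a cold start, so warm invocations of the same container reuse the existing connection.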


#5

Any idea how long the client object remains valid before a timeout, if there is one? How do we handle situations like the server crashing or being replaced? How do we invalidate stale client objects? Please keep in mind that I am using AWS Lambda. Is there any other way to refresh them?


#6

Assuming you have a cluster (more than one node), the cluster keeps track of which nodes are active and sends updates to the client. If a node goes down or comes online, this detail is sent to the client (typically in about one second or less; the interval is configurable). This way the client stays up to date on which servers to call. The client object should stay connected unless you close it; there is no timeout for idle activity. You can check in your code that the client object is still valid by checking the is-connected boolean, e.g. `if (client.isConnected()) { // we're ok }`.

Not sure what you mean by “Any other way to refresh them?”. Like I said, you should be able to keep using the same client object 🙂
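For the refresh question, one approach is a lazy-reconnect guard: keep one cached client and only rebuild it if the connectivity check fails. A sketch, where `make_client` is a stand-in for `aerospike.client(config).connect()` (note that the Python client spells the check `is_connected()`, while the Java client uses `isConnected()`):

```python
# Sketch of invalidating a stale client (make_client() is a hypothetical
# stand-in for aerospike.client(config).connect()).

class _StubClient:
    """Minimal stand-in exposing the is_connected() check."""
    def is_connected(self):
        return True

def make_client():
    return _StubClient()

_client = None

def get_client():
    """Return the cached client, reconnecting only if it has gone stale."""
    global _client
    if _client is None or not _client.is_connected():
        _client = make_client()
    return _client
```

Repeated calls return the same object, so a new connection is made only when the old one reports itself disconnected.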


#7

Yep, I have the isConnected check.

Also, Albot, I see the error “AEROSPIKE_ERR_DEVICE_OVERLOAD” in my logs quite often, especially while doing a write. My write load is not high, around 250–300 TPS. Not sure what I could be missing here.


#8

Are you connecting over and over from the same app (destroying and re-creating the client object)? That’s what it sounded like.


#9

I think the error you mentioned in your last post means your disk is too slow. What does `iostat -x` look like during your writes?


#10

Yes, that is correct. I will try creating a client upfront and reusing it in the application for every read and write. I will get back if I still see the connect errors.

Here is the iostat output:

```
Linux 3.13.0-92-generic ()  07/09/2017  x86_64  (2 CPU)

avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           1.91   0.00    1.05    0.51    0.06  96.47

Device: rrqm/s wrqm/s   r/s   w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda      0.00   0.62  0.03  9.40   3.04  1137.43   241.70     0.05  5.12    2.90    5.13  1.25  1.18
```

The server is an m4.large instance with a root volume of type “Provisioned IOPS SSD” provisioned at 3000 IOPS.


#11

3000 IOPS does not mean you’ll get 3000 TPS; one write operation can take several IOPS. Maybe try running ACT against one of the instances to see how well it performs. Are you using “file” as your device storage? You should get better performance with a raw partition assigned instead of going through the kernel filesystem.
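For illustration, a hypothetical `aerospike.conf` namespace stanza pointing `storage-engine` at a raw partition rather than a file (the device path, namespace name, and sizes below are placeholders, not taken from this thread):

```
namespace test {
    replication-factor 2
    memory-size 4G

    storage-engine device {
        device /dev/xvdb        # raw partition, bypasses the kernel filesystem
        write-block-size 128K
    }
}
```

The file-based alternative would use a `file /path/to/data.dat` line plus a `filesize` setting inside the same `storage-engine device` block; the raw-device form avoids the filesystem layer on the write path.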