Network throughput issues with asd running

Hi,

We’re running Aerospike 3.5.15 on EC2, on i2.2xlarge instances. Recently we noticed that on some instances, while the Aerospike daemon is running and even under light load, we start getting a lot of client timeouts and overall cluster throughput drops sharply. The ping time to the machine in question increases substantially (from sub-1ms to 30-40ms). As soon as the asd service is brought down, the ping times go back to normal. The number of open sockets on the box is also pretty high while the service is running. Have you seen this sort of issue before? Any suggestions on further investigation?

Thanks!

Btw, this happened again today on a different node in the cluster.

UPDATE: 10/08 - it just happened on yet another instance. When this happens, a very large number of requests end up timing out, so we end up removing that node from the cluster. Bringing it back up after a few hours or a day results in the same behavior.

This happened again today on a different EC2 instance.

@kporter, have you seen anything like this before? Any thoughts on what might help investigate this?

A couple of questions come to mind. Are you using EBS for your storage? What’s the average object size?

It would be interesting to enable micro-benchmark and storage-benchmark during these events.

Please see the documentation on micro-benchmarks and storage benchmarks. These should help you track whether certain latency spikes are occurring.
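
If I remember the 3.x parameter names correctly, both can be enabled dynamically with asinfo - treat this as a sketch:

# enable the extra latency histograms (3.x-era service-context parameters, from memory)
asinfo -v 'set-config:context=service;microbenchmarks=true'
asinfo -v 'set-config:context=service;storage-benchmarks=true'

# the histograms are written to the server log; asloglatency can then chart one, e.g. reads
asloglatency -h reads -l /var/log/aerospike/aerospike.log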

Also you may want to check on open sockets by running:

sudo lsof -p "$(pgrep -d, -f 'asd|cld')" 2>/dev/null

@lucien,

This happened again, this time on a new node as soon as I added it to the cluster.

We’re using SSD only, no EBS. Average object size should be pretty small - probably a few kb (our block size is 512k) - is there a stat for it somewhere? I saved the lsof output and it’s several MB so I’ve put it up here – http://eyeview-tmp.s3.amazonaws.com/lsof.out.gz

I also ran ss -s if that helps:

Total: 4422 (kernel 0)
TCP:   71049 (estab 1469, closed 63166, orphaned 4358, synrecv 0, timewait 62419/0), ports 0

Transport Total     IP        IPv6
*         0         -         -        
RAW       0         0         0        
UDP       9         5         4        
TCP       7883      7880      3        
INET      7892      7885      7        
FRAG      0         0         0        

As for the micro-benchmarks - once I enable them, what data would you like me to provide here?

Thanks!

Looks like you are holding on to half-open sockets.

asd       28322             root 2027u     sock                0,7       0t0    3479317 can't identify protocol
asd       28322             root 2028u     sock                0,7       0t0    3687513 can't identify protocol
asd       28322             root 2029u     sock                0,7       0t0    3687514 can't identify protocol
asd       28322             root 2030u     sock                0,7       0t0    3687515 can't identify protocol
asd       28322             root 2031u     sock                0,7       0t0    3687516 can't identify protocol
asd       28322             root 2032u     sock                0,7       0t0    3687517 can't identify protocol
asd       28322             root 2033u     sock                0,7       0t0    3264426 can't identify protocol
asd       28322             root 2034u     sock                0,7       0t0    3687518 can't identify protocol
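
To keep an eye on how quickly these accumulate, you could count them with something like this (a sketch, assuming a single asd process):

sudo lsof -p "$(pgrep -x asd)" | grep -c "can't identify protocol"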

Are the record sets being closed properly from the client (rs.close())?

Also, are you doing any queries on a non-existent set? (There was an issue fixed in 3.6.0.)

Please see the release notes for patches since 3.5.15:

http://www.aerospike.com/download/server/notes.html#3.6.3

@lucien,

The only place I see RecordSet used in the API is around query and we don’t use that - we only do (single and batch) put, get and execute. We only have 3 sets so we shouldn’t be querying any non-existing sets. Also, when I swap a node that starts having this problem with a new ec2 node the problem goes away.

Any other thoughts on further investigation? I didn’t see anything in particular in the 3.6 release notes that sounds like this issue, but we’ll upgrade; it’s just a matter of finding the time.

@lucien,

One thing I was thinking about is that we’re not currently using VPC with a Placement Group on EC2, as recommended by Aerospike. Do you think this could be related to EC2 Classic network performance? We did have an AWS engineer look into one of the instances that showed this issue and they didn’t find anything, but if you think this could be related please let me know. Unfortunately, moving to VPC with Placement Groups will require a full cluster migration to a whole new set of instances, which will take quite a while, so it doesn’t make sense to start unless we think it will fix this issue.

Thanks!

@naoum,

Are all the nodes in the same availability zone? At this point I can’t say for sure it’s a VPC issue, but being in the same availability zone would help stabilize the cluster.

@lucien,

We are currently running Aerospike across 2 AZs and we want to enable rack awareness, but have not yet done that (we want to stabilize things first). The clients are spread across 4 AZs. Being in only 1 AZ does not sound like a great high-availability strategy for when an AZ goes down.

There’s an interesting article on this topic of availability zones.

@naoum,

Linking to a related question you just asked about deploying on multiple AZs, in which we recommended the correct way of using Aerospike on EC2: building a cluster in a single AZ - better yet, a single placement group within it - and using either application-level synchronization with something like Kafka, or our Enterprise Edition feature, XDR (cross-datacenter replication), to replicate to another AZ or region.

Understood - we based this setup off of the Rack awareness on ec2 thread, but it seems we’re not as lucky as the client mentioned there. We’ll switch to a single AZ and see how it affects stability.

Thanks for your help!

@lucien,

I was just trying this today on a smaller, in-memory-only cluster. I set up asd 3.6.4 on a 2-node cluster (4 cores each), in a placement group, in VPC, in a single AZ. The cluster I am migrating from is 4 nodes (2 cores each) across 2 AZs. As soon as I switched the traffic over, the ping times went crazy, into the 50ms range. I saw some warnings about running out of fds, so I increased the limit, but that didn’t change anything. I also tried to bounce the nodes, one by one and then together, but at that point they couldn’t form a stable cluster anymore. I ended up switching back to the old cluster. CPU usage was light (30%), load average wasn’t high either - around 2 - and network I/O was average.

Sounds like there’s something else at play, not just VPC/networking. Are there any special/recommended kernel settings? Any max-connection settings that might affect us other than proto-fd-max (I just increased it to 50k)? We’re running Ubuntu 14.04 with kernel 3.13.0-36-generic.

Thanks!

P.S. I installed a new cluster with the same hardware setup as before (4 2-core machines across 2 AZs), but all inside VPC, and now I am back to normal. Having this work on more boxes makes me think we’re hitting some sort of per-box limit.

You mentioned the ping times going crazy as soon as you sent traffic. You went from a cluster of 4 nodes to a cluster of 2 nodes. It’s possible that a network bandwidth limit is being reached on these AWS instances now that the connections are hitting two nodes instead of four. It could also be a bandwidth issue in the shared AWS link between clients and servers.

You may be able to use the iperf tool to test bandwidth between a client and the nodes on the new cluster.
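
For example, a minimal run (the server IP below is a placeholder; -P adds parallel streams to better saturate the link):

# on one of the server nodes
iperf -s

# on a client box: 4 parallel streams for 30 seconds
iperf -c <server-ip> -t 30 -P 4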

On the server, you may also be able to run

sar -n DEV

to see network utilization.
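
An interval and count can be given for live sampling instead of reading the daily sysstat file, e.g. once a second for ten samples:

sar -n DEV 1 10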

On the Aerospike side, the number of open file descriptors can be set through proto-fd-max. Were there any warnings in the Aerospike logs about connection limits being reached?
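
A loose search along these lines may surface them (I don’t recall the exact warning text, so treat this as a sketch, and adjust the log path if yours differs):

grep -iE 'fd|file descriptor|connection' /var/log/aerospike/aerospike.log | tail -20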

Did you go back to using 2 AZs or are the 4 new nodes on the same AZ?

@lucien,

The bandwidth requirement of this cluster seems pretty small - each node is currently using under 40 Mbps Rx and under 30 Mbps Tx, so on the 2-node cluster it was probably double that. I actually did measure the bandwidth with dd+nc between a random client and a server, and it was pretty decent (I think I got something like 750 Mbps one way). There’s no way to check that between every server/client pair (I have about 150 clients), but if it were just bad AWS networking, wouldn’t you be seeing this for all of your customers on AWS? Have you observed a bandwidth level at which things start getting bad on AWS? As for the number of connections - is there a number after which it becomes unstable? Currently each of these clients has about 1,200 established connections, which doesn’t seem too high. We actually decreased the client connection pool size used by the Java client some time ago, to reduce the number of connections to each server in case that was causing any issues.
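
(For reference, the dd+nc test was along these lines - a rough sketch, with port 9000 chosen arbitrarily; some netcat variants need "-l -p 9000" instead:)

# on the receiving box
nc -l 9000 > /dev/null

# on the sending box; dd prints the achieved throughput when it finishes
dd if=/dev/zero bs=1M count=1024 | nc <receiver-ip> 9000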

I can look into enabling sar - I currently use glances to gather network I/O stats. After I decreased the number of nodes there was indeed a warning about the fd count being above the limit, which is why I bumped it to 50k; after that I didn’t see the error again, but the high ping times didn’t go away.

I did go back to the set up across 2 AZs to try to limit the number of differences with the original cluster I was migrating from, except now in VPC with two placement groups.

Not understanding what’s causing this issue makes me very nervous, as I am afraid my prod cluster can go into this state again at any time and there’s nothing I can do to prevent or mitigate it when it happens. If you have any other thoughts, or ideas on what information I could collect next time this happens, do let me know - I really appreciate your help!

Thanks!

For the Nth time: you cannot set up a cluster with nodes across AZs and not expect all kinds of weird behavior. Aerospike is a distributed database that is intended to have its nodes physically near each other, with predictable low latency on their network. If you go against that advice on Amazon EC2, you can and will get what you’re seeing.

@rbotzer,

Just to be clear - this time I got this issue WHEN I moved to the recommended cluster setup: one AZ, with VPC and a placement group (and fewer nodes than originally). It was after that that I switched back to the old 2-AZ setup, to recover from the issue…

Also, when this issue happens I get high ping times to the affected nodes from within the same AZ as well, not just across AZs.

Thanks for the additional info. So 150 clients x 1,200 established connections is around 180,000 connections; split across two servers, that’s roughly 90,000 per node. You may need to bump proto-fd-max up to 91,000.
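
Note that the OS file descriptor limit has to allow that too. You can check what the running process actually got with something like this (assuming a single asd process):

cat /proc/$(pgrep -x asd)/limits | grep -i 'open files'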

I’m also wondering if your config file is tuned properly as far as service-threads and transaction-queues go.

For an i2.2xlarge with 8 vCPUs, both service-threads and transaction-queues should be set to 8 (the number of cores).

service-threads controls the number of threads receiving client requests on the network interface.

Please see:

http://www.aerospike.com/docs/reference/configuration/#service-threads

http://www.aerospike.com/docs/reference/configuration/#transaction-queues
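
Putting those together, the relevant service stanza would look something like this (a sketch - the values assume an i2.2xlarge and the connection math above):

service {
    proto-fd-max 91000       # ~90k expected client connections per node, plus headroom
    service-threads 8        # one per vCPU
    transaction-queues 8     # one per vCPU
}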

Just to update the thread - we switched to a single AZ and placement group but continued to see this issue. The current thinking is that this was due to a kernel bug, so we’re updating the OS kernel version. @lucien FYI