We are experiencing regular DeviceOverload errors on the Aerospike client side of our application. We recently added a 4th node to our previously 3-node cluster, and although the errors have decreased, they have not gone away.
After doing some research, I came across a thread suggesting that DeviceOverload errors can be a symptom of write failures in the cluster. With that in mind, I searched the log on each node to see how many of the “queue too deep” errors had shown up over roughly a 12-hour period:
grep -Rn 'too deep' /var/log/aerospike/aerospike.log | wc -l
Over that time period, the four nodes had the following numbers of “queue too deep” messages in their logs:
node 1: 263
node 2: 0
node 3: 9,860
node 4: 0
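To see whether node 3’s failures are steady or bursty, I’m also planning to bucket the warnings by hour, along these lines (a rough sketch that assumes the default Aerospike log timestamp format, e.g. “Jan 05 2016 12:34:56 GMT: WARNING …”):

grep 'queue too deep' /var/log/aerospike/aerospike.log \
  | awk '{ print $1, $2, $3, substr($4, 1, 2) ":00" }' \
  | sort | uniq -c

(Fields 1–3 are the date and field 4 is the time, so this prints a count of warnings per hour.)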
That seems to suggest that the vast majority of write failures are happening on just one node. What would cause this imbalance? Could it just be a bad SSD on that one instance? Is there a good way for me to confirm this? I looked at iostat on each of the nodes and the numbers looked pretty similar. Before I go through the trouble of replacing the node with a new EC2 instance, I’d like to confirm that it is indeed a disk issue and not some other problem (load balancing not working properly, etc.).
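For reference, the iostat check was essentially the following, comparing await and %util on the suspect node against the healthy ones over a minute of samples:

iostat -x 5 12

One more thing I may try is grepping the server’s own per-device write-queue reporting; my understanding (possibly version-dependent, so treat the exact token as an assumption on my part) is that the periodic device info lines include the current write queue depth as something like “w-q”:

grep 'w-q' /var/log/aerospike/aerospike.log | tail -5

A write queue that stays high on one device while the others sit at zero would point pretty squarely at that disk.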
Just in case the information is helpful, our 4-node cluster consists of identical r3.2xlarge instances with identical 160 GB SSD instance stores. We have LDT (Large Data Types) enabled on the cluster and use it exclusively for the data we are storing. When I look at our AMC dashboard, everything looks fine and pretty similar across all four nodes. RAM usage is extremely low and disk usage (from a storage standpoint) is at about 10–15%.
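To rule out a distribution problem more directly than eyeballing AMC, I was also going to compare per-node object counts with asadm (using the standard “info” command; the exact output layout varies by version):

asadm -e "info namespace"

If master and replica objects come out roughly even across the four nodes, I’d take that as evidence the skew is a disk problem rather than partition balance.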
I’d appreciate any help anyone can provide, even if it’s just to point me in the direction of some other things to investigate. If you have any questions or need more information, don’t hesitate to let me know and I’ll provide what I can. Thanks in advance.