Why is one node in our 4-node cluster having so many more 'queue too deep' issues?

We are experiencing regular DeviceOverload errors on the aerospike client side of our application. We recently tried adding a 4th node to our previously 3-node cluster, and although the errors have decreased, they have not gone away.

After doing some research, I came across a thread suggesting that DeviceOverload errors can be a symptom of write failures in the cluster. With that knowledge, I searched the log on each node to see how many of the “queue too deep” errors had shown up over roughly a 12-hour period:

grep -Rn 'too deep' /var/log/aerospike/aerospike.log | wc -l

Over that time period, the four nodes had the following number of “queue too deep” messages in their logs:

node 1: 263
node 2: 0
node 3: 9,860
node 4: 0
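In case it helps anyone doing the same investigation: you can also bucket the warnings by hour to see whether they cluster in time on the bad node. This is just a sketch that assumes the default Aerospike log timestamp prefix (e.g. `Dec 18 2015 04:21:02 GMT: WARNING ...`); adjust the awk fields if your log format differs:

```shell
# Count "queue too deep" warnings per hour (assumes a
# "Mon DD YYYY HH:MM:SS GMT:" timestamp prefix on each line).
grep 'too deep' /var/log/aerospike/aerospike.log \
  | awk '{ split($4, t, ":"); print $1, $2, t[1] ":00" }' \
  | sort | uniq -c
```

If the counts spike in a narrow window rather than spreading evenly, that points more toward a periodic workload (or defrag) than a steadily failing disk.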

That seems to suggest that the vast majority of write failures are happening on just one node. What would cause this imbalance to occur? Could it just be a bad SSD drive on the one instance? Is there a good way for me to confirm this? I looked at iostat on each of the nodes and the numbers looked pretty similar. Before I go through the trouble of replacing the node with a new EC2 instance, I’d like to confirm that it is indeed a disk issue and not some other problem (load balancing not working properly, etc.).

Just in case the information is helpful, our 4-node cluster consists of identical r3.2xlarge instances with identical SSD instance stores that are 160 GB in size. We have LDT enabled on the cluster and are making use of it exclusively for the data we are storing. When I look at our AMC dashboard, everything looks fine and pretty similar across all four nodes. RAM usage is extremely low and disk usage (from a storage standpoint) is at about 10-15%.

I’d appreciate any help anyone can provide, even if it’s just to point me in the direction of some other things to investigate. If you have any questions or need more information, don’t hesitate to let me know and I’ll provide what I can. Thanks in advance.

LDTs aren’t really recommended. They’re tied to a parent record that owns the LDT metadata and is locked for every write to the LDT.

How is your data spread out? Are some LDT collections used more than others? If most of your writes are going to the same LDT collection, then you’re effectively sending all of those writes to the master node for that LDT parent record, which could explain your imbalance.
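To make the mechanics concrete: Aerospike hashes each record key to a digest (RIPEMD-160 in the real server) and maps it to one of 4096 partitions, and each partition has exactly one master node. Here’s a toy sketch of that idea, with sha1 standing in for RIPEMD-160 and a round-robin partition table standing in for the cluster’s real partition map, just to show that a single hot LDT parent key always lands on the same master no matter how many appends you issue:

```python
import hashlib

PARTITIONS = 4096  # Aerospike's fixed partition count

def partition_id(key: str) -> int:
    # The real server derives this from a RIPEMD-160 digest of set + key;
    # sha1 is only a stand-in here to illustrate the mapping.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:2], "little") % PARTITIONS

def master_node(key: str, nodes: list) -> str:
    # Toy partition->node assignment; the real partition table is
    # negotiated by the cluster, but each partition still has one master.
    return nodes[partition_id(key) % len(nodes)]

nodes = ["node1", "node2", "node3", "node4"]

# 1000 appends to one LDT parent key all hit one master node...
hot = {master_node("hot-ldt-parent", nodes) for _ in range(1000)}

# ...while 1000 distinct top-level keys spread across the cluster.
spread = {master_node("key-%d" % i, nodes) for i in range(1000)}
```

So if one parent record takes the bulk of the traffic, the imbalance you’re seeing on a single node is exactly what this model predicts.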

Thanks for replying. We went with LDT because we are storing fairly large data. I’m not sure how to look up LDT collections specifically, but from aql I can see that we are using one namespace and four sets. The most heavily used set has 461k objects in it, the next has 12k, and the other two are in the hundreds. If each set corresponds to an LDT collection, then yes, there is one LDT collection that is receiving the bulk of the writes. Even though there are 461k objects in that collection, do all writes to that collection go to only one node and not get distributed? Or shouldn’t at least the LDT parent records be distributed across the cluster? Is there documentation somewhere that goes into detail on the LDT collection/master node/parent record relationships? I’ve read the guide and the overview but didn’t see this level of detail. Thanks again for your help.

How many different LDT keys are you writing to? Each LDT requires a parent record that holds the collection info, this parent record’s key is what you target your operations with. Count how many unique keys there are to see how your writes are spread out.

The parent records are locked and updated every time you update the LDT, so if you make 1,000 appends to an LDT list, the parent record takes 1,000 operations, which can create hotspots and stalls.

You might have to look at a different layout for your data, using just top-level records, with a secondary index or separate sets to group them together.
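One common shape for that layout, sketched below with hypothetical names and a hypothetical chunk size (not your actual schema): split each large value into fixed-size chunks stored as ordinary top-level records under derived keys. Each `<base_key>:<n>` chunk hashes independently, so the writes spread across partitions and master nodes instead of all serializing on one parent record:

```python
CHUNK_SIZE = 64 * 1024  # hypothetical; tune relative to your write-block-size

def chunk_keys(base_key: str, data: bytes) -> dict:
    """Split one large value into top-level records keyed by chunk number.

    Unlike an LDT, each "<base_key>:<n>" record hashes to its own
    partition, so appends don't funnel through a single locked parent.
    """
    chunks = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunks["%s:%d" % (base_key, offset // CHUNK_SIZE)] = \
            data[offset:offset + CHUNK_SIZE]
    return chunks

# 5 full chunks plus 1 trailing byte -> 6 independent records.
records = chunk_keys("big-blob", b"x" * (5 * CHUNK_SIZE + 1))
```

You’d typically also keep a small metadata record (e.g. a hypothetical `big-blob:meta`) holding the chunk count so readers know how many pieces to fetch and reassemble.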

I’m not seeing an obvious way to query for the number of LDT keys (if there is one, please let me know), but I do know the number fluctuates dynamically based on our application. We estimate around 300k keys will be targeted every 30 minutes or so. Assuming there are roughly 300k unique keys, how does that show me how writes are spread out?

In order to fit the data we’re storing in the LDT, we break it into pieces, so the number of appends to the list is also dynamic depending on the size of the data being stored. It’s easy to see how that could create the hotspots/stalls you mentioned, although it still isn’t clear to me why it would be isolated to primarily one node.

Aside from moving to a completely different layout for our data, is there anything we can do with LDT to make use of better balancing across our cluster? If we did move to a different data layout, which of the solutions outlined here would you recommend?

Facing a similar issue. We have a 3-node cluster of c3.4xlarge nodes with significant use of map operations. Write block size is 1 MB. We increased max-write-cache from 64M to 256M a day ago and still got a few instances of these errors. The DeviceOverload errors are happening on 2 nodes, with not a single instance of the error on the third node.

Also, the issue does not seem to be related to load: there are times when load is at its peak and the issue does not occur, while at low load the issue still occurs.