Performance degradation as client count increases


#1

I’m currently evaluating Aerospike with YCSB, based on the article at http://www.aerospike.com/blog/aerospike-doubles-in-memory-nosql-database-performance/

I’ve been evaluating multiple configurations, mostly with 2 replicas on 2 servers (2×6-core, hyperthreading disabled, 96 GB RAM). We have up to 4 identical machines (with hyperthreading on) available as clients in the same rack, and are getting reasonably good performance: 400k TPS on workload B (95/5 read/write) in pure memory. Initial configuration was done with afterburner.sh, and adjusting settings manually has only made performance worse, so I assume those settings are good.
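For reference, an invocation for this kind of test looks roughly like the following. This is a sketch only; the host address, namespace, and thread count shown are assumptions, not the exact values used here (the Aerospike YCSB binding takes its connection settings via `as.*` properties):

```shell
# Load phase, then run workload B (95% reads / 5% updates) against the
# Aerospike binding. Host, namespace, and thread count are illustrative.
./bin/ycsb load aerospike -P workloads/workloadb \
    -p as.host=192.168.1.10 -p as.namespace=test -threads 64
./bin/ycsb run aerospike -P workloads/workloadb \
    -p as.host=192.168.1.10 -p as.namespace=test -threads 64
```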

My current concern is that performance degrades significantly as we add clients. With a peak of roughly 150k TPS on a 50/50 read/write workload using 3 clients, we drop to 100k with 4 clients. I have investigated the network as a possible cause and found no dropped packets, and the Aerospike logs show nothing unusual. top, iotop, and ifconfig do not indicate CPU or network as the source of the slowdown.

Is performance degradation to be expected once we surpass the maximum throughput for a given configuration?

Also, is there any explanation for why we are not seeing figures close to those in the aforementioned article?


#2

I suggest simplifying the problem first. Go to a single node with a replication factor of 1; this removes inter-node communication as a variable, and you should see somewhat better numbers than in the 95%/5% case. Also, run YCSB with a single client to make sure you understand the throughput limit of a single one.
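Concretely, that means a namespace stanza along these lines. This is a sketch, assuming an in-memory namespace named "test" with an illustrative memory-size; the only setting that matters for this test is replication-factor:

```
# aerospike.conf fragment (sketch): single-copy, in-memory namespace.
namespace test {
    replication-factor 1
    memory-size 80G
    storage-engine memory
}
```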

However, there are a few other things I want to check as well. Please forgive me if these seem like obvious checks, but we run into them often enough that I want to be complete.

  • There are often retry-count settings such as insertretrycount (check everything matching *retrycount); these should be set to zero. One issue with YCSB is that it handles retries itself, but with Aerospike the smart client already does that. The retry counts will often trigger retries even on successful operations, multiplying the number of attempts the client makes: many requests hit the server but count as a single operation in YCSB. All of these extra requests cause performance problems, and my initial guess is that this is your issue. It only gets worse as you add clients.
  • How many threads are you using on each client? I often use 64; you can go higher on more powerful machines.
  • I am assuming the servers are not VMs. Your numbers are already much higher than I would normally expect from a VM.
  • What are the sizes of the objects you are testing? Are you using the small ones listed in the article? Anything around 120 bytes should work pretty much the same way.
  • How many objects are you testing with? Use something realistic, but we have found that you need at least 1 M records in the database or you will run into hotkey issues; we have seen people testing against a single record. The article specified 50 M.
  • How many network queues are on your network card? The afterburner script attempts to balance these queues across cores; my guess, given your numbers, is that you have at least 8. Check the load on each core and look at the “si” (softirq) column. If that work is spread across 8 cores (matching the queues), you may be bottlenecked there, and this will not show up as a bandwidth limitation.
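Pulling the client-side suggestions above into one place, a YCSB property sketch might look like the following. All values are illustrative, and the exact retry-count property names vary by YCSB version and binding, so treat the `*retrycount` lines as a pattern to match against your own configuration rather than canonical names:

```
# Disable client-side retries; the Aerospike smart client handles them.
readretrycount=0
updateretrycount=0
insertretrycount=0

# Client threads (can also be passed as -threads on the command line).
threadcount=64

# ~120-byte objects, roughly matching the article's small-record test.
fieldcount=1
fieldlength=120

# At least 1 M records to avoid hotkey effects; the article used 50 M.
recordcount=50000000
operationcount=50000000
```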
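To check the queue count and the softirq balance described above, the standard Linux /proc files are enough; the interface name "eth0" below is an assumption, so substitute your own (see `ip link`):

```shell
# Roughly count the interrupt vectors (queues) assigned to the NIC;
# "eth0" is an assumed interface name.
grep eth0 /proc/interrupts | wc -l

# Per-core softirq counters: if the NET_RX counts grow on only a few
# CPU columns, receive processing is concentrated on those cores even
# when total bandwidth looks fine.
grep -E 'CPU|NET_RX' /proc/softirqs
```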

Please let us know how things go. We are more than happy to help tune your system.


#3

Hi bayoukingpin,

Can you please also look at the question I raised at: Not able to get the required Throughput time?


#4

I missed the reply here, but thank you for the response. We eventually identified a network issue; unfortunately I cannot remember the full details at this time, but it has been resolved.


#5

We’re glad to hear this. Please let us know if you have any more questions.