Why is the benchmark for multiple nodes in one cluster even slower than for a single node?

I have 3 VMs on the same network. When I run the benchmark (from the C client SDK) against a single node in the cluster, the speed can get up to 2,500 TPS, but when I join a new node, the benchmark drops to about 1,700 TPS, and after I join another node to the cluster, the benchmark is about 1,800 TPS.

That's strange. Multiple nodes in a cluster should be faster than a single node, just like with Couchbase, shouldn't they?

Here is my benchmark command: target/benchmarks -h 10.37.129.8 -p 3000 -n bar2 -k 1000000 -o B:1400 -w RU,50 -g 2500 -T 1000 -z 8
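For reference, here is the same command annotated with my reading of each flag (taken from the benchmark tool's usage text, so corrections are welcome if I have misread any of them):

```sh
#   -h 10.37.129.8   seed host to connect to
#   -p 3000          service port
#   -n bar2          namespace
#   -k 1000000       number of keys
#   -o B:1400        object spec: one 1400-byte binary (blob) bin per record
#   -w RU,50         workload: read/update mix, 50% reads
#   -g 2500          throughput throttle (target TPS)
#   -T 1000          transaction timeout in milliseconds
#   -z 8             number of client threads
target/benchmarks -h 10.37.129.8 -p 3000 -n bar2 -k 1000000 -o B:1400 -w RU,50 -g 2500 -T 1000 -z 8
```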

And here are the results:

1 node: 2014-10-27 00:49:20 INFO write(tps=1303 timeouts=0 errors=0) read(tps=1205 timeouts=0 errors=0) total(tps=2508 timeouts=0 errors=0)

2 nodes: 2014-10-27 01:00:52 INFO write(tps=929 timeouts=0 errors=0) read(tps=934 timeouts=0 errors=0) total(tps=1863 timeouts=0 errors=0)

3 nodes: 2014-10-27 01:14:05 INFO write(tps=955 timeouts=0 errors=0) read(tps=939 timeouts=0 errors=0) total(tps=1894 timeouts=0 errors=0)

I can't upload an image, but the dashboard shows the same TPS as the benchmark results. This has confused me a lot; can someone more experienced tell me the reason?

There is actually a big difference in write performance when going from 1 node to more, when the namespace is configured with replication factor 2 or more. With 1 node, you are forced down to replication factor 1. When you add a second node, the data from the first node rebalances across both nodes (migrations), and every write now has to do twice as much work (write both the master and the replica copy, over the network, before returning to the client).
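If you want to confirm what the namespace is actually set to, the replication factor lives in the namespace stanza of aerospike.conf. A minimal check, assuming the default config path (adjust to your install):

```sh
# Show the replication factor configured for the 'bar2' namespace.
grep -A 10 'namespace bar2' /etc/aerospike/aerospike.conf | grep replication-factor

# Note: with only 1 node in the cluster the effective factor is capped at 1,
# no matter what this says; the replication cost only shows up once a 2nd node joins.
```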

There are a few extra factors to consider, and to rule out as performance impacts, when running such an experiment.

When adding a node, make sure migrations have completed (if there was data on the first node), and make sure there is no bottleneck at the network level between the nodes and that the CPU is not a bottleneck either (this may require some tuning).
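A couple of quick checks are sketched below. The exact statistic names vary a little between server versions, and 10.37.129.9 is just a stand-in for a second node's address, so treat this as a starting point rather than gospel:

```sh
# Migrations: both progress counters should read 0 on every node once
# rebalancing has finished.
asinfo -h 10.37.129.8 -v statistics | tr ';' '\n' | grep -i migrate_progress

# CPU: watch whether the asd process is pegging the cores while the benchmark runs.
top

# Network: raw throughput between two of the VMs (run 'iperf -s' on the peer first).
iperf -c 10.37.129.9
```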

Thanks for your reply. I checked in the AMC dashboard and made sure migration was complete before the benchmark, and as you predicted, the reason multiple nodes slowed down is replication. After I set the replication factor to 1, the benchmark came back to 2,500 TPS (2014-10-29 19:36:27 INFO write(tps=1194 timeouts=0 errors=0) read(tps=1312 timeouts=0 errors=0) total(tps=2506 timeouts=0 errors=0)). I still don't think it is reasonable that replication slows down the server's TPS, but unfortunately that is the fact for the moment.
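In case it helps anyone else, this is roughly the change I made (paraphrasing from memory; the config path and restart command depend on your install):

```sh
# In /etc/aerospike/aerospike.conf on each node, inside the namespace stanza:
#
#   namespace bar2 {
#       replication-factor 1    # was 2, the default
#       ...
#   }
#
# then restart the server on each node:
sudo service aerospike restart
```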

'Reasonable' here is a bit subjective, I think. At least I don't see how replication could be implemented without some performance cost. I would love it if it could, but it's a no-free-lunch world, and these guys seem to be striking a fair balance by delivering replication that is 'reasonably' performant.

In theory, if you have not bottlenecked on the server, you should be able to get comparable throughput (TPS). Latency will for sure be impacted, since each write has to do twice as much work and go over the network to write the replica copy before getting back to the client. That extra latency will have some impact on throughput, everything else being equal.
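A back-of-envelope way to see it, with purely illustrative numbers (not measured from your setup), assuming each benchmark thread keeps one request in flight:

```latex
\text{throughput} \approx \frac{\text{in-flight requests}}{\text{per-op latency}},
\qquad \frac{8}{3\ \text{ms}} \approx 2667\ \text{tps},
\qquad \frac{8}{4\ \text{ms}} = 2000\ \text{tps}
```

So one extra millisecond per write at 8 threads is already in the ballpark of the drop you measured, and adding concurrency is how you claw that throughput back.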

So, you can try tuning further to reach that TPS (add more threads on the benchmark client, or run more benchmark clients altogether) and check for bottlenecks on the server; a sketch of what I mean is below.
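Something along these lines on the client side, for example; the thread count and throttle below are just starting guesses, not recommendations:

```sh
# More client threads so the extra per-write latency is hidden by concurrency,
# and a higher throttle so the client itself is not capping TPS.
target/benchmarks -h 10.37.129.8 -p 3000 -n bar2 -k 1000000 -o B:1400 -w RU,50 -g 10000 -T 1000 -z 32

# Or run a second copy of the benchmark from another machine and add the totals together.
```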