Not able to achieve 1Million TPS in Aerospike Benchmarks despite of capable hardware


#1

Hi

I am trying out AeroSpike(AS from now on) for my project. I have read many blogs/articles of AS easily achieving 1M TPS. But I am not able to get those numbers through AS benchmark. I am not sure where is the bottleneck and what could I do to reach those figures. Please suggest. My current specs are listed below:

* AS server hardware: "Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz ", 40 Cores, 128 GB RAM.
* AS conf snapshot: 
service {
	paxos-single-replica-limit 1 
	service-threads 64
	transaction-queues 64
	transaction-threads-per-queue 64
	proto-fd-max 100000
}

namespace test {
	replication-factor 2
	memory-size 60G
	default-ttl 30d # 30 days, use 0 to never expire/evict.

	#storage-engine memory
	storage-engine device {
                device /dev/sss
                scheduler-mode noop
                write-block-size 128K
                data-in-memory true 
        }
}
  • I have around 7 client hosts with varying specs from where I am firing AS benchmark through a Java client. None of those run on same host on which AS server is running. One of those host has specs exactly same as AS server. Benchmark options I am providing is: " -n test -o B:100 -w RU,95 -z 60 --latency 7,1 "

As you can see, I am more read heavy.

Results I am getting is:

With single node, I am getting around 0.45 to 0.5 Million TPS for reads. 
WIth two nodes cluster, I get around 0.7 Million TPS for read.
Max. CPU utilization I see is 16% on AS server.

Please suggest, where is the bottleneck in achieving >=1M TPS. I suspect that it could be client which is not able to generate that many test cases or it could be network bottleneck. But how do I verify that ? Any suggestion will be really helpful.


Benchmarking Aerospike on my MacBook [Getting very poor results]
#2

For the procedures on how we achieved 1M TPS in EC2 see our AWS guide. In particluar, this step may be helpful.

Check that you are not being bound by network. The sar -n DEV command may be useful here.

For service-threads 64 we typically recommend setting this to be the number of cores, so in your case 40. For your configuration and for reads this will be the more important parameter.

You have configured 4096 transaction-threads :smiley: which seems like a lot. transaction-queues and transaction-threads-per-queue we typically recommend setting them both to 8, since you have data-in-memory true these threads will seldom be used.

Though not being measured here, your write performance may be improved by setting write-block-size to 1M.

Which Linux distribution are you using? We have seem better performance on distributions using later kernels as seen in the Amazon guide using Amazon Linux, which uses kernel version 3.14.


#3

Hi @kporter

Thanks for your replies; they are so much helpful :smile:

Your points on service/transaction threads were correct and modifying them has given me better results. But rather than modifying them directly, I used the afterburner script provided by AS and it tuned those configs itself. I hope that’s a correct way as well.

My new result is 1.4M TPS and this is really good. But I have hit another hurdle after continuing further with my experiments.

This time I have created a cluster of two nodes(earlier was just 1 node) and a replication factor of two. But rather than improving, my TPS actually falls to 0.9 M TPS. This is very strange as we of-course expected it to increase.

Another interesting point is that our AS server maxed out with CPU usage percentage of 70-75% and load average of 25-30. So, bottleneck was AS, not the network. I was also monitoring n/w usage through netload tool and it was not hitting max capacity. No, of clients I am using is 6(host)*120(threads).

I have seen this thread and it suggest migration and replication as possible problems. In my case, migration is not an issue as both my nodes start together on a clean slate. But replication is something which we want alongwith clustering in production and will like to work it as expected.

Kindly help on this as how we can achieve better TPS through clustering and replication. Thanks.


#4

If you’re using replication with factor 2 on 2 nodes, each write has to commit to both nodes before returning. That’s double the network traffic and longer transactions which will result in lower TPS.

There’s no way around that, you’re asking the system to do more work so it will perform slower. You can use the async client options or send a flag to not wait for a replica commit that should help with TPS if that’s acceptable in your application.

Two nodes = more availability and scalability as you add more clients.


#5

Thanks @manigandham for the reply.

I understand that replication can slow down stuff. But in my case, read:write ratio is 90:10. Can only that much write cause drop in TPS from 1.4M to 0.9M ? I can’t use async as I want to use aerospike as my primary data store as well and hence some guarantees.

Also, I wanted to check why the Java client doesn’t support reading from replicas(non primary nodes) while the C client does. Is there any tech reason not to do that ?


#6

What @manigandham says is spot on if you are measuring equal read and write throughput. Before you were measuring 95% reads, and I assume that is still the case?

With 95% reads I wouldn’t expect a performance drop. So we need to determine what is the bottleneck here.

Well, your reply just popped up and you have now changed it to 90% reads :).

Let’s try the following:

  1. First, let’s raise the number of fabric-workers from the default of 16 to 40 to match the number of cores. This parameter should be added to the service context where the you find the service-threads. This should help out the write replication.
    1. Repeat the benchmark. Provide output of the following:

sudo perf top —pid=pgrep asd ```

  1. Also benchmark with 100% reads. Again, provide output of the following:

sudo perf top —pid=pgrep asd ```

  1. If the previous doesn’t help or helps but hitting an unknown barrier we can then enable the servers microbenchmarks by running:

asinfo -v “set-config:context=service;microbenchmarks=true”


1. This will cause Aerospike to start dumping additional [performance histograms](http://www.aerospike.com/docs/operations/monitor/v3port/reading_raw_latency_histograms.html#interpreting-the-histogram-data) to the logs.
2. With this in place run the 90:10 benchmark again and, after the benchmark run is finished, provide the last 100 lines from the logs.

#7

It would seem that being able to spread reads across multiple replicas can help in the event of hot keys… But typically if an application has hot keys there will not be just a couple of them. And those keys will be optimistically distributed across the cluster.

On the other hand, in the absence of hot keys, it could be useful to read from both replicas simultaneously and ignore the slower request. This would reduce long tail latencies at the cost of addition network traffic and reduced peak throughput. We do not have this feature in our clients but at least one of our users have reported implementing it in-house with positive results for their use case. This should be less relevant now with our release of 3.5.2. In this release we have observed a significant reduction in long tail latencies due to the following fix:

KVS - AER-3250 - Improved latencies during nsup - use dual tree locks so reads and overwrites are not blocked while reducing a partition


#8

Thanks @kporter for the reply.

I will definitely provide the results for the commands you asked.

I just wanted to confirm when you said “We do not have this feature in our clients” were you talking about hedged/tied requests? Cause i am currently not talking about tail latencies.

Yes you are correct regarding the hot keys, but in our use case we expect huge spikes just on a couple of keys and have to solve/mitigate for it. And we are just keeping it simple and are highly optimistic Aerospike can do it all for us :smile:

I just want our reads to be distributed across replicas Cause in the C client the commit “Implement Per-Transaction Consistency Guarantees: deb5f44880b4f6d5e794b6e6917fab3d52e94b5b” gives us this facility, i think. The client justs asks us to set the read.replica=any for reading from the replicas. Is there a tech reason the java client does not have this feature or Am i missing something here?


#9

This feature has always been there, before it was the default and not it needs to be explicitly enabled.

It hasn’t been a priority to add to the Java client primarily due to the reasons I mentioned before. Apparently it isn’t a trivial feature to port and for most use cases doesn’t offer much benefit and requires the Java client to query additional data about the partition states for each node.

If lacking this feature is determine to be limiting your performance let us know by adding a feature request describing your use case.


#10

Thanks @kporter

In the java client as of now, if the primary node is not available, the data is read of the replica, right? This figuring out that primary is not available, send request to replica is done by the java smart client? Is this understanding correct?

Could you spare some time and elaborate on the problems you envision on supporting reading from replicas as well in the client? And will you be fine if i submitted a pull request?


#11

In the normal Aerospike configuration,

This translates to the cluster taking 1.5 seconds to realize that a node has left and to redistribute partitions.

The Java client by default queries the cluster every seconds for the partition map generation and if that generation has change it will then query for the new map.

So by default after a node has fails there is a 2 second window where reads and writes destined for that node will also fail.

This problem is shared in the C client but as you are correctly pointing out, since the C client can also query the cluster for the locations of replica copies it would likely retry on a node live replica during this window.

The java client will have to wait up to 2 seconds for an updated partition map.

I wouldn’t expect any problems, as I mentioned before, the feature currently hasn’t been a priority. Regarding the pull request, that would depend on our Java client comiters.


#12

I believe there’s something wrong with the Aerospike cluster.

I setup Aerospike server on 2 VPSes to form a cluster (Hardware: 8 x Intel Xeon E5-2630 v2 @ 2.60GHz + 16G RAM + SSD), it’s easy and looks good.

Then I installed PHP clients on 2 other VPSes (same hardware as the server’s), inserted 2 million rows of meaningful data, modified the client benchmark script and then run rw-concurrent.sh to test the cluster.

In the script rw-worker.php, I disabled all write operations, it’s now 100% reads.

I run the test with single server:

sh rw-concurrent.sh -c 2000 -n 50 run-rw-concurrent.log

This loads 2000 concurrent threads with 50 queries in each thread.

The result is:

Writes: 0 with 0 failed (0% success)
Reads: 100000 with 2 failed (99.998% success)
Max runtime: 1.052120923996s, average runtime: 0.050083642959595 total runtime: 100.16728591919s
102000 transactions executed at average 1018.2965332842tps per-process, combined to 96947.031157407tps
Actual runtime: 5.6628000736237s, actual tps: 18012.290505381, actual rps: 353.35875785555rps

So a single serve can provide 18ktps of 100% read capacity. I think I would expect 2x or near 2x of capacity if I start another server. Then I do the test on the cluster of 2 servers, with the same testing script:

sh rw-concurrent.sh -c 2000 -n 50 run-rw-concurrent.log

The result disappoints me:

Writes: 0 with 0 failed (0% success)
Reads: 100000 with 0 failed (100% success)
Max runtime: 1.4297289848328s, average runtime: 0.061245198488236 total runtime: 122.49039697647s
102000 transactions executed at average 832.71833970456tps per-process, combined to 71342.19217912tps
Actual runtime: 5.8575999736786s, actual tps: 17413.275139706, actual rps: 341.60748582894rps

It’s almost the same as one node, even worse.

I understand the cluster would slow down the write performance because of data replication. But why it also slows down the read performance? Is it the issue of Aerospike server, or my configuration?

Here’s the configuration of my server:

service {
        user root
        group root
        paxos-single-replica-limit 1
        pidfile /var/run/aerospike/asd.pid
        service-threads 8
        transaction-queues 8
        transaction-threads-per-queue 4
        proto-fd-max 25000
        fabric-workers 8
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode multicast
                address 239.254.169.78
                port 9918
    
                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace test {
        replication-factor 2
        memory-size 4G
        default-ttl 0 # 30 days, use 0 to never expire/evict.

        storage-engine device {
                file /data/vdc/aerospike/data/test.dat # SSD mounted on /data/vdc
                filesize 16G
        }
}

I also enabled the servers microbenchmarks, and here’s the log on one server of the cluster:

Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads (243623 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000243158) (01: 0000000252) (02: 0000000124) (03: 0000000068)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (04: 0000000008) (05: 0000000006) (06: 0000000007)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (4978 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000004625) (01: 0000000197) (02: 0000000068) (03: 0000000012)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (04: 0000000024) (05: 0000000039) (06: 0000000006) (07: 0000000007)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: proxy (7 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000000003) (01: 0000000003) (02: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (4978 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000004625) (01: 0000000197) (02: 0000000068) (03: 0000000012)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (04: 0000000024) (05: 0000000039) (06: 0000000006) (07: 0000000007)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_cleanup (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000146245)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_net (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000146241) (01: 0000000003) (02: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_net (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003050) (01: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_storage_read (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000146199) (01: 0000000020) (02: 0000000011) (03: 0000000009)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (04: 0000000005) (06: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_storage_open (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000146245)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_tree (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000146244) (03: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_internal (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_internal (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003049) (01: 0000000002)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_start (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_start (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003051)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_q_process (146245 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000146245)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_q_process (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003051)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: q_wait (149303 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000149201) (01: 0000000049) (02: 0000000030) (03: 0000000018)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (05: 0000000002) (06: 0000000003)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: demarshal_hist (176595 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000176300) (01: 0000000169) (02: 0000000083) (03: 0000000035)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (04: 0000000001) (05: 0000000004) (06: 0000000003)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: wt_master_wait_prole (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000002807) (01: 0000000136) (02: 0000000046) (03: 0000000005)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (04: 0000000019) (05: 0000000025) (06: 0000000006) (07: 0000000007)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_prole (2931 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000002930) (03: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_resolve (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_resolve (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: reads_resolve_wait (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: writes_resolve_wait (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: error (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: batch_q_process (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: info_tr_q_process (27310 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000027309) (02: 0000000001)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: info_q_wait (27310 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000027047) (01: 0000000131) (02: 0000000108) (03: 0000000024)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: info_post_lock (27310 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::154)  (00: 0000021301) (01: 0000003286) (02: 0000002218) (03: 0000000444)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (04: 0000000052) (05: 0000000009)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: info_fulfill (27311 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000027302) (01: 0000000009)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: write_storage_close (3051 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003051)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: write_sindex (0 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: defrag_storage_close (3084 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000003084)
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::137) histogram dump: prole_fabric_send (2931 total) msec
Apr 08 2015 12:20:14 GMT: INFO (info): (hist.c::163)  (00: 0000002930) (03: 0000000001)

I have no idea where the bottleneck is.


#13

Just as a general note, the script rw-concurrent.sh provided in the PHP package is designed to simulate a webserver load. It launches separate processes not threads, similar to say Apache in prefork mode. Each process has a PHP intepreter running a (single-threaded) program that makes n requests. Since these are separate processes assume that no more than the number of processes equal to the number of cores on your machine will be executing in parallel, at most.

There are more realistic ways to check for the performance if you are using the PHP client, mainly setting up a server with a simple PHP script, then using any of the load generating tools from ab to JMeter, etc.


#14

I believe the PHP clients are unable to push more load. If you are trying to find the peak TPS for the server in your environment, I would recommend using a client that can push more load, such as the Java client. If you are trying to see how your PHP application will scale then you may have already reached its peak load and need to add additional client machines.


#15

Also, please note that the 1M TPS has been observed only on bare metal and Amazon Cloud. Any other virtual private server provider may not be able to provide the required performance.


#16

@kporter - I have just been using afterburner.sh as described in your reference “this step” for the new 3.5.8 version.

Seemingly the helper_afterburner.sh script has been updated to handle more CPUs that older 3.x versions, but in order to handle 32 I still had to add a few lines manually. However with our 56cpu machines the input format leads to “Value too large for defined data type”. Seeing this issue I updated the script to set smp_affinity_list rather than smp_affinity. This worked all well up until CPU number 55, but when starting with “1” (like the original afterburner script) rather than “0” the last CPU(56) can’t be set, since obviously 56 can’t be the value for /proc/irq/nnn/smp_affinity_list in our case.

So my question is - Do you claim it’s correct to start the settings with “1” rather than “0” for the first CPU and if so, what to do with the last CPU? Should I just set that one to “0” instead?

Any clarifications would be great, and I suppose the afterburner.sh (& helper_) scripts may need an update?


#17

You should also note that in recent versions of the Linux kernel, irqbalance now seems to balance network interrupts across cores. There was a flaw in older editions where it did not do this. I am unsure when this happened, but we noticed this in the 3.10 Linux kernel and have anecdotal evidence that suggests it is also in recent 2.7 Linux kernels. What this means is that you no longer need to use the afterburner, simply make sure that irqbalance is on. I normally check the file /proc/interrupts to make sure that it is balancing interrupts across the cores. You also need to make sure that your NIC has as many queues as possible. NICs with only 1 queue can’t be balanced. Some will have 8. We have seen the newest NICs now will have as many queues as you have cores.

In addition you should make sure that the number of service-threads is set at the number of cores on the box (56 in your case).

During your tests, I recommend using a command like “watch -d -n 5 cat /proc/interrupts”. The -d will highlight values that have changed. You may need an extraordinarily wide terminal (and small font) to view this properly. If things are balanced, the interrupts on the network device should vary across different cores.


#18

However with our 56cpu machines the input format leads to “Value too large for defined data type”.


#19

The next release, 3.12.0, will address this issue.


#20

Aerospike 3.12 includes a new “auto-pin” feature which localizes threads to CPUs more optimally, and routes interrupts to cores more optimally, increasing performance. We see DRAM-based performance increases across the board, with the largest coming with batch read optimizations. Please consider re-doing any performance measurements with the 3.12 release.