Where is the bottleneck? Running Aerospike on NVMe

I am trying to run a local instance of Aerospike server on a 2 x E5-2630 v3 (32 HT) machine. The attached NVMe-oF device (/dev/nvme0n1) is capable of over 2M IOPS for 1K read IOs – verified by FIO (I am using B:100 in my benchmarks, only reads).

service {
	paxos-single-replica-limit 1

	work-directory run/work
	pidfile run/asd.pid

	service-threads 16
	transaction-queues 8
	transaction-threads-per-queue 8
	proto-fd-max 100000
}

namespace test {
	replication-factor 2
	memory-size 1G
	default-ttl 30d # 30 days, use 0 to never expire/evict.

	storage-engine device {
		# Use one or more lines like those below with actual device paths.
		device /dev/nvme0n1

		# The 2 lines below optimize for SSD.
		scheduler-mode noop
		write-block-size 4K
		data-in-memory false
	}
}

I tried running the benchmark both locally and on a different node. In both cases, I get less than 1M TPS (local: ~700K, remote: ~600K).

 6441 root      20   0 4285896 290072   6272 S  2051  0.4 413:54.88 asd                    
 6981 root      20   0 4853380   9612   4260 S  1088  0.0  15:29.61 benchmarks 

And this is what perf top -p asdpid looks like.

   6.82%  [kernel]            [k] queued_spin_lock_slowpath
   1.60%  [kernel]            [k] _raw_spin_lock_irqsave
   1.49%  [kernel]            [k] tcp_ack
   1.47%  [kernel]            [k] __fget
   1.23%  libc-2.21.so        [.] __memcmp_sse4_1
   1.18%  [kernel]            [k] _raw_spin_lock
   1.17%  libpthread-2.21.so  [.] pthread_mutex_lock
   1.07%  [kernel]            [k] try_to_wake_up
   1.01%  asd                 [.] as_index_get_v

Edited for clarification: I am not running on a real SSD. The benchmark is running on an NVMe-oF device backed by DRAM on the target side (the bandwidth is limited by the RDMA over IB link, and the latencies are on the order of 10 usec).

What do the histograms look like?

Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:139) histogram dump: {test}-read (36165690641 total) msec
Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:156)  (00: 22963278903) (01: 5763021768) (02: 3246872646) (03: 2691991850)
Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:165)  (04: 1500507183) (05: 0000018057) (06: 0000000234)

Is this what you are asking for?
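
For anyone reading along, here is a minimal Python sketch for turning the histogram dump above into per-bucket percentages. The counts are copied from the log lines in this thread; the helper names (parse_histogram, bucket_shares) are made up for this example. My understanding, which is an assumption and not something stated in this thread, is that bucket 00 counts reads under 1 msec and each later bucket doubles the upper bound.

```python
import re

# Histogram lines copied from the log excerpt in this thread
# (timestamps and the "(hist.c:NNN)" prefixes removed).
DUMP = """\
histogram dump: {test}-read (36165690641 total) msec
 (00: 22963278903) (01: 5763021768) (02: 3246872646) (03: 2691991850)
 (04: 1500507183) (05: 0000018057) (06: 0000000234)
"""


def parse_histogram(text):
    """Return (total, {bucket_index: count}) parsed from a histogram dump."""
    total = int(re.search(r"\((\d+) total\)", text).group(1))
    # Bucket entries look like "(03: 2691991850)".
    buckets = {int(i): int(n) for i, n in re.findall(r"\((\d+): (\d+)\)", text)}
    return total, buckets


def bucket_shares(text):
    """Return {bucket_index: fraction of all reads that landed in that bucket}."""
    total, buckets = parse_histogram(text)
    return {i: n / total for i, n in sorted(buckets.items())}


if __name__ == "__main__":
    total, buckets = parse_histogram(DUMP)
    for i, share in bucket_shares(DUMP).items():
        print(f"bucket {i:02d}: {buckets[i]:>12d}  {share:8.4%}")
```

The bucket counts add up exactly to the reported total, and roughly 63% of the reads land in bucket 00.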

With 32 hyperthreads and 1K objects, can you try service-threads 32, transaction-queues 32, and transaction-threads-per-queue 3?

Doesn’t seem to help (no significant change in TPS).

An additional data point that may help: when I set data-in-memory true, I can get 1.2M IOPS and I don’t see queued_spin_lock_slowpath in the perf top results. So whatever the bottleneck is, it has to be related to Flash (either some part of Aerospike or the Linux storage stack).

I am using https://github.com/aerospike/aerospike-server at commit 8f0bb73.

Per this discussion thread, "FAQ - What is the purpose of setting the disk scheduler?", can you try scheduler-mode none?

Still no change after setting the scheduler to none.

After a little more profiling, it seems like the cf_queue_pop and cf_queue_push calls in ssd_fd_get() and ssd_fd_put() are not scalable enough to handle the load. Any thoughts on how I might get around that? (Since I couldn’t find anyone else having similar issues, I tend to think there should be a fix through configuration changes, e.g., the number of threads somewhere?)

I’m going to question this assertion: "2M IOPs for 1K IOs reads – verified by FIO". FIO is not able to generate the type of workload that predicts the sustainable performance of an SSD used by a database. How long did your FIO test run? SSDs have all sorts of caching tricks, over-provisioning among others. A test of the SSD should run for enough hours at a sustained rate to truly understand what you can expect of it.

For this reason, Aerospike created the ACT tool, as a way to benchmark SSDs under sustained database workloads (in particular ones that simulate Aerospike). You haven’t mentioned which SSD you’re using, so it would be good for you to run ACT and confirm the drive capability before moving on to using asbenchmark (included in the latest tools package).

I am not running on a real SSD. The benchmark is running on an NVMe-oF device backed by DRAM on the target side (the bandwidth is limited by the RDMA over IB link, and the latencies are on the order of 10 usec).

Thanks for the ACT pointer. Will take a look and report back.

I created a per-thread pool of open file descriptors for ssd_fd_get() and ssd_fd_put(), and now the lock bottleneck is gone. With this change, I can get around 900K IOPS before occupying all 32 hyperthreads.
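
To illustrate the idea (this is not the actual patch, and not Aerospike code), here is a minimal Python sketch of a per-thread descriptor pool. The class PerThreadFdPool and the demo function are hypothetical names for this example. Each thread opens its own descriptor once and reuses it, so the hot path never touches a shared queue or lock. Python is used only to show the structure; it says nothing about how the C version scales. os.pread is assumed to be available (Unix-like systems).

```python
import os
import tempfile
import threading


class PerThreadFdPool:
    """Hypothetical sketch: one open descriptor per thread, no shared queue."""

    def __init__(self, path, flags=os.O_RDONLY):
        self._path = path
        self._flags = flags
        self._local = threading.local()   # per-thread storage for the fd
        self._lock = threading.Lock()     # taken only when a new fd is opened
        self._all_fds = []                # kept so close_all() can clean up

    def get(self):
        """Return this thread's descriptor, opening it on first use."""
        fd = getattr(self._local, "fd", None)
        if fd is None:
            fd = os.open(self._path, self._flags)
            self._local.fd = fd
            with self._lock:
                self._all_fds.append(fd)
        return fd

    def close_all(self):
        """Close every descriptor; call only after all workers have finished."""
        with self._lock:
            for fd in self._all_fds:
                os.close(fd)
            self._all_fds.clear()


def demo(num_threads=4):
    """Each worker reads 1 KiB from its own offset using its own descriptor."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (1024 * num_threads))
        path = f.name
    pool = PerThreadFdPool(path)
    seen, seen_lock = [], threading.Lock()

    def worker(k):
        fd = pool.get()
        reused = pool.get() == fd          # second call hits the per-thread slot
        data = os.pread(fd, 1024, 1024 * k)
        with seen_lock:
            seen.append((fd, reused, len(data)))

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    pool.close_all()
    os.unlink(path)
    return seen


if __name__ == "__main__":
    for fd, reused, nbytes in demo():
        print(f"fd={fd} reused={reused} bytes={nbytes}")
```

The trade-off in a design like this is one open descriptor per thread per device instead of a small shared set, in exchange for no contention on get and put.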

Do these numbers make sense? How many IOPS would you expect to get with this number of HTs on a fast SSD?
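
For a rough sense of scale, here is a small Python sketch of the arithmetic using only figures quoted in this thread (900K IOPS on 32 hyperthreads, 2M IOPS from FIO, latency on the order of 10 usec). The function names are made up for this example, and treating the latency as exactly 10 usec is an assumption.

```python
def per_thread_rate(total_iops, hw_threads):
    """Average operations per second handled by each hardware thread."""
    return total_iops / hw_threads


def cpu_time_per_op(total_iops, hw_threads):
    """CPU seconds spent per operation if every hardware thread is fully busy."""
    return hw_threads / total_iops


def in_flight(iops, latency_s):
    """Little's law: mean outstanding requests = throughput x latency."""
    return iops * latency_s


if __name__ == "__main__":
    # Figures from the thread: ~900K IOPS with all 32 hyperthreads occupied.
    print(f"ops/s per hyperthread: {per_thread_rate(900_000, 32):.0f}")
    print(f"CPU usec per op:       {cpu_time_per_op(900_000, 32) * 1e6:.1f}")
    # Assumed latency of 10 usec at the 2M IOPS measured with FIO.
    print(f"IOs in flight at 2M:   {in_flight(2_000_000, 10e-6):.0f}")
```

That works out to about 28K reads per second per hyperthread, or roughly 36 usec of CPU time per read when all 32 hyperthreads are saturated.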