Where is the bottleneck? Running Aerospike on NVMe

hedayati · April 11, 2018, 9:09pm

I am trying to run a local instance of Aerospike server on a 2 x E5-2630 v3 (32 HT) machine. The attached NVMe-oF device (/dev/nvme0n1) is capable of over 2M IOPS for 1K reads, verified with FIO (I am using B:100 in my benchmarks, only reads).

service {
	paxos-single-replica-limit 1

	work-directory run/work
	pidfile run/asd.pid

	service-threads 16
	transaction-queues 8
	transaction-threads-per-queue 8
	proto-fd-max 100000
}

namespace test {
	replication-factor 2
	memory-size 1G
	default-ttl 30d # 30 days, use 0 to never expire/evict.

	storage-engine device {
		# Use one or more lines like those below with actual device paths.
		device /dev/nvme0n1

		# The 2 lines below optimize for SSD.
		scheduler-mode noop
		write-block-size 4K
		data-in-memory false
	}
}

I tried running the benchmark both locally and on a different node. In both cases, I get less than 1M TPS (local: ~700K, remote: ~600K).

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6441 root      20   0 4285896 290072   6272 S  2051  0.4 413:54.88 asd                    
 6981 root      20   0 4853380   9612   4260 S  1088  0.0  15:29.61 benchmarks

And this is what perf top -p asdpid looks like:

   6.82%  [kernel]            [k] queued_spin_lock_slowpath
   1.60%  [kernel]            [k] _raw_spin_lock_irqsave
   1.49%  [kernel]            [k] tcp_ack
   1.47%  [kernel]            [k] __fget
   1.23%  libc-2.21.so        [.] __memcmp_sse4_1
   1.18%  [kernel]            [k] _raw_spin_lock
   1.17%  libpthread-2.21.so  [.] pthread_mutex_lock
   1.07%  [kernel]            [k] try_to_wake_up
   1.01%  asd                 [.] as_index_get_v

Edited for clarification: I am not running on a real SSD. The benchmark is running on an NVMe-oF device backed by DRAM on the target side (the bandwidth is limited by the RDMA over IB link, and latencies are on the order of 10 usec).

Albot · April 11, 2018, 9:46pm

What do the histograms look like?

hedayati · April 12, 2018, 2:30pm

Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:139) histogram dump: {test}-read (36165690641 total) msec
Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:156)  (00: 22963278903) (01: 5763021768) (02: 3246872646) (03: 2691991850)
Apr 12 2018 14:11:08 GMT: INFO (info): (hist.c:165)  (04: 1500507183) (05: 0000018057) (06: 0000000234)

Is this what you are asking for?
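
For readers following along, the dump above can be turned into cumulative percentages with a few lines of Python. This is only a sketch: it assumes the usual Aerospike convention that bucket n counts transactions that completed in up to 2^n ms (so bucket 00 is under 1 ms), and the helper names parse_histogram and cumulative_shares are made up for illustration.

```python
import re

# Histogram lines copied from the dump above, with the log prefixes dropped.
DUMP = """\
histogram dump: {test}-read (36165690641 total) msec
 (00: 22963278903) (01: 5763021768) (02: 3246872646) (03: 2691991850)
 (04: 1500507183) (05: 0000018057) (06: 0000000234)
"""


def parse_histogram(text):
    """Return (total, {bucket_index: count}) parsed from a histogram dump."""
    total = int(re.search(r"\((\d+) total\)", text).group(1))
    buckets = {int(i): int(n) for i, n in re.findall(r"\((\d+): (\d+)\)", text)}
    return total, buckets


def cumulative_shares(total, buckets):
    """Yield (upper_bound_ms, cumulative_fraction) per bucket.

    Assumption: bucket i covers latencies up to 2**i milliseconds.
    """
    running = 0
    for index in sorted(buckets):
        running += buckets[index]
        yield 2 ** index, running / total


total, buckets = parse_histogram(DUMP)
for upper_ms, share in cumulative_shares(total, buckets):
    print(f"<= {upper_ms:>3} ms: {share:7.2%}")
```

Under that bucket assumption, a bit under two thirds of the reads land in the first bucket and the rest took 1 ms or longer, which is worth comparing against the device latency quoted in the first post. Note that the counts are totals accumulated by the server, so they may mix several benchmark runs.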

pgupta · April 12, 2018, 3:07pm

32 hyperthreads / 1K objects: can you try service-threads=32, transaction-queues=32 and transaction-threads-per-queue=3?

hedayati · April 12, 2018, 3:31pm

Doesn’t seem to help (no significant change in TPS).

An additional data point that may help: when I set data-in-memory true, I can get 1.2M IOPS and I no longer see queued_spin_lock_slowpath in the perf top results. So whatever the bottleneck is, it has to be related to flash (either some part of Aerospike or the Linux storage stack).

I am using https://github.com/aerospike/aerospike-server@8f0bb73.

pgupta · April 12, 2018, 3:47pm

Per the discussion thread "FAQ - What is the purpose of setting the disk scheduler?", can you try scheduler-mode none?

hedayati · April 12, 2018, 4:57pm

Still no change after setting the scheduler to none.

hedayati · April 12, 2018, 7:40pm

After a little more profiling, it seems that the cf_queue_pop and cf_queue_push calls in ssd_fd_get() and ssd_fd_put() are not scalable enough to handle the load. Any thoughts on how I might get around that? (Since I couldn't find anyone else having similar issues, I tend to think there should be a fix through configuration changes, e.g., the number of threads somewhere?)

rbotzer · April 12, 2018, 8:50pm

I’m going to question this assertion: "2M IOPs for 1K IOs reads – verified by FIO". FIO is not able to generate the type of workloads that predict the sustainable performance of an SSD used by a database. How long did your FIO test run? SSDs use all sorts of caching tricks, over-provisioning among them. A test of the SSD should run for enough hours at a sustained rate to truly understand what you can expect of it.

For this reason, Aerospike created the ACT tool, as a way to benchmark SSDs under sustained database workloads (in particular ones that simulate Aerospike). You haven’t mentioned which SSD you’re using, so it would be good for you to run ACT and confirm the drive capability before moving on to using asbenchmark (included in the latest tools package).

hedayati · April 12, 2018, 9:22pm

I am not running on a real SSD. The benchmark is running on an NVMe-oF device backed by DRAM on the target side (the bandwidth is limited by the RDMA over IB link, and latencies are on the order of 10 usec).

Thanks for the ACT pointer. Will take a look and report back.

hedayati · April 13, 2018, 1:59pm

I created a per-thread pool of open file descriptors for ssd_fd_get() and ssd_fd_put(), and now the lock bottleneck is gone. With this change, I can get around 900K IOPS before occupying all 32 hyperthreads.

Do these numbers make sense? How many IOPS would you expect to get with this number of HTs on a fast SSD?
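
The per-thread pool idea described in this post can be illustrated with a small sketch. This is not the patch from the thread, which was made in C inside Aerospike's ssd_fd_get() and ssd_fd_put() and is not shown here. It is a Python illustration of the same technique, and the class PerThreadFdPool, its methods and the demo function are all hypothetical names. Instead of every thread popping from and pushing to one shared, locked queue, each thread keeps its own free list of descriptors.

```python
import os
import tempfile
import threading


class PerThreadFdPool:
    """Hypothetical per-thread cache of open descriptors for one device path."""

    def __init__(self, path, flags=os.O_RDONLY):
        self._path = path
        self._flags = flags
        self._local = threading.local()  # each thread sees its own attributes

    def _free_list(self):
        # Lazily create the calling thread's private free list.
        if not hasattr(self._local, "free"):
            self._local.free = []
        return self._local.free

    def get(self):
        """Take a descriptor from this thread's list, opening one if it is empty."""
        free = self._free_list()
        return free.pop() if free else os.open(self._path, self._flags)

    def put(self, fd):
        """Return a descriptor to this thread's list; no shared lock is involved."""
        self._free_list().append(fd)

    def drain(self):
        """Close every descriptor cached by the calling thread."""
        free = self._free_list()
        while free:
            os.close(free.pop())


def demo(n_threads=4, reads_per_thread=100):
    """Run several reader threads and report which descriptors each one used."""
    seen = {}
    with tempfile.NamedTemporaryFile() as backing:  # stand-in for the device
        backing.write(b"x" * 1024)
        backing.flush()
        pool = PerThreadFdPool(backing.name)

        def worker(name):
            used = set()
            for _ in range(reads_per_thread):
                fd = pool.get()
                os.pread(fd, 16, 0)  # positional read, as a storage engine would do
                used.add(fd)
                pool.put(fd)
            pool.drain()
            seen[name] = used

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return seen
```

Calling demo() should show each worker reusing a single descriptor for all of its reads. The trade-off of this design is that the number of open descriptors grows with the number of threads, and a thread has to close its own cached descriptors before it exits.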
