Zig Zag transaction graph while perf testing Aerospike on ssd with 2 boxes

Saksham_Agrawal · January 1, 2021, 1:14pm

We are trying to benchmark Aerospike, and the benchmark tool shows a zig-zag TPS graph. Box: 20 cores, SSD. GRAPH: [image]

Config:
service {
	user root
	group root
	pidfile /var/run/aerospike/asd.pid
}

logging {
	file /var/log/aerospike/aerospike.log {
		context any info
		context migrate debug
	}
}

network {
	service {
		address any
		port 3000
	}
	heartbeat {
		mode mesh
		port 3002
		mesh-seed-address-port 10.34.145.92 3002
		mesh-seed-address-port 10.32.78.156 3002

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace testDisk {
	replication-factor 2
	memory-size 40G
	storage-engine device {
		file /opt/aerospike/data/entities.dat
		filesize 200G
		max-write-cache 128M	
		write-block-size 512K
		scheduler-mode noop
	}
}

We tried decreasing write-block-size to 128K, but then we got "queue too deep" errors, so we increased max-write-cache as a workaround.

We are still not able to get a stable TPS.

Benchmark command: ./run_benchmarks -h 10.34.145.92 -p 3000 -n testDisk -k 40000000 -S 1 -o S:4096 -w RU,0 -z 2 -async -asyncMaxCommands 50 -eventLoops 10 -latency ycsb

Please help us in narrowing down the issue.

meher · January 5, 2021, 1:59am

It looks like your node is not able to keep up. Increasing max-write-cache just hides the errors; needing it is an indication that the storage subsystem is not keeping up. Try capping the throughput until you reach a point where you can sustain the workload.

Reducing write-block-size also shrinks the post-write-queue cache, because that queue is sized in write blocks. So you can help the storage subsystem by increasing post-write-queue and by reducing the throughput.
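To put rough numbers on that, here is a sketch of the relevant storage-engine settings with the arithmetic in comments. The post-write-queue default of 256 blocks is my assumption for this server version, and 1024 is only an illustrative value, so check both against the configuration reference for your release.

```
namespace testDisk {
	storage-engine device {
		file /opt/aerospike/data/entities.dat
		filesize 200G
		write-block-size 128K

		# Write queue limit = max-write-cache / write-block-size
		# 128M / 128K = 1024 blocks may queue before writes start failing
		max-write-cache 128M

		# post-write-queue counts write blocks, not bytes.
		# Assumed default of 256 blocks: 256 x 512K = 128M, but 256 x 128K = 32M
		# 1024 x 128K = 128M, the same cache size as before the block size change
		post-write-queue 1024
	}
}
```

This only restores the cache that the smaller block size took away. It does not make the storage faster, so the throughput cap is still the main lever.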

Saksham_Agrawal · January 5, 2021, 9:38am

We are using version 3.12 and we set post-write-queue to 2048 (the maximum), but it still did not stabilise. Also, we are using the async client for benchmarking; how can we reduce throughput in that case? [AFAIK we can set concurrency parameters, but throughput might be governed by the latency.]

Also, we noticed a pattern: the drops start as soon as disk reads come into the picture, and the mountain pattern continues from there.

UPDATE: The issue was caused by the wrong configuration recipe. We changed the storage-engine section of the configuration above to use device instead of file, and this seems to work well.
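For anyone landing here later, the changed namespace stanza would look roughly like the sketch below. /dev/sdb is a placeholder I have made up for illustration; substitute the raw SSD device on your own box. The other values are carried over from the original config, and filesize is dropped because it only applies to file-backed storage.

```
namespace testDisk {
	replication-factor 2
	memory-size 40G
	storage-engine device {
		# Placeholder device path (assumption); use your actual raw SSD device
		device /dev/sdb
		max-write-cache 128M
		write-block-size 512K
		scheduler-mode noop
	}
}
```

Note that Aerospike writes to the raw device directly, so any existing data or filesystem on it will be overwritten.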

meher · January 8, 2021, 4:36am

Thanks for the update. Here is the doc for the benchmark tool. See specifically the -g option for limiting throughput and -z for controlling the number of threads.

Check the read-page-cache parameter as well. The file config would have given you some of that, so I am surprised it is making such a difference for you. Maybe you are a bit limited in RAM and the page cache churn is hurting at some point? In that case, using device rather than file would give more consistent, predictable performance.

system · January 8, 2022, 4:36am

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
