Zig-zag transaction graph while perf testing Aerospike on SSD with 2 boxes

We are trying to benchmark Aerospike and we are getting a zig-zag TPS graph from the benchmark tool. Box: 20 cores, SSD.

[Graph: TPS over time showing the zig-zag pattern]

Config:
service {
	user root
	group root
	pidfile /var/run/aerospike/asd.pid
}

logging {
	file /var/log/aerospike/aerospike.log {
		context any info
		context migrate debug
	}
}

network {
	service {
		address any
		port 3000
	}
	heartbeat {
		mode mesh
		port 3002
		mesh-seed-address-port 10.34.145.92 3002
		mesh-seed-address-port 10.32.78.156 3002

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace testDisk {
	replication-factor 2
	memory-size 40G
	storage-engine device {
		file /opt/aerospike/data/entities.dat
		filesize 200G
		max-write-cache 128M	
		write-block-size 512K
		scheduler-mode noop
	}
}

We tried decreasing write-block-size to 128K, but then we got "queue too deep" errors, so we increased max-write-cache as a workaround.

We are still not able to get a stable TPS.

Benchmark command: ./run_benchmarks -h 10.34.145.92 -p 3000 -n testDisk -k 40000000 -S 1 -o S:4096 -w RU,0 -z 2 -async -asyncMaxCommands 50 -eventLoops 10 -latency ycsb

Please help us in narrowing down the issue.

It looks like your node is not able to keep up. Increasing max-write-cache just hides the errors; having to raise it is an indication that the storage subsystem is not keeping up. Try capping the throughput until you reach a rate at which the workload can be sustained.

Reducing write-block-size also shrinks the post-write-queue cache, because post-write-queue is counted in write blocks. You can compensate by increasing post-write-queue, which takes some read load off the storage subsystem, and by reducing the throughput.
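
To put rough numbers on the post-write-queue point, here is a small sketch. It assumes the cache is simply post-write-queue × write-block-size per device, and it assumes a post-write-queue of 256 blocks (check the actual value on your version); the variable names are made up for illustration.

```shell
# Assumption: post-write-queue cache per device = block count * write-block-size.
kib=1024
pwq_blocks=256   # assumed post-write-queue setting; verify on your server

bytes_512k=$(( pwq_blocks * 512 * kib ))   # cache size with write-block-size 512K
bytes_128k=$(( pwq_blocks * 128 * kib ))   # cache size with write-block-size 128K

# Block count needed at 128K to cache as many bytes as before the change.
pwq_to_match=$(( bytes_512k / (128 * kib) ))

echo "512K blocks: $(( bytes_512k / kib / kib )) MiB cached"
echo "128K blocks: $(( bytes_128k / kib / kib )) MiB cached"
echo "post-write-queue needed at 128K to match: ${pwq_to_match}"
```

Under these assumptions, going from 512K to 128K blocks cuts the cached data to a quarter unless post-write-queue is raised by the same factor.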

We are using version 3.12 and we did set post-write-queue to 2048 (the maximum), but it still did not stabilise. Also, we are using the async client for benchmarking; how can we reduce throughput in that case? [AFAIK we can set concurrency parameters, but throughput would then be governed by latency.]

We also noticed a pattern: the drops start as soon as disk read ops come into the picture, and the mountain pattern continues from there.

UPDATE: The issue was caused by a wrong configuration recipe. We changed the storage-engine configuration above to use device instead of file, and this seems to work well.
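
For anyone landing here later, a sketch of what the device-based stanza might look like. The device path /dev/sdb is hypothetical, and the remaining values are carried over unchanged from the original config rather than being a tuned recommendation. Note that Aerospike writes to the raw device directly, so it must not hold data you need.

```shell
# Hypothetical raw device path; substitute the SSD actually used on each box.
ssd_device=/dev/sdb

# Build the stanza: 'file' and 'filesize' are replaced by a single 'device' line.
stanza=$(cat <<EOF
namespace testDisk {
	replication-factor 2
	memory-size 40G
	storage-engine device {
		device ${ssd_device}
		max-write-cache 128M
		write-block-size 512K
		scheduler-mode noop
	}
}
EOF
)

# Print the stanza for review; it is not written to aerospike.conf automatically.
printf '%s\n' "$stanza"
```

The script only prints the stanza, so it can be checked and then merged into the config by hand on both nodes.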

Thanks for the update. Here is the doc for the benchmark tool. See specifically the -g option for limiting throughput and -z for controlling the number of threads.
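
As a rough sketch, capping the original run with -g could look like the following. The target of 20000 TPS is a made-up starting point, not a measured value; pick something below the peaks in your graph and raise it in steps until the line stops being flat.

```shell
# Hypothetical cap in transactions per second; tune for your hardware.
target_tps=20000

# Same flags as the original command, plus -g to cap client throughput.
bench_cmd="./run_benchmarks -h 10.34.145.92 -p 3000 -n testDisk -k 40000000 -S 1 -o S:4096 -w RU,0 -z 2 -async -asyncMaxCommands 50 -eventLoops 10 -latency ycsb -g ${target_tps}"

# Print the command for review before running it from the benchmark directory.
echo "$bench_cmd"
```

If the graph is flat at the capped rate, the highest cap that stays flat is a reasonable estimate of what the storage can sustain.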

Check the read-page-cache parameter as well. The file config would have given you some page caching, and I am surprised it makes such a difference for you. Maybe you are a bit limited in RAM and the page cache churn hurts at some point? In that case, using device rather than file would give more consistent and predictable performance.
