You may need to tune the benchmark threads to your client box. Also, are you running the client from a node dedicated to just this client?
If this is only a test environment, could you run the latest version for convenience? Or is there a particular reason you chose this version?
It isn't apples to apples.
The dd command doesn't read any data. The benchmark you are performing must read the data on disk when applying an update (since you could be updating a subset of the bins or adding new bins). There is a replace write policy which bypasses this read, but it looks like the Java benchmark doesn't have an option to enable it.
You are using 1 MB blocks with dd while the server is using 128 KB; try lowering the block size on the dd command to match the server.
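Something like the following sketch, for example (the device path and count are placeholders, and dd should only be pointed at a device or partition that holds no data you care about):

# write 128 KB blocks with direct I/O so the page cache doesn't skew the numbers
dd if=/dev/zero of=/dev/sdX bs=128k count=80000 oflag=direct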
I will follow the steps you asked for and update you on the results.
But I tried the following using the same benchmarking Java client:
I run both the client and the database on the same machine (test environment). I will update Aerospike to the latest version. In the meantime, I ran the test below and wanted to share it so you can shed more light on the results.
Case 1: I partitioned the SSD and mounted it as a Linux file system (ext4):
The TPS I get for 5 threads is as below:
./startAsBenchMark.sh -z 5 -n ssd_test -b 1 -o S:1000 -k 1000000 -readTimeout 1000 -writeTimeout 1000
In general, I wouldn't recommend benchmarking with the client running on the same host as the key-value store, and I would definitely not recommend benchmarking a write load on a single-node cluster (unless the end goal is to run with replication factor 1).
Coming to your observation, this is caused by the page cache the Linux OS leverages when you use a file system. It will give you increased read performance, depending on the data set size, how much of it fits in cache, and the access patterns. Having said that, I would not recommend doing this for production workloads: when the OS has to flush large amounts of data from cache to disk to make room under high-throughput conditions, CPU usage can suffer and that will impact latencies. For predictable and consistent performance, you should in general go with the raw device option.
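You can see the cache effect yourself with a quick sketch like the one below (the file path is a placeholder, and drop_caches must be run as root):

dd if=/path/to/testfile of=/dev/null bs=128k    # first read comes from the SSD
dd if=/path/to/testfile of=/dev/null bs=128k    # second read is served from the page cache and is much faster
sync; echo 3 > /proc/sys/vm/drop_caches         # flush and drop the cache before the next cold run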
I know this is an old thread, but I’d like to build on what others wrote.
The perception of low IOPS with large fio bs values is common. Think about it: it naturally takes longer to write 1 MB than it does to write 128 KB, so you're not going to complete as many blocks per second. For example, at a sustained 500 MB/s a device completes only about 500 writes per second with 1 MB blocks, versus about 4,000 per second with 128 KB blocks, even though the bandwidth is identical. Usually, though, the aggregate throughput over time is better with larger blocks.
We see this especially when people are testing network storage on VMs, because VMs are often IOPS- and bandwidth-throttled. It is common for someone to run a test with bs=1k and be disappointed by the throughput, because the VM is throttling IOPS long before it throttles bandwidth.
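As a rough sketch (the target device, runtime, and queue depth are placeholder values, and writing to a raw device is destructive), comparing two block sizes side by side makes this visible; the small-block run tends to hit the VM's IOPS cap while the large-block run hits its bandwidth cap:

fio --name=bs1k   --filename=/dev/sdX --rw=write --bs=1k   --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based
fio --name=bs128k --filename=/dev/sdX --rw=write --bs=128k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based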
Also, when testing against a filesystem rather than a raw device, the page cache can perturb the results depending on the options used; with fio, setting direct=1 bypasses the cache so you measure the device rather than memory.