Write speed is not up to the mark when using a raw SSD device

Hi, I am trying to do load testing with the storage namespace configured as below:

namespace ssd_test {
	replication-factor 1
	memory-size 16G

	storage-engine device {
		device /dev/sdc            # raw device; maximum size is 2 TiB
		write-block-size 128K      # adjust block size to make it efficient for SSDs
		data-in-memory false
		max-write-cache 2048M
	}
}

The /dev/sdc device is an SSD (600 GB); cat /sys/block/sde/queue/rotational returns 0. I followed the SSD Initialization doc before starting the test.
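For reference, this is roughly how I check the device and watch it during the test (a sketch only; it assumes the sysstat package is installed for iostat and that sdc is the device under test):

cat /sys/block/sdc/queue/rotational    # 0 means non-rotational, i.e. an SSD
iostat -x 2 /dev/sdc                   # extended per-device statistics every 2 seconds while the benchmark runs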

What I have observed is that when I run the Aerospike benchmarking test, the IOPS is only around 1K (iostat output below):

sdc               0.00     0.00 **1277.50**    8.50  7531.75  4352.00    18.48     4.75    3.69    3.71    0.24   0.78  99.80-->512

sdc               0.00     0.00 1261.50    9.00  7252.50  4608.00    18.67     4.72    3.70    3.73    0.11   0.79  99.75-->512

sdc               0.00     0.00 1274.50    9.00  7527.25  4608.00    18.91     4.73    3.68    3.70    0.22   0.78  99.65-->512

sdc               0.00     0.00 1234.50    8.00  6446.25  4096.00    16.97     4.73    3.82    3.84    0.31   0.80  99.70-->512

sdc               0.00     0.00 1292.00    7.00  5766.75  3584.00    14.40     4.70    3.63    3.65    0.29   0.77  99.60-->512

sdc               0.00     0.00 1288.00    9.00  7547.50  4608.00    18.74     4.73    3.64    3.66    0.28   0.77  99.60-->512

sdc               0.00     0.00 **1215.00**   10.00  8203.75  5120.00    21.75     4.74    3.86    3.88    0.30   0.81  99.50-->512

But when I run the dd command, the IOPS is as below (around 50K):

with dd: dd if=/dev/zero of=/dev/sdc bs=1M status=progress

sdc               0.00 50894.50    0.00  103.00     0.00 216010.00  4194.37   526.40 1563.26    0.00 1563.26   9.71 100.00-->2097.18

sdc               0.00 59668.00    0.00  103.00     0.00 215792.00  4190.14   526.54 3330.39    0.00 3330.39   9.71 100.00-->2095.07

sdc               0.00 51600.00    0.00   97.00     0.00 207002.00  4268.08   527.27 5010.64    0.00 5010.64  10.31 100.00-->2134.04

sdc               0.00 51244.00    0.00   99.50     0.00 214074.00  4302.99   526.76 5239.24    0.00 5239.24  10.05 100.00-->2151.5

sdc               0.00 51228.00    0.00  101.50     0.00 216506.00  4266.13   526.26 5261.77    0.00 5261.77   9.85 100.00-->2133.06

sdc               0.00 59513.00    0.00  100.00     0.00 213740.00  4274.80   526.73 5308.73    0.00 5308.73  10.00 100.00-->2137.4

sdc               0.00 51243.00    0.00   99.50     0.00 216114.00  4344.00   527.23 5293.43    0.00 5293.43  10.05 100.00-->2172

sdc               0.00 50980.00    0.00   99.50     0.00 212278.00  4266.89   527.20 5292.50    0.00 5292.50  10.05 100.00-->2133.45

sdc               0.00 50980.00    0.00   99.50     0.00 214010.00  4301.71   526.92 5312.93    0.00 5312.93  10.05 100.00-->2150.85

sdc               0.00 50980.00    0.00   98.00     0.00 208648.00  4258.12   526.75 5308.73    0.00 5308.73  10.20 100.00-->2129.06

sdc               0.00 50980.00    0.00   96.50     0.00 205390.00  4256.79   526.98 5343.41    0.00 5343.41  10.36 100.00-->2128.39

sdc               0.00 59652.00    0.00   99.50     0.00 212958.00  4280.56   527.05 5391.84    0.00 5391.84  10.05 100.00-->2140.28

sdc               0.00 50980.00    0.00  100.00     0.00 211940.00  4238.80   527.37 5354.20    0.00 5354.20  10.00 100.00-->2119.4

sdc               0.00 50980.00    0.00   99.00     0.00 211872.00  4280.24   527.11 5311.97    0.00 5311.97  10.10 100.00-->2140.12

sdc               0.00 50980.00    0.00  100.00     0.00 211940.00  4238.80   527.01 5290.19    0.00 5290.19  10.01 100.05-->2119.4

Is there anything specific to the SSD configuration that I am missing here?

Should I not be comparing dd operations with Aerospike operations?

Please help me with this.

Regards
Paresh

  1. Could you provide the rest of your config? (Specifically, I am looking for the service context.)
  2. How many clients are you running?
    1. Which benchmark client are you using?
    2. What parameters are you passing to the benchmark client?
    3. Does disk utilization increase if you add another client?

Hi Kevin,

Please find the details you had asked for.

  1. Service context (the default one; I have not changed anything):
service {
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	proto-fd-max 15000
}
  2. I am running AsBenchMark.jar (http://www.aerospike.com/docs/client/java/benchmarks.html), the benchmark client provided on the Aerospike site. I started with 5 threads and one client for testing. The parameters are as follows:
./startAsBenchMark.sh -z 5 -n ssd_test -b 1 -o S:1000 -k 1000000 -readTimeout 1000 -writeTimeout 1000 -w RU,0
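For reference, my reading of these options from the benchmark client documentation (the descriptions are my own notes, so treat them as approximate):

# -z 5                5 client threads
# -n ssd_test         target namespace
# -b 1                1 bin per record
# -o S:1000           object spec: a string bin of 1000 bytes
# -k 1000000          1,000,000 keys
# -readTimeout 1000   read timeout in milliseconds
# -writeTimeout 1000  write timeout in milliseconds
# -w RU,0             read/update workload with 0% reads, i.e. 100% writes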

Yes, disk utilization increases when I add one more client.

Please help me on this. Thank you very much!

Regards
Paresh

And what version of the server?

You are being limited by the client. Try adding more threads to the client:

./startAsBenchMark.sh -z 64 -n ssd_test -b 1 -o S:1000 -k 1000000 -readTimeout 1000 -writeTimeout 1000 -w RU,0

Hi Kevin,

The server version is Aerospike Community Edition build 3.12.1.1.

Regards
Paresh

Hi Kevin,

I could not see any significant improvement in performance; rather, it ate up all my CPU. The iowait went up to 77%.

2017-10-05 15:57:00.914 write(tps=4067 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4067 timeouts=0 errors=0)
2017-10-05 15:57:01.915 write(tps=4393 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4393 timeouts=0 errors=0)
2017-10-05 15:57:02.915 write(tps=4685 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4685 timeouts=0 errors=0)
2017-10-05 15:57:03.916 write(tps=4458 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4458 timeouts=0 errors=0)
2017-10-05 15:57:04.916 write(tps=4465 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4465 timeouts=0 errors=0)
2017-10-05 15:57:05.916 write(tps=4611 timeouts=0 errors=0) read(tps=0 timeouts=0 errors=0) total(tps=4611 timeouts=0 errors=0)

Linux 3.10.0-327.el7.x86_64 (basrv-pfdb1) 	10/05/2017 	_x86_64_	(24 CPU)
03:57:48 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
03:57:50 PM     all      3.79      0.00      1.84     77.61      0.00     16.75
03:57:52 PM     all      4.37      0.00      1.99     77.59      0.00     16.06
03:57:54 PM     all      3.91      0.00      2.03     78.31      0.00     15.75
  1. You may need to tune the benchmark threads to your client box. Also, are you running the client from a node dedicated to just this client?
  2. If this is a test environment only, for convenience purposes, could you run the latest version? Or is there a particular reason you chose this version?
  3. It isn't apples to apples.

    1. The dd command doesn't read any data. The benchmark you are running must read the data on disk when applying an update (in case you are updating a subset of the bins or adding new bins). There is a replace write-policy which will bypass this read; it looks like the Java benchmark doesn't have an option to enable this policy.
    2. You are using 1MB blocks with dd while the server is using 128KB; try lowering the block size on the dd command to match the server (see the dd sketch at the end of this post).
    3. Could you also compare with ACT?
  4. Could you query the server's active configuration?

asadm -e "show config service"
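Regarding the block-size point above, a dd run closer to what the server is doing might look like the line below. This is only a sketch: oflag=direct is my addition so the page cache does not skew the comparison, count merely bounds the run, and writing to the raw device destroys whatever is on it.

dd if=/dev/zero of=/dev/sdc bs=128k oflag=direct count=100000 status=progress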

Hi Kevin,

Thank you for the guidance.

I will do the steps you have asked for and update you about the same.

But I tried the following using the same benchmarking Java client:

I run both the client and the DB on the same machine (test environment). I will update Aerospike to the latest version. But I did the test below and thought of updating you so that you can shed more light on it.

Case 1: I partitioned the SSD disk and mounted it as a Linux file system (ext4). The TPS I get for 5 threads is as below:

./startAsBenchMark.sh -z 5 -n ssd_test -b 1 -o S:1000 -k 1000000 -readTimeout 1000 -writeTimeout 1000

         2017-10-06 10:00:26.003 write(tps=37595 timeouts=0 errors=0) read(tps=37227 timeouts=0 errors=0) total(tps=74822 timeouts=0 errors=0)
         2017-10-06 10:00:27.003 write(tps=36130 timeouts=0 errors=0) read(tps=36126 timeouts=0 errors=0) total(tps=72256 timeouts=0 errors=0)
         2017-10-06 10:00:28.003 write(tps=36985 timeouts=0 errors=0) read(tps=37423 timeouts=0 errors=0) total(tps=74408 timeouts=0 errors=0)
         2017-10-06 10:00:29.004 write(tps=38310 timeouts=0 errors=0) read(tps=38137 timeouts=0 errors=0) total(tps=76447 timeouts=0 errors=0)

Case 2: When I choose the raw device as the storage option, I could not get anywhere near the above speed; the throughput was much lower.

Regards

Paresh

In general, I wouldn't recommend benchmarking with the client running on the same host as the key-value store, and I would definitely not recommend benchmarking a write load on a single-node cluster (except if the end goal is to run with replication factor 1).

Coming to your observation, this is caused by the cache that the Linux OS leverages when there is a file system, so it will give you increased performance for reads (depending on the data set size, how much of it fits in cache, and the access patterns). Having said that, I would not recommend doing this for production workloads: in situations where the OS has to shuffle a large amount of data from cache to disk to make space under high throughput, the CPU can suffer and this impacts latencies. For predictable and consistent performance, you should in general go with the raw device option.

Here is another topic with some details about this: Using filesystem to have better caching via VFS - #7 by sunil
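If you want to see the cache effect directly, something along these lines makes it visible (a rough sketch: the /mnt/ssd path and the counts are placeholders, and the drop_caches step needs root):

# buffered write to a file on the ext4 mount: it lands in the page cache first, so it looks very fast
dd if=/dev/zero of=/mnt/ssd/cache_test bs=128k count=20000 status=progress

# same write, but forced to reach the device before dd reports completion
dd if=/dev/zero of=/mnt/ssd/cache_test bs=128k count=20000 conv=fdatasync status=progress

# as root: drop the page cache so the next run starts cold
sync; echo 3 > /proc/sys/vm/drop_caches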

Thank you Meher!

One of my goals is to write to a single-node cluster, and I am trying to test the maximum I can get when the storage option is as below:

data-in-memory: false (only the index will be stored in DRAM), with the storage device being either an SSD or a usual rotational disk.

I will re-check the raw device option as you suggested, and I will also consider Kevin's comments.

Regards
Paresh

I know this is an old thread, but I’d like to build on what others wrote.

The perception of low IOPS with large FIO bs values is common. Think about it: it naturally takes longer to write 1MB than it does to write 128KB, so you're not going to be able to complete as many writes per second. Usually, though, the aggregate throughput over time is better with larger blocks.

We see this especially when people are testing network storage on VMs, because VMs are often IOPS- and bandwidth-throttled. It is common for someone to run a test with bs=1k and be disappointed at the throughput, because the VM is throttling IOPS long before it throttles throughput.

Also, when testing against a filesystem vs a device, memory caching can perturb the results, depending on the options used.
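To make that concrete, a pair of fio runs along these lines usually shows the trade-off (a sketch only: the device name, queue depth and runtime are placeholders, direct=1 keeps the page cache out of the picture, and running fio against a raw device destroys its contents):

# small blocks: high IOPS, modest bandwidth
fio --name=small-bs --filename=/dev/sdc --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based

# large blocks: far fewer IOPS, but usually more aggregate throughput over time
fio --name=large-bs --filename=/dev/sdc --rw=write --bs=1m --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based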