Very odd raw device behaviour


Hello all,

I have a cluster of 4 servers. One of the namespaces is raw device based. The devices reside on a SAS mechanical hard drive.

Now here is the weird part of the story. I am running one of the tests with small records (2x50 bytes = 100 bytes total). I get write throughput of between 150k and 200k OPS. But when it comes to reading, the throughput drops to 4k OPS!!! Yes, I know - this is mighty weird, and I am totally confused.

The servers show very little load during the reads. iotop and nload show nothing I can identify as a problem.

Here is the device config:

namespace test-raw {
        replication-factor 4
        memory-size 16G
        default-ttl 7200
        max-ttl 2D
        high-water-disk-pct 80
        high-water-memory-pct 60
        stop-writes-pct 90
        partition-tree-locks 64
        partition-tree-sprigs 4096

        storage-engine device {
                device /dev/sdb1
                write-block-size 1M
                max-write-cache 8G
                data-in-memory false
                cold-start-empty true
        }
}

Any insight would be much appreciated.




This is a hard disk problem. For reference, from the FAQ:

Can I store data on hard disk rather than SSD?

No. The Aerospike database is intended to be a high-performance, low-latency database. Because of this, the physical limitations of rotational disks add an unacceptable amount of latency to the data.

Cross posted on serverfault.


Hmmm interesting … thinking aloud … since your records are only 100 bytes, you are probably using 256 bytes per record (with overhead and the 128-byte boundary). With a write-block-size of 1 MB (the default), you are fitting about 4K records into a 1 MB buffer in RAM while writing, which is asynchronously flushed to disk as a 1 MB block. On read, you are reading individual records from disk in 128-byte chunks. If you are reading a recently updated record, you are probably getting it from the post-write queue in RAM; otherwise you are accessing the disk. So your read delay is coming from the slow performance of the disk for records that have to be fetched from it.

If the write-block-size were 128K, you would fit about 500 records per block. You can play with write-block-size on a test cluster and see if the performance tracks.

Check the write-q value in /var/log/aerospike/aerospike.log to see if the disk is slow. If the disk is not the bottleneck, write-q will stay at zero under write load. You have a very large max-write-cache - 8G - (64M is the default), which is also helping you with the writes. You can also test by reducing post-write-queue to a very small number and seeing if read throughput gets worse.
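To make the arithmetic above concrete, here is a small sketch of the records-per-block estimate. The 64-byte per-record overhead is an illustrative assumption chosen so that a 100-byte record rounds up to 256 bytes on a 128-byte boundary, matching the reasoning above; it is not a measured value.

```python
RECORD_BYTES = 100    # 2 bins x 50 bytes, per the original post
OVERHEAD_BYTES = 64   # assumed per-record overhead (illustrative only)
BOUNDARY = 128        # records are written on 128-byte boundaries

def stored_size(record_bytes, overhead=OVERHEAD_BYTES, boundary=BOUNDARY):
    """Round record-plus-overhead up to the next 128-byte boundary."""
    raw = record_bytes + overhead
    return ((raw + boundary - 1) // boundary) * boundary

def records_per_block(block_size, record_bytes=RECORD_BYTES):
    """Whole records that fit in one write block."""
    return block_size // stored_size(record_bytes)

print(stored_size(RECORD_BYTES))          # 256 bytes on disk per record
print(records_per_block(1024 * 1024))     # 1 MB block  -> 4096 records
print(records_per_block(128 * 1024))      # 128 KB block -> 512 records
```

This is why a smaller write-block-size means fewer records per block: each flushed block carries roughly block-size / 256 of these small records.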