Does a hot write cause many slow writes across the whole cluster?

We are running Aerospike 3.14.1.9 with 4 nodes, and we noticed that each time there are hot keys (reported by the server), many other keys also see slow writes (> 512 ms).

Is this expected in Aerospike? Is there any configuration to reduce the impact of a hotkey, so that writes on normal keys are not slowed down?

I tried using a different namespace to isolate the likely hotkeys; however, even in that separate namespace some normal writes are slow while a hotkey is active in the other namespace.

asloglatency -h write -N test -n 11 -f -04:18:00 -d 500 -n 11 -e 1

{test}-write
Sep 20 2018 07:01:46
               % > (ms)
slice-to (sec)      1      2      4      8     16     32     64    128    256    512   1024    ops/sec
-------------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ----------
07:01:56    10   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      839.0
07:02:06    10   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      752.4
07:02:16    10   0.05   0.04   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      769.3
07:02:26    10   1.58   1.57   1.55   1.51   1.48   1.45   1.41   1.24   0.99   0.56   0.00     1610.9
07:02:36    10   1.86   1.82   1.78   1.78   1.77   1.73   1.61   1.41   1.06   0.25   0.00      851.7
07:02:46    10   0.05   0.02   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      841.2
07:02:56    10   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      845.0
07:03:06    10   0.03   0.03   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      757.5
07:03:17    11   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      674.1
07:03:27    10   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00     1630.4
07:03:37    10   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      882.7
07:03:47    10   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      842.0
07:03:57    10   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      820.8
07:04:07    10   0.01   0.01   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      724.0
07:04:17    10   3.85   3.78   3.75   3.75   3.71   3.59   3.35   2.97   2.26   1.17   0.03      725.1
07:04:27    10   0.46   0.44   0.42   0.40   0.37   0.35   0.27   0.18   0.01   0.00   0.00     1625.3

Below is my config:

service {
    user app
    group app
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    service-threads 12
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
    proto-fd-idle-ms 60000
    work-directory /data/aerospike
    pidfile /data/aerospike/asd.pid
}
namespace test {
    replication-factor 3
    memory-size 60G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    stop-writes-pct 75
    #   storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    storage-engine device {
        file /data/aerospike/test.dat
        filesize 500G
        data-in-memory false 
        write-block-size 1024K
    }
}

If you have enterprise support, the Aerospike folks can give you some good advice… If not, then the first thing I can say is that there is no great way to address a hotkey. Was it a read or a write hotkey? Do you have the log entry? Other than changing the data model, you could benefit from faster storage like an NVMe drive, or going all in-memory (if you can)… There are some great resources and FAQs available on the website about hotkeys; have you looked at those yet? Search results for 'hotkey' - Aerospike Community Forum

Thanks very much for your advice.

We know there are sometimes hotkeys (mostly writes), and I am aware that the hotkeys themselves can be slow. However, I want to make sure they do not affect other, normal keys. Hotkeys cannot be totally avoided because of burst traffic in the business. We are already using fast SSDs, and judging from the asloglatency output, the write TPS is not very high, so I am not sure storage is the bottleneck.

I did not find any useful resource in those search results, so I am asking here to make sure I am not missing some useful advice.

If Aerospike itself cannot handle write hotkeys properly at such a low TPS on SSDs, maybe we need to try other methods, for example isolating the hotkeys from the normal keys ourselves.

I think I can try lowering transaction-pending-limit to 5 so that hotkey transactions fail fast and put less load on the server. Increasing service-threads to the number of CPU cores, so the server can handle more requests, might also help. Any advice on these?
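For reference, this is roughly what I am considering in aerospike.conf (just a sketch; 24 is a placeholder for our actual core count, and 5 is only the value I want to experiment with):

service {
    ...
    service-threads 24               # placeholder: set to the number of CPU cores on the box
    transaction-queues 8
    transaction-threads-per-queue 8
    ...
}

namespace test {
    ...
    transaction-pending-limit 5      # default is 20; a lower value fails pile-ups on one key faster (error code 14)
    ...
}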


You speak as though your data model is impossible to improve upon, but I’d like to challenge that assumption and also get more info about your hardware. Lowering the transaction pending limit should definitely help limit the effects of hotkeys, but it’s really more of a band-aid. Can you give us more details on your data model, use case, and hardware specs? Why are you using a ‘file’ instead of direct device access? What is your workload (TPS, R/W/U ratio, object size)?

Because we are running nearly a hundred different businesses on this cluster, it is hard to ask them all to change their data models on short notice. So we want to make sure that one business’s hotkey does not affect the others.

For hardware, we are using SATA SSDs, and we benchmarked the 4 nodes earlier. The results showed the cluster can achieve over 200K reads/s and 80K update writes/s.

For online usage, we are running about 30K reads/s and 8K update+insert writes/s, and most objects are 10~1000 bytes.

I am impressed with Aerospike’s performance for most of our usage. It would be even better if it could handle hotkeys in a way that limits the impact on the whole cluster. In my case, we have a low write TPS, but with 1 or 2 hotkeys the latency of many normal keys increases to over 500 ms. I think it can do better, or am I missing some tuning options?

Thank you.

   storage-engine device {
        file /data/aerospike/test.dat
        filesize 500G
        data-in-memory false 
        write-block-size 1024K
    }

Why not

storage-engine device {
        device /dev/mydevicepartition1
        device /dev/mydevicepartition2
        device /dev/mydevicepartition3
        data-in-memory false 
        write-block-size 1024K
    }

If you were using partitions, the effect of the hotkey should be limited to that partition, I think, because each device gets its own write queue. There is an additional defrag thread and queue for each device, so there would be some more overhead, but it is tunable. Performance should also improve because you’d be bypassing the filesystem cache, and any hotkeys would only affect a particular partition instead of the entire namespace…
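If you do go the multi-device route, you can keep an eye on the per-device queues in the log. For example (assuming the default log location; the exact wording of the device health lines varies a bit between versions):

grep -E 'write-q|defrag-q' /var/log/aerospike/aerospike.log | tail -n 20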

If you want to fix your hotkey issue, you have to look at either dealing with it from the writing application (for example by buffering/batching/combining the incoming writes) or changing the data model… Tweaking settings might get you to a better place, maybe even moving to in-memory, but you need to look at changing the application write behavior or the data model as the long-term solution.

Thanks for your advice.

So you mean the write queue in Aerospike is shared across the whole namespace, not per partition? Maybe this can be optimized in the future?

Another problem is that a different namespace also showed slow writes while the hotkeys were not in that namespace. We also noticed that iostat showed the I/O was not busy at all. So I am not sure changing the storage-engine configuration will help, but I will try it.

The write queue is per partition. If you have a write hotkey and only 1 partition in the namespace, then all writes for that namespace will be affected. If you split it up into partitions, it should limit the impact to the particular partition experiencing the hotkey, I think.

The write transaction only queues the write in memory; the write queue is processed asynchronously to the transaction, so these writes will not block on the write queue.

The hot writes may decrease the effectiveness of the post-write-queue. The post-write queue holds the write buffers for recently written write blocks. Different versions of a record can appear in write blocks multiple times, so for certain hotkey patterns the hot keys could dominate the post-write queue, forcing other keys to go to disk when they normally wouldn’t. That would increase disk load. Try increasing the post-write-queue size to see if the issue is reduced. Note that we do not recommend running in a mode that relies on the post-write-queue; you should benchmark your disks with ACT, which does not simulate this cache, to ensure your disks can handle such scenarios.
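For example, based on the config you posted (512 is only an illustrative value; each unit is one write block per device or file, so the cache costs roughly post-write-queue x write-block-size of memory):

    storage-engine device {
        file /data/aerospike/test.dat
        filesize 500G
        data-in-memory false
        write-block-size 1024K
        post-write-queue 512    # default is 256 write blocks
    }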

To explain how other namespaces are being impacted: all namespaces share the transaction queues and threads defined in the service section. If reading from disk becomes a bottleneck, those threads may be blocked servicing the reads.

I believe by ‘partition’ he is referring to the 4096 ‘partitions’ that a namespace’s keys are distributed across, not ‘disk partitions’. ‘Partition’ is a bit of a loaded term around here :smile:.

Nope. I meant the post write queue. Listen to @kporter not me.

Yeah, you got me. I meant the 4096 partitions. :smile:

I did not see high iowait, so maybe transaction-queues and service-threads should be increased?

Are there any stats that would tell me whether the post-write-queue is full?

We recently added this to our stats; unfortunately your version doesn’t have it. However, there is a log line that displays the value in the version you are running. Find lines similar to the following:

{ns_name} device-usage: used-bytes 2054187648 avail-pct 92 cache-read-pct 12.35

The ‘cache-read-pct’ value is what you want.
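A quick way to pull those values out of the log (assuming the default log location; adjust the path for your setup):

grep 'cache-read-pct' /var/log/aerospike/aerospike.log | tail -n 20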

Those need a restart to change; we normally configure them based on core count. If anything needs increasing, I suspect it would be either transaction-queues or transaction-threads-per-queue. I’d suggest trying to identify the bottleneck with the microbenchmarks first:

https://www.aerospike.com/docs/operations/monitor/latency/index.html
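The write and storage benchmark histograms can be enabled dynamically per namespace, something like the following (assuming your namespace is named test; they add a little overhead, so switch them off again when you are done):

asinfo -v 'set-config:context=namespace;id=test;enable-benchmarks-write=true'
asinfo -v 'set-config:context=namespace;id=test;enable-benchmarks-storage=true'

The resulting histograms can then be sliced with asloglatency in the same way as the {test}-write histogram earlier in the thread.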

I saw cache-read-pct 78.25; however, it did not change much during the hot writes. It was mostly above 70%.


I noticed that while the slow writes happened, many connections were reset by the server, like below:

 parse result error: read tcp 10.255.202.27:16320->10.255.202.22:3000: read: connection reset by peer

In what cases will the server reset the connection?

What’s the value of proto-fd-max?

The server will close sockets when this value is breached (which will log warnings) or when a socket has been idle for longer than proto-fd-idle-ms.
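You can double-check both the configured limit and the current connection count on each node with something like:

asinfo -v 'get-config:context=service' | tr ';' '\n' | grep proto-fd
asinfo -v 'statistics' | tr ';' '\n' | grep client_connections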

proto-fd-max is 15000 and there is no warning log about closed sockets. The number of server connections is below 6000.

I thought closing an idle connection should return EOF, not a reset.

Maybe the accept/listen backlog is filling up during bursts of requests?
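To check that, I plan to look at the kernel's listen-queue counters and the backlog on port 3000 (standard Linux tooling, nothing Aerospike-specific):

netstat -s | grep -i listen          # reports how often the listen queue of a socket overflowed
ss -lnt '( sport = :3000 )'          # for a listening socket, Recv-Q is the current accept queue and Send-Q the backlog limit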

Looks like we have the same issue. We are using Aerospike Community version 4.5.1. We have:

  1. 4 nodes
  2. 2 namespaces: one with data on disk and one with data in memory
  3. 4 SSD disks for data on each server, used as devices

From time to time (correlated with increased load) we see the same symptoms:

  1. increased latency for writes in both namespaces (histogram {ns}-write)
  2. an increased number of connections to the servers
  3. no increased load on the disks; disk load looks fine, and there is no impact on the histograms from enable-benchmarks-storage
  4. some hot-key errors in the metrics
  5. an increase in the {ns}-write-repl-write histogram in both namespaces
  6. an increase in the {ns}-write-restart histogram in both namespaces
  7. no impact on the histograms /dev/sd*-write, {ns}-write-master, {ns}-write-response, {ns}-read

Did you find the reason for the issue? What could it be?

There can be a wide range of causes for those symptoms, from a single node with subpar performance (whether hardware or software related, such as a busy neighbor in a public cloud) to something in the workload itself (hotkeys). For hotkeys, you can refer to the article on error code 14.
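To confirm whether hotkeys are in play, you can check the key-busy counter on each node, for example (assuming a namespace named test; the exact stat name can vary slightly between versions):

asinfo -v 'namespace/test' | tr ';' '\n' | grep key_busy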

A single node slowing down in a cluster (again, whether caused by a hotkey or something else) will make all replica writes to that node slower, causing a higher number of connections (the clients compensate for the higher latency and try to preserve throughput by increasing the connection count), higher latency in the write-repl-write slice, and higher latency in write-restart (transactions having to be restarted because a previous transaction on the same primary key is still being processed).

A proper log analysis across all nodes would typically help get to the bottom of such issues.
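For example, slicing the replica-write benchmark histogram on every node and comparing them side by side usually makes the slow node stand out. Adapting the asloglatency invocation from earlier in the thread (this assumes the benchmarks are enabled and the namespace is named test):

asloglatency -h '{test}-write-repl-write' -f -01:00:00 -d 3600 -n 7 -e 1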

We found the reason. It was the network between the Aerospike nodes. We had 4 nodes, and the network to one of them was not sufficient for our load. As a result, the write-repl-write metric increased.

The solution was rebalancing the network load. Thanks a lot for the help!
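Glad it's resolved. For anyone else hitting this, a quick way to spot a saturated node is to watch per-interface throughput on each node while the latency spike is happening (this uses sysstat's sar; interface names and limits will differ per environment):

sar -n DEV 1 10      # look for a node whose rxkB/s or txkB/s is pinned near the NIC's capacity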
