We are running Aerospike 3.14.1.9 with 4 nodes, and we noticed that each time there are hot keys (as reported by the server), many other keys may also see slow writes (> 512 ms).
Is this expected in Aerospike? Is there any configuration to reduce the impact of a hotkey, so that writes to normal keys are not slowed down?
I tried using a separate namespace to isolate the likely hotkeys, but even then, some normal writes in one namespace are slow while the hotkey activity is happening in the other namespace.
service {
    user app
    group app
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    service-threads 12
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
    proto-fd-idle-ms 60000
    work-directory /data/aerospike
    pidfile /data/aerospike/asd.pid
}
namespace test {
    replication-factor 3
    memory-size 60G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    stop-writes-pct 75
    # storage-engine memory
    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    storage-engine device {
        file /data/aerospike/test.dat
        filesize 500G
        data-in-memory false
        write-block-size 1024K
    }
}
If you have enterprise support, the Aerospike folks can give you some good advice… If not, then the first thing that I can say is that there is no great way to address a hotkey. Was it a read or a write hotkey? Do you have the log entry? Other than changing the data model, you could benefit from using faster storage like an NVMe drive, or going all in-memory (if you can)… There are some great resources and FAQs available on the website about hotkeys, have you looked at those yet? Search results for 'hotkey' - Aerospike Community Forum
We know there are sometimes hotkeys (mostly writes), and I am aware that the hotkeys themselves can be slow. However, I just want to make sure they do not affect other, normal keys. Hotkeys cannot be totally avoided because of burst traffic in the business. We are already using faster SSDs, and from the output of the asloglatency tool the write TPS is not very high, so I am not sure storage is the bottleneck.
I did not find any useful resource in the search results, so I am asking here to make sure I am not missing some useful advice.
If Aerospike itself cannot properly handle write hotkeys at such a low TPS on SSD, maybe we need to try other methods, such as isolating the hotkeys from normal keys ourselves.
I think I can try setting transaction-pending-limit=5 so that hotkey transactions return more quickly and put less load on the server, and also increasing service-threads to the number of CPU cores so the server can handle more requests. Any advice on these?
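For reference, this is roughly what I have in mind for the service section; the thread and queue counts are placeholders to be sized to the actual core count of our machines (assumed 24 cores here), not recommendations:

service {
    user app
    group app
    paxos-single-replica-limit 1
    service-threads 24                 # placeholder: match the number of CPU cores
    transaction-queues 24              # placeholder: commonly also sized to core count
    transaction-threads-per-queue 4    # placeholder
    transaction-pending-limit 5        # default is 20; a lower limit rejects piled-up hotkey transactions sooner (error code 14)
    proto-fd-max 15000
    proto-fd-idle-ms 60000
}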
You speak as though your data model is impossible to improve upon, but I’d like to challenge that assumption and also get more info about your hardware. Lowering the transaction pending limit should definitely help limit the effects of hotkeys, but it’s really more of a band-aid.
Can you give us more details on your data model, use case, and hardware specs? Why are you using a ‘file’ instead of direct drive access? What is your workload (TPS, R/W/U ratio, object size)?
Because we are serving nearly a hundred different businesses, it is hard to ask them to change their data models in a short time. So we want to make sure a hotkey in one business does not affect the others.
For hardware, we are using SATA SSDs, and we ran a benchmark earlier with 4 nodes. The results showed the cluster can achieve over 200K/s reads and 80K/s update writes.
In production we are running about 30K/s reads and 8K/s update+insert writes, and most object sizes are in the 10~1000 byte range.
I am impressed with Aerospike’s performance under most workloads. It would be even better if it could contain hotkeys so they do not impact the whole cluster. In our case the write TPS is low, but with 1 or 2 hotkeys the latency of many normal keys increases to over 500 ms. I think it can do better, or am I missing some tuning options?
If you were using partitions, the effect of the hot key should be limited to that partition, I think, because each device gets its own write queue. There is an additional defrag thread and queue for each device, so there would be some more overhead, but that is tunable. Performance should also improve because you’d be bypassing the filesystem cache, and any hotkeys would only affect a particular partition instead of the entire namespace…
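As a rough sketch of what I mean (the device paths are placeholders, not a recommendation for your hardware), the storage-engine stanza could list several raw partitions instead of a single file, giving each its own write and defrag queues:

namespace test {
    replication-factor 3
    memory-size 60G
    storage-engine device {
        # raw device access bypasses the filesystem page cache;
        # each device gets its own write queue and defrag queue
        device /dev/sdb1    # placeholder device paths
        device /dev/sdb2
        device /dev/sdb3
        device /dev/sdb4
        data-in-memory false
        write-block-size 1024K
    }
}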
If you want to fix your hotkey issue, you have to look at either dealing with it in the writing application (for example by buffering/batching/combining the incoming writes) or changing the data model… Tweaking things might get you in a better place, or maybe even moving to in-memory, but you need to look at changing the application write behavior or the data model as the long-term solution.
So you mean the write queue in Aerospike is shared across the whole namespace, not per partition? Maybe this can be optimized in the future?
Another problem is that a different namespace also showed slow writes while the hotkeys were not in that namespace. And we noticed that iostat showed the I/O is not busy at all, so I am not sure changing the storage-engine configuration will help, but I will try it.
The write queue is per partition. If you have a write hotkey and only 1 partition in the namespace, then all writes for that namespace will be affected. If you split it up into multiple partitions, the impact should be confined to the particular partition experiencing the hotkey, I think.
The write transaction only queues the write in memory; the write queue is processed asynchronously from the transaction, so these writes will not block on the write queue.
The hot writes may decrease the effectiveness of the post-write-queue. The post-write-queue contains the write buffers for recently written write blocks. Different versions of a record can appear in a write block multiple times, so for certain hot-key patterns the hot keys could dominate the post-write-queue, causing other keys to have to go to disk when they normally wouldn’t. This would increase disk load. Try increasing the post-write-queue size to see if the issue is reduced. Note that we do not recommend running in a mode that relies on the post-write-queue; you should benchmark your disks with ACT, which does not simulate this cache, to make sure they can handle such scenarios.
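As an illustration only (the value here is not a recommendation), post-write-queue is set per device in the storage-engine block, and its memory cost is post-write-queue x write-block-size per device, so 1024 x 1M would be 1 GiB for your single file:

namespace test {
    replication-factor 3
    memory-size 60G
    storage-engine device {
        file /data/aerospike/test.dat
        filesize 500G
        data-in-memory false
        write-block-size 1024K
        post-write-queue 1024    # default is 256 write-blocks per device; 1024 x 1M = 1 GiB of cache here
    }
}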
To explain how other namespaces are being impacted: all namespaces share the transaction queues and threads defined in the service section. If reading from disk becomes a bottleneck, these threads may be blocked servicing those reads.
I believe by ‘partition’ he is referring to the 4096 ‘partitions’ that a namespace’s keys are distributed across, not ‘disk partitions’. ‘Partition’ is a bit of a loaded term around here.
We recently added this to our stats; unfortunately, your version doesn’t have it. However, there is a log line that displays the stat in the version you are running. Look for lines similar to the following:
These need a restart to change; we normally configure them based on core count. If a change is needed, I suspect it would be either transaction-queues or transaction-threads-per-queue. I’d suggest trying to identify the bottleneck with microbenchmarks first:
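The exact commands aren’t shown above, but enabling the per-namespace write microbenchmarks dynamically is usually along these lines (namespace name and log path assumed), after which asloglatency can break the write path into its slices:

asinfo -v "set-config:context=namespace;id=test;enable-benchmarks-write=true"

# then inspect the per-slice histograms written to the server log, e.g.:
asloglatency -l /var/log/aerospike/aerospike.log -h {test}-write-master
asloglatency -l /var/log/aerospike/aerospike.log -h {test}-write-repl-write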
There can be a wide range of causes for those symptoms… from a single node with subpar performance (whether hardware or software related, for example a noisy neighbor in a public cloud) to something in the workload itself (hotkeys). For hotkeys, you can refer to the article on error code 14.
A single node slowing down in a cluster (again, whether caused by a hotkey or something else) will make all replica writes to that node slower. This causes a higher number of connections (the clients compensate for the higher latency and try to preserve throughput by increasing the connection count), higher latency on the write-repl-write slice, and more write-restarts (transactions having to be restarted because the previous transaction for the same primary key is still being processed).
A proper log analysis across all nodes would typically help get to the bottom of such issues.
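As a sketch only (hostnames and log path are placeholders, and this assumes the write microbenchmarks mentioned above are enabled), comparing the replica-write slice across nodes can surface a single slow node; the node whose {test}-write-repl-write histogram is consistently worse than its peers is the one to dig into:

for host in node1 node2 node3 node4; do
    echo "== $host =="
    ssh "$host" asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-write-repl-write'
done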
We found the reason. It was the network between the Aerospike nodes. We had 4 nodes, and the network bandwidth to one of them was not sufficient for us. As a result, the write-repl-write metric increased.
The solution was rebalancing the network load.
Thanks a lot for the help!