Bulk/Batch Updates (AER-6499)

A few notes for now:

Use RPS (Receive Packet Steering) when your NIC exposes only a single tx/rx queue. See the networking best practices in our Amazon EC2 guide, and consider moving to a multiqueue NIC.
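A minimal sketch of checking for a single-queue NIC and enabling RPS on it, assuming the interface is named eth0 and you want to spread receive processing across CPUs 0-3 (adjust the name and CPU bitmask to your machine):

```shell
# A single-queue NIC shows only rx-0 here; a multiqueue NIC shows rx-0..rx-N.
ls /sys/class/net/eth0/queues/

# Hexadecimal CPU bitmask: f = CPUs 0-3. Interface name and mask are
# assumptions for illustration; pick the CPUs local to the NIC's NUMA node.
echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
```

This setting does not persist across reboots, so it is usually placed in a boot-time script.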

Why is your high-water-memory-pct set to 85%? You have a 4-node cluster, and you don't want the surviving nodes to exceed 80% memory use if one node goes down, since its data will be redistributed across the remaining three. So: 80% * (4 - 1) / 4 = 60%.
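The sizing rule above can be written as a one-liner you can rerun for other cluster sizes (the 80% post-failure ceiling and 4-node count are the values from this thread):

```shell
# Scale the post-failure memory ceiling by (N-1)/N: if one of N nodes fails,
# the survivors absorb its share, so the steady-state high-water mark must
# leave that headroom.
ceiling=80   # max acceptable usage after losing a node
nodes=4
echo $(( ceiling * (nodes - 1) / nodes ))
```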

You should not set your high-water-disk-pct above 50% unless you also raise defrag-lwm-pct to the same value. As both go above 50%, you will incur much more defrag work, because each large block eligible for defrag yields less reclaimable space per pass, so more blocks must be read and rewritten. See the knowledge base articles below:

Next, grep your log for cache-read-pct. You have plenty of DRAM defined for the namespace, so check whether dynamically raising the namespace's post-write-queue from the default of 256 blocks per device toward the max of 2048 (4096 in version >= 3.16) gives you a better cache hit rate. This is especially useful when reads follow writes closely, as is the case with updates.
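A sketch of checking the current hit rate and raising the queue dynamically; the log path and the namespace name (myns) are assumptions, and post-write-queue applies per device, so account for the extra write-block-sized buffers of DRAM it pins:

```shell
# Recent cache-read-pct values from the ticker lines (path is distro-dependent).
grep -o 'cache-read-pct [0-9.]*' /var/log/aerospike/aerospike.log | tail

# Raise post-write-queue at runtime, no restart needed; re-check the hit
# rate afterward and step further up if it keeps improving.
asinfo -v 'set-config:context=namespace;id=myns;post-write-queue=1024'
```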

Last thing for now: it doesn't matter what random read IOPS Samsung publishes for the PM863. Those tests do not simulate a database workload. In Aerospike you'll be combining random reads, random writes, and large-block reads for defrag, all at the same time. This is why Aerospike published the open source ACT tool. The PM863 has previously been rated as a 9x SSD (though you might be using a newer model that rates better). Since a 1x ACT load is defined as 2Ktps reads concurrent with 1Ktps writes, this means that with the standard 1.5K object and 128K write block it was measured to sustain 18Ktps reads concurrent with 9Ktps writes for 24 hours, while keeping no more than 5% of reads over 1ms, 1% over 8ms, and 0.1% over 64ms.

You can use ACT to measure your specific workload against your own spec, and use that information to size your cluster for your needs. Do not expect your SSD to hold up to more than what it was measured to do sustainably.
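A rough sketch of an ACT run; the binary locations, config file name, and device path below are assumptions based on a typical checkout of the ACT repo, so verify the exact commands against the README of the version you build:

```shell
# DESTROYS all data on the device: salt it so reads hit realistic data.
sudo ./target/bin/act_prep /dev/nvme0n1

# Run the storage test (typically 24h) with your workload defined in the
# config file: object size, read/write rates, device list, etc.
sudo ./target/bin/act_storage actconfig.txt > act_out.txt

# Summarize the latency histograms against the pass/fail thresholds.
./analysis/act_latency.py -l act_out.txt
```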