I can store 1 billion records if the cluster is new but can't reload it


#1

I’ve configured a three-node cluster, each machine having 64GB of memory and a 1TB SSD. I can start the cluster up fresh with my namespace, allocated 60GB, and I seem to be able to load 1 billion records with no problem (though this may not be true). However, if I stop the service and restart it, the log indicates that it loads about half the records and then spends an eternity trying to make more room. It never succeeds.

Has anyone loaded a billion records into an Aerospike cluster, and if so, what configuration and machine setup did you use?


#2

You’d have to share your config and also clarify if that capacity you described is for the cluster or per-node.

1 billion objects * 64B * replication-factor 2 = 128GB (the 64B is the primary index entry kept in memory for each record). That wouldn’t fit on a single node, so I believe you’re talking about each node having 64GB of DRAM, 128GB for the cluster.

Assuming an even distribution of the data across the nodes, this is about 42.67GB per node. That is over the memory high-water mark, which at the default of 60% is (60GB * 0.6 =) 36GB. You’ll be kicking into evictions, especially when you take down a node. With your data distributed over the two remaining machines you’d be using 128GB of the 120GB available to the cluster…
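The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this post, it assumes 64 bytes of index per record and a perfectly even distribution, and it uses decimal GB as the figures above do.

```python
INDEX_ENTRY_BYTES = 64  # per-record primary index cost used in the estimate above


def index_memory_gb(records, replication_factor):
    """Cluster-wide primary index memory in decimal GB (hypothetical helper)."""
    return records * INDEX_ENTRY_BYTES * replication_factor / 1e9


total_gb = index_memory_gb(1_000_000_000, 2)   # 1 billion records, RF 2
per_node_gb = total_gb / 3                     # even spread over 3 nodes
high_water_gb = 60 * 0.60                      # 60GB namespace, 60% high-water mark

print(f"cluster total:       {total_gb:.0f} GB")
print(f"per node (3 nodes):  {per_node_gb:.2f} GB")
print(f"high-water per node: {high_water_gb:.0f} GB")
print(f"over high-water:     {per_node_gb > high_water_gb}")
```

Changing the node count or the namespace size in the last few lines shows quickly whether a given layout stays under the high-water mark.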

Take a look at the capacity planning article in the deployment guide. You want to make sure to have enough capacity to hold all your data on an N-1 cluster without breaching the high-water-memory-pct.


#3

Thanks for that information. :grinning:

The following is the current relevant config I’m testing with (using three instances). It specifies twice the actual memory available on any one machine. I’m in the process of causing failures and restarts to see what happens. Testing takes a long time because of the size of the data.

So my follow-up question is: if I increase the cluster to six instances of 64G and only specify 56G per instance, will that spread the load across the six machines in case of failure? That is, if one of the six fails, will that prevent any one machine from running out of memory and thus attempting evictions? I am specifically disallowing evictions because once the namespace is loaded, it is only read. Subsequent versions of the data get loaded into a second cluster on a periodic basis, and the running software is reconfigured to point to the second cluster.

namespace XXX {
        replication-factor 2
#       memory-size 56G
        memory-size 128G
#       default-ttl 30d # 30 days, use 0 to never expire/evict.
        default-ttl 0 # 0 = never expire/evict.

        #storage-engine memory

        # To use file storage backing, comment out the line above and use the
        # following lines instead.
        storage-engine device {
#               file /opt/aerospike/data/XXX.dat
                device /dev/sdb
                #filesize 512G
                write-block-size 128K
                data-in-memory false
        }
}

#4

With 6 nodes the high-water-memory-pct can be set to 66%, and you’ll be fine even with a node going down. 56G x 6 = 336G for the cluster which should be plenty for 1B master objects and replication factor 2.

There are other ways to avoid evictions, rather than just setting a ‘never expire’ TTL on all the records. Do you really want to do this? If you’re loading data daily, how do you get rid of the previous day’s data? You could use truncate to get rid of the previous day’s data.

With the extra DRAM you now have, take a look at the post-write-queue, which may improve your read performance.


#5

Thanks. The nature of the data is that it is fully loaded periodically and then just read. So the expiration is set to 0 as we never want it to go away.

Regarding the 6 nodes, if the calculation is 56G x 6 = 336G, would 4 nodes work as well (with a failure)? 56G x 4 = 224G. Would the high-water mark be the same?

The default value for post-write-queue is 256. What do you recommend as a value for this situation, considering there will be nothing but reads once loaded?

And again, thanks for your help.


#6

1B objects with replication factor 2 need 128G across the cluster. If your 4 node cluster loses a node, that will divide across 56G * 3 = 168G of capacity.

128G / 168G = 76%, so you’re not yet at stop writes at that point.
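To compare cluster sizes the same way, here is a rough Python sketch of the N-1 calculation. The function name is hypothetical, and it assumes 128G of index data, 56G of namespace memory per node, and an even spread over the surviving nodes. The 60% figure is the default high-water mark mentioned earlier in the thread.

```python
def utilization_after_one_failure(nodes, memory_per_node_gb, data_gb):
    """Fraction of namespace memory in use once one node is down (hypothetical helper)."""
    surviving_capacity_gb = (nodes - 1) * memory_per_node_gb
    return data_gb / surviving_capacity_gb


HIGH_WATER = 0.60  # default high-water mark referred to earlier in the thread

for nodes in (3, 4, 6):
    used = utilization_after_one_failure(nodes, 56, 128)
    status = "under" if used < HIGH_WATER else "over"
    print(f"{nodes} nodes, one down: {used * 100:.0f}% used ({status} the 60% high-water mark)")
```

A result above 100% means the surviving nodes cannot hold the data at all, which is the situation a three-node cluster of this size ends up in.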

You can either raise or lower the post-write-queue, it really depends on your cache-hit percentage. You’ll have enough memory, so it’s worth the expense if you’re getting a decent cache hit.

tail -f /var/log/aerospike/aerospike.log | grep cache-read
asinfo -v 'set-config:context=namespace;id=<NAMESPACE>;post-write-queue=512'
tail -f /var/log/aerospike/aerospike.log | grep cache-read

The cost in terms of memory is post-write-queue x devices x write-block-size. In your case this is 256 x 1 x 128K = 32MB. Even maximizing it to 2048 gives 2048 x 1 x 128K = 256MB. Either way, this isn’t much memory to spend.
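As a quick sanity check on that formula, a minimal Python sketch (the function name is made up for illustration; it assumes one device and a 128K write-block-size, as in the config posted above):

```python
KIB = 1024
MIB = 1024 * KIB


def post_write_queue_bytes(queue_blocks, devices, write_block_size_bytes):
    """Memory cost: post-write-queue x devices x write-block-size (hypothetical helper)."""
    return queue_blocks * devices * write_block_size_bytes


for queue_blocks in (256, 512, 2048):
    cost = post_write_queue_bytes(queue_blocks, devices=1, write_block_size_bytes=128 * KIB)
    print(f"post-write-queue={queue_blocks}: {cost // MIB} MiB")
```

With more than one device per namespace the cost scales linearly, so multiply accordingly.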

When you find the correct value for your use case, make sure to set it in your aerospike.conf so that it persists across restarts.