Storing >10 billion small keys



I am testing Aerospike to see if it could be a good Cassandra replacement.

Here is my use case:

  • Being able to store around 17 billion keys on a cluster.
  • The structure of the values is a JSON converted into a bin map or a few bins (it does not really matter at the moment as we can change the app logic accordingly). JSON structure:

url: a website URL (string)
timestamp: int
status: boolean

My issue: as the value of each key is relatively small (around 200 B), we will hit the memory limit of our Aerospike nodes well before hitting the SSD capacity limit.

Here is a small capacity-planning test: I inserted a small part of our dataset on one node (replication-factor 1):

  • Number of keys inserted: 10 million
  • Memory used: 610MB
  • Disk used: 3.5GB

Those figures grow linearly if I add more data. Based on that, storing 17B keys would need around 1 TB of memory (or 52 Aerospike nodes with 64 GB of memory each, a replication factor of 2, and 60% of their memory used).
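Extrapolating the sample figures is simple arithmetic; a sketch below (pure Python, no Aerospike involved), which lands within a node or two of 52 depending on rounding:

```python
# Back-of-the-envelope capacity check, extrapolating linearly from the
# 10M-key sample (610 MB RAM, 3.5 GB disk) to 17B keys.
import math

SAMPLE_KEYS = 10_000_000
SAMPLE_RAM_MB = 610
SAMPLE_DISK_GB = 3.5

TARGET_KEYS = 17_000_000_000
REPLICATION = 2
NODE_RAM_GB = 64
HIGH_WATER_PCT = 0.60  # usable fraction of each node's RAM

scale = TARGET_KEYS / SAMPLE_KEYS                # 1700x the sample
ram_gb = SAMPLE_RAM_MB * scale / 1024            # ~1013 GB per data copy
disk_gb = SAMPLE_DISK_GB * scale                 # ~5950 GB per data copy

total_ram_gb = ram_gb * REPLICATION
usable_per_node_gb = NODE_RAM_GB * HIGH_WATER_PCT  # 38.4 GB per node
nodes = math.ceil(total_ram_gb / usable_per_node_gb)

print(f"RAM per copy: {ram_gb:.0f} GB, nodes needed: {nodes}")
```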

Is there a way to reduce Aerospike's memory footprint for this kind of use case? I have thought about putting several JSON documents into one big map bin to create a sort of sub-key bucket (sub-keys grouped by URL domain, for example), but I cannot see how to retrieve a single document without doing too many operations.
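The bucketing idea can be sketched in plain Python, with dicts standing in for records (the `put`/`get` helpers are illustrative, not Aerospike API). With Aerospike's map data type, fetching one document from a bucket would be a single map get-by-key operation on the bucket record, so retrieval need not cost multiple round trips:

```python
# Sketch of the sub-key bucketing idea: one record per domain, with the
# full URL as the map key. In Aerospike this would be a map bin; plain
# dicts stand in for records here.
from urllib.parse import urlparse

store = {}  # bucket key (domain) -> map bin {url: document}

def put(doc):
    domain = urlparse(doc["url"]).netloc          # bucket key, e.g. "example.com"
    store.setdefault(domain, {})[doc["url"]] = doc

def get(url):
    # One primary-index lookup on the domain, then one map lookup by URL;
    # with Aerospike's map operations this is a single server-side request.
    return store.get(urlparse(url).netloc, {}).get(url)

put({"url": "https://example.com/a", "timestamp": 1700000000, "status": True})
put({"url": "https://example.com/b", "timestamp": 1700000100, "status": False})

print(len(store))                       # -> 1 (one record instead of two)
print(get("https://example.com/a")["status"])   # -> True
```

Fewer records means fewer 64 B index entries, at the cost of larger individual records and uneven bucket sizes for popular domains.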

Thanks for your help.

Antoine


Each record always costs 64 B of metadata in the in-memory primary index. I'm not sure whether you're on AWS or on bare metal, but the point is to pick a machine whose disk-to-RAM ratio fits your use case best. On Amazon EC2, there's a fixed ratio for all the instances within each of the i2, r3, and c3 instance-type families.
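That 64 B per record appears to dominate here: 610 MB for 10M keys works out to almost exactly 64 B per key, so the sample's memory use is essentially all primary index. A quick check of the floor it sets:

```python
# The primary index costs 64 bytes per record regardless of value size,
# which sets a hard floor on RAM for 17B keys.
INDEX_BYTES_PER_RECORD = 64
KEYS = 17_000_000_000
REPLICATION = 2

index_gb = KEYS * INDEX_BYTES_PER_RECORD / 1024**3   # ~1013 GB per copy
print(f"index per copy: {index_gb:.0f} GB, "
      f"with replication: {index_gb * REPLICATION:.0f} GB")
```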

The i2 instances have a 26.22:1 disk-to-DRAM ratio, r3 instances a 2.62:1 ratio, and c3 instances a 10.66:1 ratio. With bare metal you could spec a server node that fits your use case closely, allowing for a much smaller cluster and more efficient use of your resources.
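As a rough sketch, you can compare the sample's own disk-to-RAM ratio against those families. "Closest" below is just numeric distance; it ignores whether you'd rather leave disk or RAM headroom:

```python
# Match the measured disk-to-RAM ratio (3.5 GB disk / 610 MB RAM) against
# the EC2 instance-family ratios quoted above.
SAMPLE_DISK_GB = 3.5
SAMPLE_RAM_GB = 610 / 1024

use_case_ratio = SAMPLE_DISK_GB / SAMPLE_RAM_GB      # ~5.9 : 1

families = {"i2": 26.22, "r3": 2.62, "c3": 10.66}
best = min(families, key=lambda f: abs(families[f] - use_case_ratio))
print(f"use-case ratio {use_case_ratio:.1f}:1, closest family: {best}")
```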

Otherwise, you have the option of raising high-water-memory-pct above 60%, once you calculate how much of that reserve you would actually use if a node went down. That depends on the number of nodes in the cluster: in a small three-node cluster, 60% is about right, since the two surviving nodes each absorb roughly half of the failed node's data. The more nodes you have, the less a single node going down increases memory usage on the remaining nodes.
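That headroom rule can be sketched as follows. This is a simplification that assumes data rebalances evenly across the survivors and ignores migration overhead:

```python
# After one of N nodes fails, the survivors absorb its share, so
# steady-state memory usage rises by a factor of N/(N-1). The largest
# safe steady-state fraction is therefore (N-1)/N of the ceiling.
def safe_high_water_pct(n_nodes, ceiling=1.0):
    return ceiling * (n_nodes - 1) / n_nodes

for n in (3, 10, 50):
    print(f"{n} nodes: {safe_high_water_pct(n):.0%}")
# 3 nodes  -> ~67% (close to the 60% default, with a margin)
# 10 nodes -> 90%
# 50 nodes -> 98%
```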

If you’re considering an Enterprise Edition trial, sizing and configuration would be part of Aerospike helping you have a successful PoC. The use case you’ve described is a very common one for people migrating from Cassandra/DataStax to Aerospike.