I am testing Aerospike to see if it could be a good Cassandra replacement.
Here is my use case:
- Being able to store around 17 billion keys on a cluster.
- The structure of the values is a JSON document converted into a bin map or a few bins (it does not really matter at the moment, as we can change the app logic accordingly). JSON structure:
  - url: a website URL (string)
  - timestamp: int
  - status: boolean
My issue: As the value of each key is relatively small (around 200B), we will hit the memory limit of our Aerospike nodes way before hitting the SSD disk space limit.
Here is a small capacity planning test: I have inserted a small part of our data-set on one node (replication-factor 1):
- Number of keys inserted: 10 million
- Memory used: 610MB
- Disk used: 3.5GB
Those figures grow linearly if I add more data. So based on that, to store 17B keys we will need around 1TB of memory (or 52 Aerospike nodes with 64GB of memory each, with a replication-factor of 2 and 60% of their memory used).
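For reference, the extrapolation can be written out as a short calculation. This is only a sketch of the arithmetic: it assumes decimal units (1GB = 1000MB) and strictly linear growth, and the function and constant names are just illustrative. With the unrounded inputs it gives about 1,037GB for a single copy of the data and a node count slightly above the rounded figure quoted above.

```python
import math

# Measured on one node with replication-factor 1 (figures from the test above).
SAMPLE_KEYS = 10_000_000
SAMPLE_MEMORY_MB = 610

# Target data-set and node sizing, as stated in the post.
TARGET_KEYS = 17_000_000_000
REPLICATION_FACTOR = 2
NODE_MEMORY_GB = 64
MEMORY_USAGE_RATIO = 0.60  # fraction of node memory we allow Aerospike to use


def estimate_cluster(target_keys=TARGET_KEYS):
    """Linearly extrapolate the sample to target_keys.

    Returns (memory for one copy of the data in GB, number of nodes needed).
    Assumes decimal units: 1 GB = 1000 MB.
    """
    mb_per_key = SAMPLE_MEMORY_MB / SAMPLE_KEYS           # about 61 bytes per key
    one_copy_gb = target_keys * mb_per_key / 1000         # memory at replication-factor 1
    total_gb = one_copy_gb * REPLICATION_FACTOR           # memory for all replicas
    usable_per_node_gb = NODE_MEMORY_GB * MEMORY_USAGE_RATIO
    nodes = math.ceil(total_gb / usable_per_node_gb)      # round up to whole nodes
    return one_copy_gb, nodes


if __name__ == "__main__":
    one_copy_gb, nodes = estimate_cluster()
    print(f"memory for one copy: {one_copy_gb:.0f} GB, nodes needed: {nodes}")
```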
Is there a way to reduce the memory footprint of Aerospike for this kind of use case? I have thought about putting several JSON documents into one big bin map to create a sort of sub-key bucket (sub-keys grouped by URL domain name, for example), but I cannot see how to retrieve a document without doing too many operations.
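To make the bucket idea concrete, here is a rough sketch of the mapping I have in mind. It is not Aerospike code: a plain Python dict stands in for the cluster, and the function names and the choice of a truncated SHA-1 as the sub-key are assumptions made only for illustration. Each record key is the URL's domain name, and its value is a map from sub-key to document.

```python
import hashlib
from urllib.parse import urlsplit

# Stand-in for the cluster: bucket key -> {sub-key: document}.
# In Aerospike this would be one record per bucket holding a map bin (assumption).
store = {}


def bucket_and_subkey(url):
    """Derive (bucket_key, sub_key) from a URL (hypothetical scheme).

    The bucket key is the domain name; the sub-key is a short hash of the full URL.
    """
    domain = urlsplit(url).hostname or ""
    sub_key = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return domain, sub_key


def put(url, timestamp, status):
    """Store one document inside the bucket of its domain."""
    bucket, sub_key = bucket_and_subkey(url)
    store.setdefault(bucket, {})[sub_key] = {
        "url": url,
        "timestamp": timestamp,
        "status": status,
    }


def get(url):
    """Fetch one document: one bucket lookup, then one lookup inside the map."""
    bucket, sub_key = bucket_and_subkey(url)
    return store.get(bucket, {}).get(sub_key)


if __name__ == "__main__":
    put("http://example.com/a", 1400000000, True)
    put("http://example.com/b", 1400000100, False)
    print(len(store), get("http://example.com/b"))
```

With this layout the number of primary keys drops from one per URL to one per domain, which is where the memory saving would come from. What I do not know is whether the lookup inside the map can be done on the server without reading the whole bucket back to the client.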
Thanks for your help. Antoine