Max limits of records in a server


#1

Hey,

I’m new to Aerospike and am considering using it for our project. We collect several hundred billion records in a year’s time.

If I understood the documentation correctly, an Aerospike server is limited to a record count determined by its RAM.

Is that per namespace or for all namespaces? Let’s say my server has 64GB of RAM. Will it only be able to contain roughly 1 billion records?

I’m asking because the hash of the primary key was said to be unique per namespace. So is the limit per namespace or global?

Best regards,


#2

There’s no artificial limit on the number of records. You set the storage type for each namespace, so you can have one namespace store everything in RAM while another stores everything on SSD. Indexes are always in RAM.

If you’re storing everything in RAM, then you’ll need to calculate the size of your records + index entry to figure out the rough number of records you can store. A cluster can increase your capacity (by adding more servers), but keep in mind the replication factor (how many copies of your data) that’s used.
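A back-of-the-envelope version of that calculation, assuming the 64-byte primary-index entry discussed in this thread and a hypothetical average record size (illustrative numbers only, not official sizing):

```python
# Rough in-memory capacity sketch for one node (illustrative, not official sizing).
INDEX_ENTRY_BYTES = 64          # primary-index entry per record (always in RAM)
avg_record_bytes = 200          # hypothetical average record size (data also in RAM)
ram_bytes = 64 * 1024**3        # a 64 GiB node

per_record_bytes = INDEX_ENTRY_BYTES + avg_record_bytes
max_records = ram_bytes // per_record_bytes
print(max_records)              # roughly how many records fit before other overhead
```

With records stored on SSD instead, only the 64-byte index entry counts against RAM, which is where the "1 billion records per 64GB" estimate comes from.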


#3

According to the documentation, ALL primary keys are kept in RAM, whether the namespace is set to SSD storage or RAM.

Each index entry is 64 bytes; that’s why for 64GB I assume a hard limit of 1 billion record entries. I’m just wondering if that’s really the limit, or if there is more hope, since the key only has to be unique per namespace.


#4

Yes, each index entry is 64 bytes, so 64GB of memory will hold 1 billion key entries max globally.

The reality will be less, depending on whether you also store your records in RAM, how big your cluster is, the replication factor, and how much memory you have allocated for data (compared to the OS + software + other overhead).

However, you can use the bins of a record to store multiple values for each key, and expand that way if you need to store more items.


#5

It will work like this:

1 server * 64GB = max 1B objects
2 servers * 64GB = max 2B objects
…
10 servers * 64GB = max 10B objects

Note: this is without any replication. If you want 2 copies per record, divide the numbers above by 2. You also shouldn’t (and can’t) use the full 64GB; plan with a maximum of 3/4 of your memory. More info can be found at http://www.aerospike.com/docs/operations/plan/capacity/
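The scaling rule above can be sketched as a quick calculator. The 64-byte index entry, the replication divisor, and the 3/4 usable-memory factor follow the advice in this thread; treat it as a rough sketch, and use the capacity-planning docs linked above for real sizing:

```python
def max_objects(nodes, ram_gb_per_node, replication_factor=1, usable_fraction=0.75):
    """Rough cluster capacity for the primary index alone.

    Illustrative sketch only: 64 bytes per index entry, a configurable
    replication factor, and (per the advice above) only ~3/4 of RAM usable.
    """
    INDEX_ENTRY_BYTES = 64
    usable_bytes = nodes * ram_gb_per_node * 1024**3 * usable_fraction
    return int(usable_bytes // INDEX_ENTRY_BYTES // replication_factor)

# 10 nodes * 64GB each, 2 copies of every record, 3/4 of memory usable:
print(max_objects(10, 64, replication_factor=2))
```

Setting `usable_fraction=1.0` and `replication_factor=1` reproduces the idealized "1 server * 64GB = 1B objects" numbers above.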

But manigandham is right: you should group a “SET” of values into records. This would reduce metadata overhead a lot.
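To see why grouping helps, compare the primary-index overhead of one record per value versus one record holding a set of values (a sketch with made-up numbers; the grouping factor is hypothetical):

```python
# Index overhead: one record per value vs. grouped records (illustrative numbers).
INDEX_ENTRY_BYTES = 64
total_values = 100_000_000_000      # e.g. 100 billion individual values
values_per_record = 100             # hypothetical grouping factor

index_bytes_flat = total_values * INDEX_ENTRY_BYTES
index_bytes_grouped = (total_values // values_per_record) * INDEX_ENTRY_BYTES

print(index_bytes_flat // 1024**3, "GiB vs", index_bytes_grouped // 1024**3, "GiB")
```

Grouping 100 values per record shrinks the index footprint by the same factor of 100, since only record keys (not individual values) consume index entries.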


#6

Unfortunately, while each record could indeed hold many bins with different values, those records would then need to be searched by their inner values. That requires secondary indexes, and those too are kept only in memory.

Not sure if that will help or not.


#7

Well, that depends a lot on your data:

Are those bin values unique? Then there’s no way around 64 bytes/key with Aerospike.

Are their values shared among many “records”? If so, you could also get around the secondary indexes by having a record for each such value, each holding a large ordered list of the primary keys with that value (e.g. a common value like operating system or browser agent). You can then query by looking up that list and, in a second step, batch-get all records by primary key that you are interested in. Whether that is acceptable in terms of query latency, you need to decide.
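A minimal sketch of that pattern, using plain Python dicts as a stand-in for namespaces (the real thing would use Aerospike records and a batch read; all names here are hypothetical):

```python
# Toy model: a namespace as dict of primary key -> record (dict of bins).
records = {
    "u1": {"browser": "firefox", "visits": 12},
    "u2": {"browser": "chrome",  "visits": 3},
    "u3": {"browser": "firefox", "visits": 7},
}

# One extra record per shared value, holding an ordered list of primary keys.
# This replaces a secondary index on the "browser" bin.
value_index = {
    "browser:firefox": ["u1", "u3"],
    "browser:chrome":  ["u2"],
}

def query_by_value(value_key):
    keys = value_index.get(value_key, [])   # step 1: look up the key list
    return [records[k] for k in keys]       # step 2: batch-get by primary key

hits = query_by_value("browser:firefox")
print(len(hits))
```

The trade-off is the two round trips (list lookup, then batch get) plus keeping the key lists up to date on writes, which is the latency question raised above.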

I would say that data modelling is crucial to success and costs with any NoSQL database.


#8

Manigandham is exactly correct on the amount of RAM used by Aerospike per primary index entry. This is a primary factor in the correct sizing of a cluster.

ManuelSchmidt is right on the money about data modeling being crucial. You may not need 100 billion records, depending on your data model. A primary key in Aerospike points to a record, which can contain one or more fields called bins, so you can store more than one value per key if that makes sense for your use case. He is also 100% correct on the use of secondary indexes: they give you a “query” capability, but at a longer latency.

What you need to do is understand exactly what your read/write throughput and latency need to be to meet your SLA, and exactly what you need to store to satisfy your business transactions. Then model the data to suit those needs.

One tip on modeling for NoSQL: stop thinking in rows, columns and normal form - denormalize.

The data model will be closer in structure to your heap variables (object graph or data structures) than rows and columns. Think of Aerospike as a huge associative array or hash table and then model accordingly.

I hope this helps

Peter


#9

To be 100% clear on this topic, though: there is actually a limit on the number of records that can be held in the index per namespace, per node. See: http://www.aerospike.com/docs/guide/FAQ.html. The addressing space for records is 4 bytes (out of the 64 bytes in the index entry):

The maximum number of records per namespace on a given node is limited to 4,294,967,296 (2^32 due to 4 bytes used for storing references). This represents 256GiB (4,294,967,296 * 64 bytes).
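The arithmetic in that FAQ quote checks out, and it ties the per-namespace, per-node record cap back to the 64-byte index entry:

```python
# Per-namespace, per-node index limit from the FAQ quote above.
max_records = 2**32                 # 4-byte record references
INDEX_ENTRY_BYTES = 64
index_bytes = max_records * INDEX_ENTRY_BYTES
print(max_records, index_bytes // 1024**3)  # 4294967296 records, 256 GiB of index
```

So even with more than 256GiB of RAM per node, a single namespace on that node cannot index more than ~4.29 billion records; you would need additional namespaces or nodes to go beyond that.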