I had previously used Cassandra, and in that, it was advised not to use a secondary index on high cardinality columns. Does the same suggestion apply to Aerospike as well.
In my set, I have a bin which will have either 0 or 1 value. Basically 2 bins of the set(along with the PK) will be involved in my API.
Bin 1 contains the flag(integer) on which I want to use the secondary index.
Bin 2 contains the data(data type is bytes). I’m storing C++ structure in it.
Basically, the use case is that I have to update the data in Bin 2(if it meets certain conditions). The purpose of the flag is to distinguish the records containing data in Bin 2 with the records which don’t have any data(Please note that the number of records in the set is very high.)
So, my API has to fetch the PK and Bin 2 of only those records whose flag is set. After that I’ll be doing further processing. Basically I’ll be updating maximum 2 bins in this API(not the whole record).
Is it fine if I index Bin 1? Are there any other performance optimal recommendations?
I think we need more information to be able to tell you if this is a good option or not.
Can you provide more details around the use case and avg/max cardinalities/data sizes/calls per second/performance requirements/etc…?
The size of 1 record in this particular set will be ~7KB .
Basically my set will be having >100 Million records. The bin in which I’m interested in, will have a value, either 0 or 1 for every record. In my client API, I want to fetch the PK of all records having a particular value for that bin. The performance requirements from this particular API will not be high(Basically, I want my API to run successfully and expect to get a return value under 20 seconds.)
I want to fetch the PK of the set, using the secondary index, which is just a string of 20 characters. This API will not run very frequently, only occasionally.
Oh. You may not even need a secondary index if this is just for deletion. A background UDF would be perfect for this kind of thing, or applying a delete UDF in a query… Do you need to get the data back and do anything with it? I assume the 1 or 0 means processed or not? Are there other criteria? What is ‘occasionally’? Every 20ms? 20minutes? 20 days? Why don’t you delete the record when you mark it as a ‘1’ or ‘0’ ?
The answer to your question though is yes. It will work. But I’d like to understand more to know if this is what really should be done or if there is another feature or method we can leverage. Secondary indexes will cost you in terms of RAM especially, and they do not scale as well as some other methods… so there are some potentially other more optimal means of doing this… maybe.
I’m still interested to understand more about the flow… What do you expect to be the average/peak records returned by this query? If you’re getting 7K records back * 100million thats 700GB, so your client machine volume has to be considered too…
7 KB is the size of whole record. The record consists of 15 bins. In this specific API only 2 bins, along with the PK, are of my interest. Size of these 2 bins will be less than 500 bytes. I don’t have the exact average figure but it will be more than atleast 10 million.
When you are fetching these records and modifying the blob, is the operation the same in that moment for all records?
ex… fetch all records with ‘1’ then add ‘xyz’ to the blob?
No. I’ll first fetch Bin 2 of all those records whose value in Bin 1 is 1 and then only those records which fulfill certain conditions will be updated.
What kind of conditions? Aerospike can use UDF’s which can utilize C shared objects to perform operations on data without having to return to/from the client, which might work. It has huge processing potential… but of course if your app is doing special logic with libraries that can’t be ported or making external calls, things can get a little weird. https://www.aerospike.com/docs/udf/udf_guide.html Again, I really don’t understand your full use case so just throwing things out there.