When not to use a secondary Index

vishal14101993 · June 28, 2018, 12:13pm

I had previously used Cassandra, and in that, it was advised not to use a secondary index on high cardinality columns. Does the same suggestion apply to Aerospike as well.

In my set, I have a bin which will have either 0 or 1 value. Basically 2 bins of the set(along with the PK) will be involved in my API.

Bin 1 contains the flag(integer) on which I want to use the secondary index. Bin 2 contains the data(data type is bytes). I’m storing C++ structure in it.

Basically, the use case is that I have to update the data in Bin 2(if it meets certain conditions). The purpose of the flag is to distinguish the records containing data in Bin 2 with the records which don’t have any data(Please note that the number of records in the set is very high.)

So, my API has to fetch the PK and Bin 2 of only those records whose flag is set. After that I’ll be doing further processing. Basically I’ll be updating maximum 2 bins in this API(not the whole record).

Is it fine if I index Bin 1? Are there any other performance optimal recommendations?

Albot · June 29, 2018, 8:25am

I think we need more information to be able to tell you if this is a good option or not. Can you provide more details around the use case and avg/max cardinalities/data sizes/calls per second/performance requirements/etc…?

vishal14101993 · June 29, 2018, 9:43am

The size of 1 record in this particular set will be ~7KB .

Basically my set will be having >100 Million records. The bin in which I’m interested in, will have a value, either 0 or 1 for every record. In my client API, I want to fetch the PK of all records having a particular value for that bin. The performance requirements from this particular API will not be high(Basically, I want my API to run successfully and expect to get a return value under 20 seconds.)

Albot · June 29, 2018, 9:47am

How often will you query? How much data do you expect to be returned on average and on max?

vishal14101993 · June 29, 2018, 9:49am

I want to fetch the PK of the set, using the secondary index, which is just a string of 20 characters. This API will not run very frequently, only occasionally.

Albot · June 29, 2018, 10:10am

Oh. You may not even need a secondary index if this is just for deletion. A background UDF would be perfect for this kind of thing, or applying a delete UDF in a query… Do you need to get the data back and do anything with it? I assume the 1 or 0 means processed or not? Are there other criteria? What is ‘occasionally’? Every 20ms? 20minutes? 20 days? Why don’t you delete the record when you mark it as a ‘1’ or ‘0’ ?

Albot · June 29, 2018, 10:31am

The answer to your question though is yes. It will work. But I’d like to understand more to know if this is what really should be done or if there is another feature or method we can leverage. Secondary indexes will cost you in terms of RAM especially, and they do not scale as well as some other methods… so there are some potentially other more optimal means of doing this… maybe.

Albot · June 29, 2018, 10:35am

I’m still interested to understand more about the flow… What do you expect to be the average/peak records returned by this query? If you’re getting 7K records back * 100million thats 700GB, so your client machine volume has to be considered too…

vishal14101993 · June 29, 2018, 10:41am

7 KB is the size of whole record. The record consists of 15 bins. In this specific API only 2 bins, along with the PK, are of my interest. Size of these 2 bins will be less than 500 bytes. I don’t have the exact average figure but it will be more than atleast 10 million.

Albot · June 29, 2018, 10:52am

When you are fetching these records and modifying the blob, is the operation the same in that moment for all records? ex… fetch all records with ‘1’ then add ‘xyz’ to the blob?

vishal14101993 · June 29, 2018, 10:54am

No. I’ll first fetch Bin 2 of all those records whose value in Bin 1 is 1 and then only those records which fulfill certain conditions will be updated.

Albot · June 29, 2018, 11:42am

What kind of conditions? Aerospike can use UDF’s which can utilize C shared objects to perform operations on data without having to return to/from the client, which might work. It has huge processing potential… but of course if your app is doing special logic with libraries that can’t be ported or making external calls, things can get a little weird. https://www.aerospike.com/docs/udf/udf_guide.html Again, I really don’t understand your full use case so just throwing things out there.

vishal14101993 · June 29, 2018, 12:04pm

Thanks for your support. I’ve not explored UDF till now. I think it might be helpful in my case. Will definitely explore it now.

Albot · June 29, 2018, 12:14pm

You can also utilize UDF in conjunction with a secondary index… to get potentially even more throughput. But with a high number of objects like that it will be expensive in terms of memory. https://www.aerospike.com/docs/operations/plan/capacity/secondary_indexes.html

Test a few things out and see what works best for you. Maybe someone else will chime in too. I still think it would be nice to know your use case more

Topic		Replies	Views
Do we need to create Indexes for all bins that i need to search on?	2	1149	February 26, 2015
Secondary index on bin with type map object Query & Indexing secondary , index	1	1852	September 10, 2014
Secondary Index query performance Query & Indexing query , scan , index	7	183	April 11, 2024
Secondary index with low cardinality or huge fanout	2	388	March 26, 2024
Querying by partition Query & Indexing	0	1398	February 17, 2016

When not to use a secondary Index

Related topics