Secondary index on huge number of records

Rafiq_Ahmed · January 25, 2018, 9:51am

Hi,

Is it advisable creating secondary index on the set having huge number of records (like 30 million records, and all records having the bin for which I want to create the secondary index) ?

If it is not the good option, what I should do to query on that particular bin?

Thanks Rafiq

Albot · January 25, 2018, 10:57pm

Generally, a reference set is the preferred option as it will scale well and perform more consistently under migrations. A reference set is a set that only contains the values which you need to get your PK of the target set. This method does require writing twice and reading twice, but has benefits.

ex…

setProducts: TeddyBear,13.40,StuffedAnimal
setProductRef: StuffedAnimal,Map{TeddyBear,FluffyKittens}

I hope that makes sense… I think we could be more helpful if you could describe more of your use case.

What is the use case?
How is the data structured?
What is the maximum “many” (cardinality) in the one-to-many relationship?
What is the maximum number of records you need to store?
How many records are added per second?

Rafiq_Ahmed · January 29, 2018, 1:35pm

Thanks Albot.

Now, my question is, in which aspects creating a reference set is the best option rather creating the secondary index? How the secondary index will drop the performance? Please explain.

I’ll provide the information that you want, if needed after seeing your reply

Thanks in advance.

Albot · January 29, 2018, 11:09pm

Well it all depends on your requirements. Secondary index will definitely work, but the memory will be more expensive and like I said may not scale as well.

Rafiq_Ahmed · January 30, 2018, 6:32am

Basically, if we replace the secondary index with the reference set (which will have a primary index), that would be less expensive than having the secondary index? Does secondary index occupy more memory than the primary index?

Albot · January 31, 2018, 1:03am

Potentially. You need to go through the sizing document. https://www.aerospike.com/docs/operations/plan/capacity

That’s not the only drawback though, as i mentioned - reference sets are more reliable during maintenance and typically perform better at high volumes >1M depending on the use case, notably write volume and cardinality.

rbotzer · January 31, 2018, 6:07am

@Rafiq_Ahmed the secondary index will usually consume less memory than the reference set, though you have to consult the capacity planning guide. The reference is a separate record, consuming 64B of metadata in DRAM (as does every record). The advantage of it, as @Albot pointed out, is much higher performance (especially at scale) and accuracy (when the cluster size changes and data is migrating).

It’s yet another speed vs. space tradeoff - you need more storage (DRAM and SSD) for the references, but you get higher performance by converting a scan/query to a small sequence of key-value operation.

Rafiq_Ahmed · January 31, 2018, 2:33pm

Thanks @Albot and @rbotzer

Topic		Replies	Views
Secondary index vs reference set	4	671	November 9, 2022
Swaping memory to disk with large datasets	3	1040	October 9, 2017
When not to use a secondary Index index	13	1485	June 29, 2018
Capacity Planning - indexes Planning secondary , primary , index	7	2797	May 4, 2015
Secondary Index - Additional data storage	3	801	October 20, 2021

Secondary index on huge number of records

Related topics