Secondary index on huge number of records




Is it advisable creating secondary index on the set having huge number of records (like 30 million records, and all records having the bin for which I want to create the secondary index) ?

If it is not the good option, what I should do to query on that particular bin?

Thanks Rafiq


You can do it, but it is typically not advised to create secondary index for sets over 1M records unless absolutely required. Generally, a reference set is the preferred option as it will scale well and perform more consistently under migrations. A reference set is a set that only contains the values which you need to get your PK of the target set. This method does require writing twice and reading twice, but has benefits.


  • setProducts: TeddyBear,13.40,StuffedAnimal
  • setProductRef: StuffedAnimal,Map{TeddyBear,FluffyKittens}

I hope that makes sense… I think we could be more helpful if you could describe more of your use case.

  • What is the use case?
  • How is the data structured?
  • What is the maximum “many” (cardinality) in the one-to-many relationship?
  • What is the maximum number of records you need to store?
  • How many records are added per second?


Thanks Albot.

Now, my question is, in which aspects creating a reference set is the best option rather creating the secondary index? How the secondary index will drop the performance? Please explain.

I’ll provide the information that you want, if needed after seeing your reply :slight_smile:

Thanks in advance.


Well it all depends on your requirements. Secondary index will definitely work, but the memory will be more expensive and like I said may not scale as well.


Basically, if we replace the secondary index with the reference set (which will have a primary index), that would be less expensive than having the secondary index? Does secondary index occupy more memory than the primary index?


Potentially. You need to go through the sizing document.

That’s not the only drawback though, as i mentioned - reference sets are more reliable during maintenance and typically perform better at high volumes >1M depending on the use case, notably write volume and cardinality.


@Rafiq_Ahmed the secondary index will usually consume less memory than the reference set, though you have to consult the capacity planning guide. The reference is a separate record, consuming 64B of metadata in DRAM (as does every record). The advantage of it, as @Albot pointed out, is much higher performance (especially at scale) and accuracy (when the cluster size changes and data is migrating).

It’s yet another speed vs. space tradeoff - you need more storage (DRAM and SSD) for the references, but you get higher performance by converting a scan/query to a small sequence of key-value operation.


Thanks @Albot and @rbotzer