Size of the primary index

The following document

says

Primary Index

Calculated via:

64 bytes × (replication factor) × (number of records)

Could you please shed some light on why 64 bytes are needed per record? Is there any flexibility to allow for shorter hashes?

Thank you for your help

Each primary index entry stores the following information:

  • Digest
  • void_time
  • generation
  • Tree related metadata
  • Pointer to data in memory if data is in memory
  • Location of data on storage disk.

All of this adds up to 64 bytes. The entry is also kept at 64 bytes because that matches the cache line size, which makes cache misses much more manageable.

No, there is no flexibility right now.

– R

mlabour,

I am assuming that the need is to reduce the memory footprint of the primary index.

If not, could you please elaborate on what other benefit you are looking for?

– R

Yes, you are correct.

As we are sizing the cluster, we are looking at the RAM requirements for the index.

Question: If we scale out by adding machines, does the size of the index per machine decrease? In other words, if I have one machine with an index of 256 GB, does adding another machine reduce the size of the index per machine by a factor of 2?

Thank you for your help

If you are using replication factor 1, going from 1 node to 2 nodes would require half the index space per node. However, primary index slabs are never freed, only reused. If you add a second node and do not expect the total number of records to grow, the original node's primary index will keep occupying twice the RAM it now needs. That memory can be reclaimed with a cold restart of the daemon.

For replication factor 2: if you go from a single-node cluster to 2 nodes, each node will need the same amount of index space that the original node had as a single-node cluster. To reduce the RAM required for the index per node by a factor of 2 in a replication factor 2 environment, you would need 4 nodes.
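To make the arithmetic above concrete, here is a rough sizing sketch in Python. It is not an official Aerospike tool. The function name and the constant are made up for illustration, and it assumes records are spread evenly across nodes and that a cluster cannot hold more copies of a record than it has nodes. It also ignores the slab-reuse effect described above, so it gives the RAM the index needs, not what an already-running node may still be holding.

```python
# Hypothetical helper, not part of any Aerospike API.
INDEX_ENTRY_BYTES = 64  # per-record primary index entry size, from the thread


def index_bytes_per_node(records, replication_factor, nodes):
    """Estimate primary index RAM needed on each node.

    Assumes an even distribution of records and that the effective
    number of copies is capped by the number of nodes.
    """
    effective_rf = min(replication_factor, nodes)
    total_bytes = INDEX_ENTRY_BYTES * effective_rf * records
    return total_bytes / nodes


GIB = 1024 ** 3
records = 4 * GIB  # 4 Gi records * 64 bytes = 256 GiB of index

for rf, nodes in [(1, 1), (1, 2), (2, 1), (2, 2), (2, 4)]:
    size_gib = index_bytes_per_node(records, rf, nodes) / GIB
    print(f"RF={rf} nodes={nodes}: {size_gib:.0f} GiB per node")
```

Under these assumptions the numbers match the answer above: with replication factor 2, a 2-node cluster needs the same index space per node as the single node did, and it takes 4 nodes to halve it.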