The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.
FAQ - What are sprigs
An Aerospike cluster has 4096 partitions per namespace, and keeps a copy of each per replica, based on the replication-factor configuration. As the number of objects in the namespace increases, the time to traverse and access elements of the primary index increases. Refer to the primary index architecture page for details.
At the lowest level, Aerospike uses a red-black in-memory tree structure per partition. The more records, the more levels in the tree (depth).
There are two distinct ways this can impact the performance:
- Tree traversal - On a node with high throughput (> 1 million) and a large number of records (> 1 million per partition), the time it takes to traverse the index tree to find a record (to read or update) can become a bottleneck. If the tree is 20 or more levels deep, the tree search latency can become most of the transaction’s latency.
- Lock contention - On a node with even modest throughput, if there is a large number of records, reducing an index tree can take a very long time (over 1 second). While reads and overwrites don’t contend with the reduce lock, creates and deletes do. Even at a small create/delete throughput, a one second blockage of all creates and deletes on a single partition can quickly block all (or most) transaction threads, holding up all other transactions (even if those would not themselves have been blocked). This can be exacerbated when an nsup cycle generates a lot of deletes (expirations/evictions) while scans or migrations are in progress (which also require the partition reduce lock).
A general answer to both these problems is to divide a partition’s index tree into multiple sub-trees (or ‘sprigs’) to reduce tree depth/size and thus shorten search/traversal time; and also to support the presence of multiple tree/reduce lock pairs per partition, with each lock applying to one or more sub-trees, to reduce lock contention.
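To get a feel for why sprigs shorten traversal, the rough Python sketch below estimates how many levels a lookup may have to descend, with and without sprigs. It assumes records spread evenly across sprigs and uses the general bounds for a red-black tree with n nodes (at least ceil(log2(n + 1)) levels, at most about twice that). The function names are illustrative and not part of Aerospike.

```python
def min_tree_levels(records: int) -> int:
    """Fewest levels a binary tree needs to hold `records` nodes: ceil(log2(n + 1))."""
    return records.bit_length()


def max_red_black_levels(records: int) -> int:
    """A (slightly loose) upper bound on red-black tree height: about 2 * log2(n + 1)."""
    return 2 * min_tree_levels(records)


def levels_per_sprig(records_per_partition: int, sprigs: int) -> tuple[int, int]:
    """Return (best case, worst case) levels for one sprig of a partition."""
    # Assumption: records are spread evenly, so each sprig holds about n / sprigs records.
    per_sprig = -(-records_per_partition // sprigs)  # ceiling division
    return min_tree_levels(per_sprig), max_red_black_levels(per_sprig)


for sprigs in (1, 256, 4096):
    low, high = levels_per_sprig(1_000_000, sprigs)
    print(f"{sprigs:>5} sprigs: {low} to {high} levels")
```

With 1 million records in a partition, a single tree needs at least 20 levels, while 256 sprigs bring the estimate down to about 12 and 4096 sprigs to about 8.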
Server version 4.2 and above
Server version 4.2 improves storage efficiency and speed. As part of this:
- The partition-tree-locks configuration was deprecated and now has its value fixed at 256 per partition.
- The minimum value for partition-tree-sprigs was increased from 64 to 256, which is also the default. The maximum allowed value was also increased from 4K to 256M, although the vast majority of use cases will likely not benefit much from values above 16K, or the associated memory overhead may not justify it. Enterprise Licensees should contact Aerospike Support before considering raising this to much higher values.
What is the overhead for those configuration parameters?
The namespace memory overhead per-node can be determined like this:
- A fixed base size of 64K.
- 8M for partition-tree-locks, times the replication factor, divided by the number of nodes.
- 8B per sprig in each of the 4096 partitions (8B x partition-tree-sprigs x 4096), times the replication factor, divided by the number of nodes.
- The Enterprise Edition requires an additional 5B per sprig in each partition (5B x partition-tree-sprigs x 4096), times the replication factor, divided by the number of nodes.
The memory overhead related to sprigs (shown below) is now spread throughout the cluster. That is, sprigs are only allocated for partitions that are owned by a node. Therefore, as a cluster gets bigger, the overhead per node decreases.
NOTE: The table below is for academic purposes. Workloads potentially requiring a high number of sprigs (higher than 32K) should be rare. Enterprise Edition licensees should contact Aerospike support for guidance. Even if the memory overhead seems acceptable, configuring too many sprigs may not only provide no benefits, but could actually adversely affect a cluster:
- A sub-cluster would have to accommodate all the sprigs that were in the larger cluster (unless min-cluster-size has been configured to prevent the formation of such a sub-cluster).
- The memory required would also have to be contiguous (fragmented memory may prevent the allocation).
- Having too many sprigs on a node could delay shutdown and cause an unnecessary cold restart on the subsequent restart.
Community Edition:
|------------------------++------------------------|
| partition-tree-sprigs  || Memory Size            |
|========================++========================|
| 256                    || 8MB                    |
|------------------------++------------------------|
| 512                    || 16MB                   |
|------------------------++------------------------|
| 1024                   || 32MB                   |
|------------------------++------------------------|
| 2048                   || 64MB                   |
|------------------------++------------------------|
| 4096                   || 128MB                  |
|------------------------++------------------------|
| 8K                     || 256MB                  |
|------------------------++------------------------|
| 16K                    || 512MB                  |
|------------------------++------------------------|
| ...                    || ...                    |
|------------------------++------------------------|
| 256M                   || 8TB                    |
|------------------------++------------------------|
Enterprise Edition adds extra overhead to support fast restart:
|------------------------++------------------------|
| partition-tree-sprigs  || Memory Size            |
|========================++========================|
| 256                    || 5MB                    |
|------------------------++------------------------|
| 512                    || 10MB                   |
|------------------------++------------------------|
| 1024                   || 20MB                   |
|------------------------++------------------------|
| 2048                   || 40MB                   |
|------------------------++------------------------|
| 4096                   || 80MB                   |
|------------------------++------------------------|
| 8K                     || 160MB                  |
|------------------------++------------------------|
| 16K                    || 320MB                  |
|------------------------++------------------------|
| ...                    || ...                    |
|------------------------++------------------------|
| 256M                   || 5TB                    |
|------------------------++------------------------|
Examples
Community, cluster size 1 (effective replication factor 1), partition-tree-sprigs 256: 64K + 8M + 8M ~= 16.06M
Enterprise, cluster size 1 (effective replication factor 1), partition-tree-sprigs 256: 64K + 8M + 8M + 5M ~= 21.06M
Community, cluster size 8, replication factor 2, partition-tree-sprigs 256: 64K + 2M + 2M ~= 4.06M
Enterprise, cluster size 8, replication factor 2, partition-tree-sprigs 256: 64K + 2M + 2M + 1.25M ~= 5.31M
Enterprise, cluster size 32, replication factor 2, partition-tree-sprigs 256: 64K + 512K + 512K + 0.3M ~= 1.37M
Community, cluster size 1 (effective replication factor 1), partition-tree-sprigs 4096: 64K + 8M + 128M ~= 136.06M
Enterprise, cluster size 1 (effective replication factor 1), partition-tree-sprigs 4096: 64K + 8M + 128M + 80M ~= 216.06M
Community, cluster size 8, replication factor 2, partition-tree-sprigs 4096: 64K + 2M + 32M ~= 34.06M
Enterprise, cluster size 8, replication factor 2, partition-tree-sprigs 4096: 64K + 2M + 32M + 20M ~= 54.06M
Enterprise, cluster size 32, replication factor 2, partition-tree-sprigs 4096: 64K + 512K + 8M + 5M ~= 13.56M
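The arithmetic in these examples can be reproduced with a small Python sketch. It is only an illustration of the sizing rules listed above, not an official tool. The function name is hypothetical, the factor of 4096 partitions is inferred from the tables (256 sprigs giving 8MB), and capping the replication factor at the cluster size is an assumption made so that single-node examples work out.

```python
KIB = 1024
MIB = 1024 * KIB
PARTITIONS = 4096  # partitions per namespace


def overhead_mib_4_2(sprigs, replication_factor, cluster_size, enterprise=False):
    """Approximate per-node namespace overhead in MiB for server 4.2 and above."""
    # Assumption: a cluster cannot hold more copies of a partition than it has nodes.
    effective_rf = min(replication_factor, cluster_size)
    share = effective_rf / cluster_size  # portion of the totals carried by one node

    base = 64 * KIB                                   # fixed base size
    locks = 8 * MIB * share                           # partition-tree-locks (fixed at 256)
    sprig_bytes = 8 * sprigs * PARTITIONS * share     # 8B per sprig per partition
    fast_restart = 5 * sprigs * PARTITIONS * share if enterprise else 0  # Enterprise only
    return (base + locks + sprig_bytes + fast_restart) / MIB


print(round(overhead_mib_4_2(256, 2, 8), 2))
print(round(overhead_mib_4_2(4096, 2, 32, enterprise=True), 2))
```

The two calls correspond to the Community example with 8 nodes and 256 sprigs (about 4.06M) and the Enterprise example with 32 nodes and 4096 sprigs (about 13.56M).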
Server version prior to 4.2
Prior to version 3.11, there was only one index tree per partition, with one pair of locks (a regular lock and a reduce lock). Operations requiring the reduction (traversal) of the index (nsup, scans and migrations), as well as operations that add or remove elements from the index (record creation/deletion), all need to acquire those locks. On modern systems it typically takes around a second to reduce 4 million records, so a partition with more than 4 million records could take over a second to be fully reduced, impacting other operations that require the lock on the same partition.
The following details apply only to server versions from 3.11 up to, but not including, 4.2.
Version 3.11 introduced two parameters, partition-tree-sprigs and partition-tree-locks, to reduce the traversal depth and to minimize lock contention during tree/sprig traversal.
The partition-tree-sprigs configuration parameter defines the number of sprigs per partition. The partition-tree-locks configuration parameter defines the number of lock pairs (tree lock and reduce lock) per partition.
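As a back-of-the-envelope illustration of why more lock pairs help, the Python sketch below estimates how long a single reduce lock might be held. It uses the rough figure quoted above (about 4 million records reduced per second) and assumes records are spread evenly across lock pairs and that reduce time scales linearly. Both are simplifications, and the function name is hypothetical.

```python
RECORDS_REDUCED_PER_SECOND = 4_000_000  # rough figure quoted above for modern systems


def reduce_hold_seconds(records_per_partition: int, lock_pairs: int = 1) -> float:
    """Rough time one reduce lock is held while its share of a partition is reduced."""
    # Assumption: each lock pair covers an equal share of the partition's records.
    records_per_lock = records_per_partition / lock_pairs
    return records_per_lock / RECORDS_REDUCED_PER_SECOND


for lock_pairs in (1, 8, 256):
    seconds = reduce_hold_seconds(8_000_000, lock_pairs)
    print(f"{lock_pairs:>3} lock pair(s): about {seconds:.4f}s per lock")
```

Under these assumptions, a partition with 8 million records would block creates and deletes for about 2 seconds with a single lock pair, but only about a quarter of a second per lock with 8 lock pairs.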
What is the overhead for those configuration parameters?
The namespace memory overhead per-node can be determined like this:
- A fixed base size of 64K.
- For versions prior to 3.15.1, 320K per lock pair (partition-tree-locks). For 3.15.1+, the overhead decreased by a factor of 10 (i.e. 32K per lock pair).
- 1M per 16 partition-tree-sprigs.
- The Enterprise Edition also requires an extra 320K per 16 partition-tree-sprigs to support fast restart.
You can also use the following tables to look up the memory overhead:
|------------------------++-----------------------------------------------|
|                        || memory overhead                               |
|                        ||-----------------------------------------------|
| partition-tree-locks   || 3.11 - 3.15.0          | 3.15.1+              |
|========================++========================+======================|
| 1                      || 320K                   | 32K                  |
|------------------------++------------------------+----------------------|
| 2                      || 640K                   | 64K                  |
|------------------------++------------------------+----------------------|
| 4                      || 1.25M                  | 128K                 |
|------------------------++------------------------+----------------------|
| 8                      || 2.5M                   | 256K                 |
|------------------------++------------------------+----------------------|
| 16                     || 5M                     | 512K                 |
|------------------------++------------------------+----------------------|
| 32                     || 10M                    | 1M                   |
|------------------------++------------------------+----------------------|
| 64                     || 20M                    | 2M                   |
|------------------------++------------------------+----------------------|
| 128                    || 40M                    | 4M                   |
|------------------------++------------------------+----------------------|
| 256                    || 80M                    | 8M                   |
|------------------------++------------------------+----------------------|
|------------------------++-----------------------------------------------|
|                        || memory overhead                               |
|                        ||-----------------------------------------------|
| partition-tree-sprigs  || Community version      | Enterprise version   |
|========================++========================+======================|
| 16                     || 1M                     | +320K                |
|------------------------++------------------------+----------------------|
| 32                     || 2M                     | +640K                |
|------------------------++------------------------+----------------------|
| 64                     || 4M                     | +1.25M               |
|------------------------++------------------------+----------------------|
| 128                    || 8M                     | +2.5M                |
|------------------------++------------------------+----------------------|
| 256                    || 16M                    | +5M                  |
|------------------------++------------------------+----------------------|
| 512                    || 32M                    | +10M                 |
|------------------------++------------------------+----------------------|
| 1024                   || 64M                    | +20M                 |
|------------------------++------------------------+----------------------|
| 2048                   || 128M                   | +40M                 |
|------------------------++------------------------+----------------------|
| 4096                   || 256M                   | +80M                 |
|------------------------++------------------------+----------------------|
Examples
Community pre-3.15.1, partition-tree-locks 1, partition-tree-sprigs 16: 64K + 320K + 1M ~= 1.4M
Community 3.15.1+, partition-tree-locks 1, partition-tree-sprigs 16: 64K + 32K + 1M ~= 1.1M
Enterprise pre-3.15.1, partition-tree-locks 1, partition-tree-sprigs 16: 64K + 320K + 1M + 320K ~= 1.7M
All three show less overhead than versions prior to 3.11, which have a fixed overhead of 2M for the indexes.
This table can be used to look up the total memory overhead for some possible combination of locks and sprigs:
|------------------------++-----------------------------------------------------|
|    partition-tree-     ||       3.11 - 3.15.0       |         3.15.1+         |
|   locks   |   sprigs   ||  Community  | Enterprise  | Community | Enterprise  |
|===========+============++=============+=============+===========+=============|
| 1         | 16         || 1.4M        | 1.7M        | 1.1M      | 1.4M        |
|-----------+------------||-------------+-------------+-----------+-------------|
| 8         | 64         || 6.6M        | 7.8M        | 4.3M      | 5.6M        |
|-----------+------------||-------------+-------------+-----------+-------------|
| 16        | 128        || 13.1M       | 15.6M       | 8.6M      | 11.1M       |
|-----------+------------||-------------+-------------+-----------+-------------|
| 32        | 256        || 26.1M       | 31.1M       | 17.1M     | 22.1M       |
|-----------+------------||-------------+-------------+-----------+-------------|
| 64        | 512        || 52.1M       | 62.1M       | 34.1M     | 44.1M       |
|-----------+------------||-------------+-------------+-----------+-------------|
| 128       | 2048       || 168.1M      | 208.1M      | 132.1M    | 172.1M      |
|-----------+------------||-------------+-------------+-----------+-------------|
| 128       | 4096       || 296.1M      | 376.1M      | 260.1M    | 340.1M      |
|-----------+------------||-------------+-------------+-----------+-------------|
| 256       | 4096       || 336.1M      | 416.1M      | 264.1M    | 344.1M      |
|-------------------------------------------------------------------------------|
Some more examples for Enterprise before 3.15.1:
partition-tree-locks 8, partition-tree-sprigs 64: 64K + 2.5M + 4M + 1.25M ~= 7.8M
partition-tree-locks 8, partition-tree-sprigs 256: 64K + 2.5M + 16M + 5M ~= 23.6M
partition-tree-locks 16, partition-tree-sprigs 1024: 64K + 5M + 64M + 20M ~= 89.1M
partition-tree-locks 32, partition-tree-sprigs 4096, LDTs enabled: (64K + 10M + 256M + 80M) * 2 ~= 692.1M
Some more examples for Enterprise 3.15.1+:
partition-tree-locks 8, partition-tree-sprigs 256: 64K + 256K + 16M + 5M ~= 21.3M
partition-tree-locks 32, partition-tree-sprigs 4096: 64K + 1M + 256M + 80M ~= 337.1M
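The sizing rules for versions 3.11 up to 4.2 can be checked with a short Python sketch along these lines. It simply encodes the per-lock and per-sprig figures listed above. The function and parameter names are hypothetical, and the doubling applied when LDTs are enabled is not modelled.

```python
KIB = 1024
MIB = 1024 * KIB


def overhead_mib_pre_4_2(locks, sprigs, enterprise=False, v3_15_1_or_later=False):
    """Approximate namespace overhead in MiB for server versions 3.11 up to 4.2."""
    base = 64 * KIB                                          # fixed base size
    per_lock = 32 * KIB if v3_15_1_or_later else 320 * KIB   # 10x smaller from 3.15.1
    lock_bytes = locks * per_lock
    sprig_bytes = sprigs * MIB // 16                         # 1M per 16 sprigs
    fast_restart = sprigs * 320 * KIB // 16 if enterprise else 0  # Enterprise only
    return (base + lock_bytes + sprig_bytes + fast_restart) / MIB


# Enterprise before 3.15.1, 8 locks and 64 sprigs
print(round(overhead_mib_pre_4_2(8, 64, enterprise=True), 1))
# Enterprise 3.15.1+, 32 locks and 4096 sprigs
print(round(overhead_mib_pre_4_2(32, 4096, enterprise=True, v3_15_1_or_later=True), 1))
```

These two calls correspond to the 7.8M and 337.1M examples above.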
When should we increase sprigs and locks?
Upgrading to version 3.11 (or above) from an earlier version requires a cold restart, because the introduction of sprigs changed the DRAM index layout. The partition-tree-sprigs parameter is not dynamic, and changing it forces a cold start.
Configuring the right number of sprigs on a machine trades off a bit of memory overhead for faster partition tree traversal and reduced lock contention, which can translate into big performance gains for some specific workloads. For versions prior to 4.2, the default values of 8 partition-tree-locks and 64 partition-tree-sprigs are already a good improvement over pre-3.11 Aerospike releases.
Increasing partition-tree-locks helps when there is contention acquiring the tree lock, which, for individual transactions, impacts deletes and creates only.
Increasing partition-tree-sprigs reduces the depth to traverse when finding records in the partition tree. For namespaces that are not data-in-memory, though, disk I/O is likely to be much more impactful than the time to traverse the index in memory, so this would probably have little impact for such namespaces.
Since changing partition-tree-sprigs forces a cold start, it may be helpful, if memory is available, to raise it during the upgrade to the maximum value allowed by your server version, in order to avoid future cold restarts. It is always recommended to benchmark against your own usage.
Expect improvements when there is a large number of records per partition (> 1 million per partition).
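To relate this threshold to a namespace's total size, a quick Python check such as the one below can help. It assumes records are spread evenly over the 4096 partitions and that the count passed in is the number of unique records (not counting replica copies). The function names are illustrative only.

```python
PARTITIONS = 4096                    # partitions per namespace
RECORDS_PER_PARTITION_THRESHOLD = 1_000_000


def records_per_partition(unique_records: int) -> float:
    """Average records per partition, assuming an even spread across partitions."""
    return unique_records / PARTITIONS


def likely_to_benefit(unique_records: int) -> bool:
    """True when the average partition holds more than 1 million records."""
    return records_per_partition(unique_records) > RECORDS_PER_PARTITION_THRESHOLD


for total in (1_000_000_000, 5_000_000_000):
    print(total, round(records_per_partition(total)), likely_to_benefit(total))
```

In other words, the 1 million records per partition mark corresponds to roughly 4 billion unique records in the namespace.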
How should we change them?
Both parameters are static configuration, so a rolling restart is required. See their reference pages for the allowable range of values.
For server versions prior to 4.2, a good minimal guideline is to stay with the default of 8 partition-tree-locks until the cluster size exceeds 15, then double it each time the cluster size doubles (16 for cluster sizes 16 to 31, 32 for cluster sizes 32 to 63, etc). Indeed, the larger the cluster is, the fewer partitions each node will own, creating more potential contention on the locks. If memory is available, it should be fine to set both configuration parameters to the maximum allowed.
|--------------------++----------------------------+----------------------|
| cluster-size       || recommended tree-locks     | tree-sprigs          |
|====================++============================+======================|
| 1 ... 15           || 8+                         | 256+                 |
|--------------------++----------------------------+----------------------|
| 16 ... 31          || 16+                        | 256+                 |
|--------------------++----------------------------+----------------------|
| 32 ... 63          || 32+                        | 256+                 |
|--------------------++----------------------------+----------------------|
| 64 ... 127         || 64+                        | 256+                 |
|--------------------++----------------------------+----------------------|
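The doubling rule in this table can be written as a small Python helper. This is only a sketch of the minimal guideline for versions prior to 4.2. The function name is hypothetical, and the cap at 256 is an assumption based on the largest partition-tree-locks value shown in the overhead tables.

```python
def recommended_tree_locks(cluster_size: int) -> int:
    """Minimal partition-tree-locks guideline for server versions prior to 4.2."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    # Largest power of two that does not exceed the cluster size.
    largest_power_of_two = 1 << (cluster_size.bit_length() - 1)
    # Never go below the default of 8; cap at 256 (assumed maximum).
    return max(8, min(256, largest_power_of_two))


for size in (4, 15, 16, 31, 32, 64, 127):
    print(f"cluster size {size:>3}: at least {recommended_tree_locks(size)} locks")
```

The helper returns the lower bound only; as noted above, higher values are fine when memory allows.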
For server version 4.2 and above, partition-tree-locks is fixed at 256, but partition-tree-sprigs can be increased to at least 8K (which has the same overhead as 4K in older versions) and even 16K as memory permits. For values higher than this, it is always better to benchmark against the expected traffic load.
Notes
- For server versions prior to 4.2, partition-tree-locks cannot exceed partition-tree-sprigs.
- As always, we recommend testing the performance impact of those settings in development/staging environments first, prior to considering them for production.
Keywords
TREE LOCKS SPRIGS CONTENTION PARTITION PRIMARY INDEX SIZING
Timestamp
06/08/2018