FAQ - All Flash

As of server version 4.3, Aerospike can be configured to store the Primary Index on NVMe Flash devices. This article covers some frequently asked questions regarding this feature.

1. For the storage device, can the same mount be used to store the data or does it have to be separate?

The devices for storing the Primary Index and the Data should be different.

2. Can two namespaces share the same location to store indexes? For example, if a drive has more capacity than needed for one namespace’s indexes.

There may be more than one mount per namespace. Although not recommended, a mount may be shared with other namespaces. For sizing details when using index-type flash, refer to the Capacity Planning page. Refer to the mount configuration reference for further details.

Mounts can be shared across namespaces. This is because a mount is a directory, and the actual files are the index arena stages, whose names have the namespace (and instance) IDs built in, so files for different namespaces and instances can coexist in the same directory. For example, namespace1 uses mount /mnt/nvme with size 4GiB and namespace2 also uses mount /mnt/nvme with size 8GiB (assuming /mnt/nvme has at least 12GiB of capacity).

Since sharing is possible, there is a configuration item, mounts-size-limit, to set an index device quota per namespace. This limit is enforced only via eviction, for which there is a configurable threshold, mounts-high-water-pct. While mounts-size-limit is not a hard limit (a namespace is not required to expire or evict records), it must be configured anyway. The minimum allowed value is 4GiB, and the maximum cannot exceed the actual space available on the mounts.
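As an illustration, two namespaces sharing one mount could be configured along the following lines. This is a sketch only: the namespace names, mount path, and sizes are examples, and the exact directives for your server version should be checked against the configuration reference.

```
namespace namespace1 {
    index-type flash {
        mount /mnt/nvme
        mounts-size-limit 4G
        mounts-high-water-pct 80
    }
    # ... remaining namespace configuration ...
}

namespace namespace2 {
    index-type flash {
        mount /mnt/nvme          # same directory, shared with namespace1
        mounts-size-limit 8G
    }
    # ... remaining namespace configuration ...
}
```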

Note that while sharing mounts across namespaces is possible, it is not recommended. It may instead be beneficial for performance to use multiple mounts (and underlying devices) for one namespace.

3. How do I monitor space used for the index?

It’s essential to understand that All Flash configurations pre-allocate the index space. For example, an 8-node cluster configured with 32,768 partition-tree-sprigs (sprigs per partition) and a replication factor of 2 means the sprigs themselves need 128GiB of index device space on each node:

(4096 partitions * 2 (replication factor) * 32,768 sprigs per partition * 4KiB) / 8 nodes = 128GiB

The amount of RAM consumed by those sprigs across the cluster is 3.25GiB on the Enterprise Edition: each sprig has an overhead of 13 bytes, and there are 4096 partitions times a replication factor of 2 times 32,768 sprigs.
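The arithmetic above can be reproduced in the shell as a sanity check. The node count, sprig count, and replication factor are the example values from this question, not recommendations:

```shell
PARTITIONS=4096          # fixed number of partitions per namespace
RF=2                     # replication factor
SPRIGS=32768             # partition-tree-sprigs
NODES=8
SPRIG_BLOCK=4096         # each sprig pre-allocates one 4KiB index block
SPRIG_RAM=13             # RAM overhead per sprig in bytes (Enterprise Edition)

flash_per_node=$(( PARTITIONS * RF * SPRIGS * SPRIG_BLOCK / NODES ))
ram_cluster=$(( PARTITIONS * RF * SPRIGS * SPRIG_RAM ))

echo "index device per node: $(( flash_per_node / 1073741824 )) GiB"   # 128 GiB
echo "sprig RAM, whole cluster: $(( ram_cluster / 1048576 )) MiB"      # 3328 MiB = 3.25 GiB
```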


The statistics to monitor are index_flash_used_pct for the used percentage and index_flash_used_bytes for the usage in bytes.

Important: note that those statistics show usage based on the number of records rather than the number of sprigs instantiated. A sprig is instantiated with the first record it contains, so the primary index mount fills up at a pace of roughly 4KiB per record inserted, until all the configured sprigs have been instantiated. The primary index mount usage can be checked directly on the system. Once all the sprigs are instantiated, the primary index disk usage remains stable until sprigs start exceeding their 4KiB initial allocation (which holds 64 records) and overflow into a second 4KiB block. This would of course impact performance and would likely require re-sizing.
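These statistics appear in the per-namespace info output, one stat per line when asinfo is invoked with -l. The sketch below filters them from sample output; in practice the commented asinfo invocation would be used instead, with "test" replaced by the actual namespace name:

```shell
# In practice:  asinfo -h 127.0.0.1 -p 3000 -v "namespace/test" -l
# Sample of that output, one statistic per line:
stats='index_flash_used_bytes=5502926848
index_flash_used_pct=1
objects=1000000'

# Keep only the index flash usage statistics.
printf '%s\n' "$stats" | grep '^index_flash_used_'
```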

Server Logs

There is an INFO log entry, index-flash-usage, for each namespace:

{ns_name} index-flash-usage: used-bytes 5502926848 used-pct 1

This is printed periodically every 10 seconds, for each namespace configured with index-type flash.

Refer to the server log reference documentation for more details on this log line.
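For monitoring scripts, the two numbers can be pulled out of such a log line with standard tools. This is a sketch using the sample line from above; the log line prefix on a real system also carries a timestamp and context:

```shell
# Extract used-bytes and used-pct from an index-flash-usage log line.
line='{test} index-flash-usage: used-bytes 5502926848 used-pct 1'

used_bytes=$(echo "$line" | sed -n 's/.*used-bytes \([0-9]*\).*/\1/p')
used_pct=$(echo "$line" | sed -n 's/.*used-pct \([0-9]*\).*/\1/p')

echo "used: ${used_bytes} bytes (${used_pct}%)"
```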

4. How important is it to set partition-tree-sprigs and to determine the fill factor based on estimated/current records?

If the namespace is projected to grow rapidly, a lower fill fraction is more appropriate in order to leave room for future records. Full sprigs span more than a single 4KiB index block, and then require more than a single index device read, impacting performance. Modifying the number of sprigs to mitigate such a situation requires a cold start to rebuild the primary index. It is therefore essential to determine an adequate fill factor in advance. Refer to the Capacity Planning documentation for All-Flash for more details.

Another important point to consider is cluster size reduction (planned maintenance, unexpected shutdown, network partition splits). The min-cluster-size configuration parameter prevents sub-clusters below the configured minimum size from forming, which avoids a quick proliferation of sprigs for partitions a node did not previously own filling up the primary index mounts. Refer to Index Device Space for All-Flash for further details.

If the number of records is not expected to drastically change, a higher fill fraction would help lower the time to traverse the index (for scans, migrations, nsup cycle, and other operations that traverse the full primary index). Fuller sprigs mean fewer device read operations for a given number of records – each read simply fetches more records into memory.
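A rough estimate of the average sprig fill can be computed from the projected record count, under the assumption stated above that one 4KiB index block holds 64 entries. The record and sprig counts below are illustrative:

```shell
RECORDS=1000000000       # projected unique records in the namespace
RF=2                     # replication factor
PARTITIONS=4096
SPRIGS=32768             # partition-tree-sprigs
PER_BLOCK=64             # index entries held by one 4KiB block

recs_per_sprig=$(( RECORDS * RF / (PARTITIONS * SPRIGS) ))
fill_pct=$(( recs_per_sprig * 100 / PER_BLOCK ))

echo "~${recs_per_sprig} records per sprig, ~${fill_pct}% of the first 4KiB block"
```

A result well under 100% leaves headroom for growth; approaching 100% means sprigs are about to overflow into a second block.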

5. The configuration documentation states that it requires a feature-key-file, is there an additional cost for this feature?

The All-Flash feature does require a feature-key and has an additional cost. Please contact your Aerospike Account Owner for pricing.

6. How do I get information about dirty pages of the Primary Index?

As of Aerospike 4.8.0, a new command gives more detailed information about the memory used by the primary index. The command is index-pressure and its format is shown below:

$ asinfo -l -v index-pressure
test:1610612736,69632

The two numbers (shown above with illustrative output for a namespace named test) indicate the amount, in bytes, taken by the primary index and the amount of that which is dirty (not yet flushed to disk). In the example above there is ~1.5GiB used in total for the index and around 68KiB is dirty.

The command works for both hybrid storage and all-flash but gains particular utility when the index is on disk. In an all-flash mode the index-pressure command gives an indication of how far behind the index write-back is lagging.

With index on disk, primary indexes are mmap()ed. They are modified as if they were in RAM. When an index entry is touched, the kernel brings the corresponding page from the index drive into RAM. If an index entry is modified, the kernel lazily writes the corresponding modified page from RAM back to the index drive. The RAM the page used then becomes available again for other purposes.

Pages that have been modified, but not yet written back to the index drive, are dirty pages. When the write-back process cannot keep up with index modifications, dirty pages pile up, consuming more and more RAM, to the point where the system may run out of memory.

The index-pressure command gives the amount of RAM taken up by a namespace’s primary index pages that are currently cached in RAM, as well as how many of them are dirty. The higher the dirty value, the more the write-back is lagging.
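A monitoring script can split one line of this output into its two numbers with plain shell parameter expansion. The sample value below is illustrative (~1.5GiB resident, 68KiB dirty), and the namespace:total,dirty layout follows the format shown above:

```shell
# One line of index-pressure output: namespace:total-bytes,dirty-bytes
line='test:1610612736,69632'

ns=${line%%:*}                        # text before the colon
total=${line#*:}; total=${total%%,*}  # between colon and comma
dirty=${line##*,}                     # text after the comma

echo "$ns: $(( total / 1048576 )) MiB resident, $(( dirty / 1024 )) KiB dirty"
```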



September 2019

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.