Aerospike node asserts and shuts down with 'Too Many Chunks' during migration

Problem Description

An Aerospike node with a namespace configured as all-flash (index on disk) shuts down with the following messages in the log:

Oct 07 2021 12:51:23 GMT: CRITICAL (arenax): (arenax_ee.c:98) too many chunks
Oct 07 2021 12:51:23 GMT: WARNING (as): (signal.c:218) SIGUSR1 received, aborting Aerospike Enterprise Edition build 5.6.0.8 os el7
Oct 07 2021 12:51:23 GMT: WARNING (as): (log.c:630) stacktrace: registers: rax 0000000000000000 rbx 000000000000000a [...] 0000000000001926 r13 00007f46bbd67808 r14 00007f4971a00800 r15 000000000bcea660 rip 00007f4983521690
Oct 07 2021 12:51:23 GMT: WARNING (as): (log.c:643) stacktrace: found 11 frames: 0x6862a1 0x4ef4fb 0x7f49835217e0 0x7f4983521690 0x685a67 0x666ae7 0x4abdbc 0x4abe18 0x6746a7 0x7f498351740b 0x7f49820c50bf offset 0x0

Explanation

This error occurs when there is a serious misconfiguration of partition-tree-sprigs on the all-flash namespace.

A sprig is a branch of the primary index. When the index is held in RAM the number of branches is usually unimportant until the index becomes very large. Increasing the number of sprigs increases index efficiency at the expense of memory consumption.

When the index is on disk (all-flash), the number of sprigs becomes much more important. Disk access in this configuration typically consists of 4 KiB block reads, so each sprig should ideally be fully contained in a single 4 KiB disk block (chunk). A record lookup then costs exactly one disk I/O.

When sizing all-flash installations, care is taken to estimate the size of the index required and to calculate the right number of partition-tree-sprigs such that all index entries are stored at the desired ‘fill fraction’ (see below) and that each sprig uses a single 4 KiB chunk and no more.
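The sizing calculation described above can be sketched roughly as follows. This is an illustration only, not the official sizing procedure: the function name `recommended_sprigs` is hypothetical, and the constants (4096 partitions per namespace, 64 bytes per index entry, rounding up to a power of two) are assumptions that do not come from this article and should be checked against the Capacity Planning Guide.

```python
import math

PARTITIONS = 4096         # assumption: partitions per namespace
INDEX_ENTRY_BYTES = 64    # assumption: bytes per primary index entry
CHUNK_BYTES = 4 * 1024    # one 4 KiB chunk per sprig, as described above


def recommended_sprigs(unique_records: int, fill_fraction: float) -> int:
    """Estimate a partition-tree-sprigs value that keeps each sprig in one chunk."""
    if unique_records <= 0:
        raise ValueError("unique_records must be positive")
    if not 0.0 < fill_fraction <= 1.0:
        raise ValueError("fill_fraction must be in (0, 1]")
    entries_per_chunk = CHUNK_BYTES // INDEX_ENTRY_BYTES  # 64 entries per chunk
    # Index entries the namespace can hold per unit of "sprigs per partition"
    # when every sprig is filled only up to the fill fraction.
    usable_entries = PARTITIONS * entries_per_chunk * fill_fraction
    needed = max(1, math.ceil(unique_records / usable_entries))
    # Assumption: the setting is rounded up to the next power of two.
    return 1 << (needed - 1).bit_length()


if __name__ == "__main__":
    # 4 billion unique records at a fill fraction of 0.5
    print(recommended_sprigs(4_000_000_000, 0.5))  # prints 32768
```

The sketch does not check the result against the range of values the server accepts for partition-tree-sprigs.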

This is not enforced. There is no limit in the code to the number of chunks per sprig. It is possible, though deeply inadvisable, to have sprigs consist of many 4 KiB chunks. This means that for each record lookup there would be, potentially, a large number of disk I/O operations, which would be extremely detrimental to performance. The system will allocate as many chunks as it needs to store the records that are loaded in. The number of chunks is not directly configurable. The number of partition-tree-sprigs is used as an indirect control.

The error above occurs when a partition whose sprigs have more than 100 chunks is dropped. When the chunks are cleaned up for reuse, a sanity check runs. If a sprig has more than 100 chunks, the sprig is assumed to be corrupt and the node shuts itself down.
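To judge how close a given configuration is to the 100-chunk limit, the average number of chunks per sprig can be estimated. The sketch below is a rough illustration, not the server's own check: the function names are hypothetical, it assumes 4096 partitions and 64-byte index entries (neither figure comes from this article), and it assumes records are spread evenly across sprigs. The real number of chunks in any one sprig may differ.

```python
import math

PARTITIONS = 4096            # assumption: partitions per namespace
ENTRIES_PER_CHUNK = 64       # assumption: 4 KiB chunk / 64-byte index entry
MAX_CHUNKS_PER_SPRIG = 100   # limit described in this article


def estimated_chunks_per_sprig(unique_records: int, sprigs_per_partition: int) -> int:
    """Average 4 KiB chunks per sprig, assuming an even spread of records."""
    if unique_records < 0 or sprigs_per_partition <= 0:
        raise ValueError("invalid record count or sprig count")
    entries_per_sprig = unique_records / (PARTITIONS * sprigs_per_partition)
    return math.ceil(entries_per_sprig / ENTRIES_PER_CHUNK)


def exceeds_chunk_limit(unique_records: int, sprigs_per_partition: int) -> bool:
    """True if the estimate is above the 100-chunk limit."""
    chunks = estimated_chunks_per_sprig(unique_records, sprigs_per_partition)
    return chunks > MAX_CHUNKS_PER_SPRIG


if __name__ == "__main__":
    # 4 billion records with only 64 sprigs per partition
    print(estimated_chunks_per_sprig(4_000_000_000, 64))
    print(exceeds_chunk_limit(4_000_000_000, 64))
```

Any estimate above one chunk per sprig already means more than one disk I/O per lookup, so a result near 100 points to sizing that is wrong by roughly two orders of magnitude.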

Solution

If this error is observed, it indicates a major misconfiguration. The sizing should be reviewed carefully, with an Aerospike Solutions Architect if need be, before any nodes are restarted.

Notes

  • The Fill Fraction defines the level to which a sprig is filled. It leaves room for some expansion without overfilling the sprig and consuming more than one chunk per sprig.
  • The Linux Capacity Planning Guide gives details on how to size all-flash installations correctly.

Applies To

Server 4.2.0.2 or later.

Keywords

ALL-FLASH MIGRATE CHUNKS TOO MANY

Timestamp

October 2021

© 2021 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.