What happens when a client writes a record to Persisted Storage in the Aerospike database?


When the client writes a record to the database, the server performs a series of steps on the record:

  1. The record is first written to the record block. A record block is exactly 128 bytes (prior to versions 2.7.0 and 3.2.0, the record block size was 512 bytes). A record can span more than one record block, but a record block only holds the contents of a single record. If a record does not completely fill its last record block, the unused bytes in that block are not reclaimed.

    For example, if a record is 129 bytes (including metadata overhead), it spans two record blocks and uses 256 bytes. 127 bytes are lost: 2 * 128 = 256 and 256 – 129 = 127.

    Refer to the capacity planning guide for further details on record sizes on disk.

  2. Record blocks are written to a streaming write buffer (swb), sometimes also referred to as a wblock. The persisted namespace example in the configuration file shipped with the installation sets the wblock size to 128 KiB. If you omit write-block-size from the configuration file, the default wblock size is 1 MiB.

    While one record can span multiple record blocks, it cannot span multiple wblocks. The write-block-size is the maximum size for a record.

  3. By default, the swb (wblock) is flushed to the Write Queue when one of two conditions is true:

    • When it is full.
    • When it has not been flushed within the last second. (You can configure this interval with the flush-max-ms parameter.)

    An option to flush to disk on every write will also be possible in future releases.
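The sizing and flush rules in steps 1 to 3 can be sketched in a few lines of Python. This is an illustrative model only, not server code: the function names are hypothetical, the 128 KiB write block is the example value mentioned above, and "full" is simplified to mean that the used bytes have reached the wblock size.

```python
RECORD_BLOCK_BYTES = 128        # record block size stated in step 1
WRITE_BLOCK_BYTES = 128 * 1024  # example write-block-size (128 KiB)
FLUSH_MAX_MS = 1000             # one-second interval from step 3

def blocks_needed(record_bytes):
    """Number of record blocks a record occupies (rounded up)."""
    return (record_bytes + RECORD_BLOCK_BYTES - 1) // RECORD_BLOCK_BYTES

def bytes_on_disk(record_bytes):
    """Space actually consumed, in whole record blocks."""
    return blocks_needed(record_bytes) * RECORD_BLOCK_BYTES

def wasted_bytes(record_bytes):
    """Unused bytes in the record's last record block."""
    return bytes_on_disk(record_bytes) - record_bytes

def fits_in_wblock(record_bytes):
    """A record cannot span wblocks, so it must fit in a single one."""
    return bytes_on_disk(record_bytes) <= WRITE_BLOCK_BYTES

def should_flush(used_bytes, ms_since_last_flush):
    """Simplified flush rule: the buffer is full or the interval elapsed."""
    return used_bytes >= WRITE_BLOCK_BYTES or ms_since_last_flush >= FLUSH_MAX_MS

print(blocks_needed(129), bytes_on_disk(129), wasted_bytes(129))
```

For the 129-byte record from step 1, this reproduces the two blocks, 256 bytes used, and 127 bytes lost.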

Overview of writing a record to persistent storage with replication factor greater than 1:

  • The client finds the node owning the master copy of the partition the record belongs to (using the partition map), then makes a write request to that node. This node is called node A in this example.
  • Node A makes an entry into the index (in RAM) and writes the record in a streaming write buffer (swb).
  • Node A then replicates the write through the fabric to the node(s) owning the other replica copies of the partition the record belongs to, if the replication factor is greater than 1.
  • Node A returns a success ACK to the client once the record is written to the streaming write buffer of every replica.
  • Node A and the relevant nodes owning replica copies asynchronously flush their swb to persistent storage.
  • Finally, the index is updated in memory to point to the record's location on disk.
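The ordering of these steps (buffer in memory on every replica, ACK, then asynchronous flush) can be illustrated with a toy in-memory model. Everything here is hypothetical and greatly simplified: the Node class, its fields, and the write function are illustration names, lists stand in for the swb and the disk, and the fabric is replaced by a direct method call.

```python
class Node:
    """Toy stand-in for a cluster node; not the real server structures."""

    def __init__(self, name):
        self.name = name
        self.index = {}  # key -> ("swb", position) or ("disk", position)
        self.swb = []    # records buffered in memory, not yet flushed
        self.disk = []   # stand-in for persistent storage

    def buffer_write(self, key, value):
        # Create or update the index entry and place the record in the swb.
        self.index[key] = ("swb", len(self.swb))
        self.swb.append((key, value))

    def flush(self):
        # Asynchronous in the real system; called explicitly in this sketch.
        for key, value in self.swb:
            self.disk.append((key, value))
            self.index[key] = ("disk", len(self.disk) - 1)
        self.swb.clear()

def write(master, replicas, key, value):
    """Master buffers the write, replicates it, then acknowledges."""
    master.buffer_write(key, value)
    for replica in replicas:
        replica.buffer_write(key, value)  # stands in for fabric replication
    return "ACK"  # returned before anything reaches disk

node_a, node_b = Node("A"), Node("B")
status = write(node_a, [node_b], "user:1", {"bin": 42})
node_a.flush()  # each node flushes independently, after the ACK
```

After the call to write, both nodes hold the record in their swb and nothing is on disk yet; only the later flush moves the index entry on node A from the buffer to a disk location.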

Details of writing to Persistent Storage:

  1. A record that has been written to the server has a hash of the record’s primary key. Depending on how the client is written, either the client or the server creates the hash with the RIPEMD160 algorithm.

  2. The client uses the partition map to determine the node and partition to write the data to. The data is written to the streaming write buffer (swb), which is eventually flushed to disk when it is full or after the flush-max-ms threshold. The swb is also kept in the post-write-queue, except if data-in-memory is set to true, in which case the block is also written to and kept in memory. The client gets an ACK as soon as the record is written in memory (on both master and replica). Flushing to disk happens asynchronously, when the streaming write buffer is full or the flush-max-ms threshold is reached. After writing the record, the server either creates or updates a 64-byte index entry in memory.

  3. The index’s metadata contains the storage address of the data in long-term storage. The location is one of the following:

    • The location on hard disk where the data resides.
    • The in-flight location where the data resides before it is written to hard drives.
    • The post-write queue (except if data-in-memory is set to true).
  4. With a replication factor greater than 1, the write is also synchronously sent to the nodes owning replica copies of the partition.

Miscellaneous details on Delete/Update/Expire/Evict:

  • If a record is updated, Aerospike writes a new copy of the record to disk. It also updates the TTL for the record (this is the default behavior and can be omitted) and updates the disk address in the primary index.
  • If a record is not touched or updated until its TTL value is reached, it expires and is deleted as part of the subsequent nsup cycle. A client cannot retrieve an expired record, even if it has not been deleted yet.
  • When the database breaches either high-water-memory-pct or high-water-disk-pct thresholds, records will be evicted.
  • If a record is deleted, the index entry in memory that points to the record is removed (unless the durable delete policy is used). The record’s size is deducted from the wblock usage for the wblock where the record lived. If the wblock usage drops below the defrag-lwm-pct threshold, the wblock is placed in the defrag queue. When the wblock usage drops to 0, the wblock is added to the free wblock queue, from which it can be reused and its previous content overwritten.
  • If you are storing records on disk and the server is rebooted, the indexes are rebuilt from disk. If the location on disk of the deleted record was not overwritten prior to the reboot, the record is indexed again during the cold restart. The record then returns from the disk to the database as a zombie record (unless the durable delete policy was used).
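The wblock accounting described for deletes can be sketched as follows. This is a simplified model under stated assumptions: the WBlock class, the two queues, and on_record_removed are hypothetical names, the 128 KiB block size is the example value used earlier, and the 50 percent value for defrag-lwm-pct is assumed for illustration. The actual defragmentation work (moving the remaining records into a new wblock) is not modeled.

```python
from collections import deque

WRITE_BLOCK_BYTES = 128 * 1024  # example write-block-size (128 KiB)
DEFRAG_LWM_PCT = 50             # assumed defrag-lwm-pct, for illustration

class WBlock:
    """Tracks how many bytes of live records a wblock still holds."""

    def __init__(self, wblock_id, used_bytes=0):
        self.wblock_id = wblock_id
        self.used_bytes = used_bytes

defrag_queue = deque()  # wblocks whose usage fell below the threshold
free_queue = deque()    # empty wblocks that may be overwritten

def on_record_removed(wblock, record_bytes):
    """Deduct a removed record's size and requeue the wblock if needed."""
    wblock.used_bytes = max(0, wblock.used_bytes - record_bytes)
    if wblock.used_bytes == 0:
        if wblock in defrag_queue:
            defrag_queue.remove(wblock)
        free_queue.append(wblock)
        return "free"
    usage_pct = 100 * wblock.used_bytes / WRITE_BLOCK_BYTES
    if usage_pct < DEFRAG_LWM_PCT:
        if wblock not in defrag_queue:
            defrag_queue.append(wblock)
        return "defrag"
    return "in_use"

wb = WBlock(0, used_bytes=100 * 1024)
states = [
    on_record_removed(wb, 30 * 1024),  # 70 KiB left, above the threshold
    on_record_removed(wb, 10 * 1024),  # 60 KiB left, below 50 percent
    on_record_removed(wb, 60 * 1024),  # nothing left
]
```

Note that the data of a removed record stays on the device until its wblock is reused, which is why a cold restart can re-index it as described in the last bullet.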
