Buffering and Caching in Aerospike


Buffers and caches used by Aerospike:

  • current write block - When a record is written and needs to be stored on disk, it is put into an in-RAM buffer that holds the current write block, i.e., the write block that asd is currently filling up. When the current write block is full, it is persisted to disk and asd starts a new write block. Thus, all writes to a write block are coalesced into a single, write-block-size device write. This keeps write IOPS low.

  • post-write queue - The most recently persisted write blocks are kept in RAM after being written to disk. The idea is that subsequent reads are more likely to hit recently written records than older records; in particular, XDR reads recently written records. If such a record is read, its data can be retrieved from the write block in the post-write queue that contains it, rather than having to be retrieved from the device.

  • page cache - In general, the Linux kernel does any device I/O via the page cache. When a process writes data, the data goes to the page cache, i.e., to RAM, and the Linux kernel takes care of writing it to the device asynchronously at a later point in time. The page cache operates with a granularity of 4-KiB pages. So, if two small writes hit the same 4-KiB page in rapid succession, these two writes will be coalesced into one 4-KiB write later, when the Linux kernel decides to asynchronously write the page to the underlying device. Between data in a page getting modified (in RAM) and the page actually getting written to disk, the page is said to be dirty.

    Reads also go through the page cache. When a process reads data from a device, the Linux kernel reads the 4-KiB pages that hold the data into the page cache, i.e., to RAM. From there, the Linux kernel copies the data to the read buffer provided by the process.

    The page cache uses least-recently-used eviction. The pages that contain data that was recently read or written are thus kept in RAM, so that subsequent reads hopefully won’t have to go to the underlying device, but will find their data already in the page cache.

    The page cache’s lifetime is bounded by the Linux kernel’s lifetime. The data will be safe even if a process crashes after writing data to the page cache but before the Linux kernel actually writes the data to disk. The page cache is system-wide and not bound to a process. The page cache only loses data when the kernel panics or there is a sudden power loss, in other words, when the kernel doesn’t get shut down cleanly. A clean OS shutdown will write all dirty pages to disk.

  • hardware caches - Disk devices and controllers can also contain caches. Again, the idea is to coalesce writes and to keep recently read or written data in RAM. The difference is just that for hardware caches, the RAM sits on the disk device or the controller. How exactly these caches work and which guarantees they come with differs from device to device. Sometimes the cache of a device can also be configured, i.e., it allows selecting from a set of different behaviors. Some of these caches are battery-backed, so that a sudden power loss would not cause data loss; others aren’t.
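As an illustration of how the current write block coalesces many small record writes into few block-sized device writes, here is a minimal Python sketch (not Aerospike's actual implementation; the block size and the flush callback are assumptions for the example):

```python
# Simplified sketch of a current-write-block buffer: small record writes
# are appended to an in-RAM block, and only full blocks are handed to a
# flush callback as one block-sized write.

class WriteBlockBuffer:
    def __init__(self, flush, block_size):
        self.flush = flush          # called with one full block of bytes
        self.block_size = block_size
        self.current = bytearray()  # the current write block (in RAM)

    def write_record(self, data: bytes):
        if len(self.current) + len(data) > self.block_size:
            self._persist()         # block full: one big device write
        self.current += data

    def _persist(self):
        self.flush(bytes(self.current))
        self.current = bytearray()  # start a new current write block

# Many small writes result in few large flushes:
flushed = []
buf = WriteBlockBuffer(flushed.append, block_size=1024)
for _ in range(100):
    buf.write_record(b"x" * 100)    # 100 records of 100 bytes each
# -> 9 full-block flushes of 1000 bytes; the last 1000 bytes are still
#    buffered in RAM (this is the "buffer loss window" discussed below)
```

The remainder left in `buf.current` is exactly the data that a crash could lose, which is what flush-max-ms (discussed later) bounds in time.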

Cache hierarchy:

The cache hierarchy consists of the following three layers:

  • the current write block
  • the page cache
  • the hardware caches

All three of them can delay the persistence of written record data, temporarily keeping the data in RAM, where it can be affected by unexpected events such as a sudden loss of power. The post-write queue doesn’t factor into this, as it only keeps already-written data around; it doesn’t delay the data on its way to persistent storage.

These three buffers and caches form three layers of a hierarchy that written data moves through.

  • As the first layer, asd keeps data in the current write block (unless commit-to-device is set to true).
  • When asd decides to actually persist a write block, the page cache comes into play as a second layer and may further delay persistence.
  • Once the Linux kernel decides to write the data from the page cache to the underlying device, the hardware caches come into play as a third layer and may further delay persistence.
  • Finally, the device’s firmware will decide to move the data from the hardware cache to persistent media. Only then will the data be safe from any unexpected events such as a sudden power loss.
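The effect of the flags discussed below can be sketched with a short Python example: a plain write parks data in the page cache, and layers two and three persist it later, while an O_DSYNC write completes only once the data has been pushed toward stable storage. (This is an illustration using a temporary file; O_DIRECT is omitted because it requires aligned buffers.)

```python
import os
import tempfile

# Create a throwaway file path for the demonstration.
tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()

# Buffered path: data lands in the page cache; the kernel and the
# device's hardware cache persist it asynchronously, some time later.
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
os.write(fd, b"buffered")
os.close(fd)

# Synchronous path: each write() returns only once the kernel has pushed
# the data through the page cache and asked the device to persist it.
fd = os.open(path, os.O_WRONLY | os.O_DSYNC)
os.write(fd, b"synchron")
os.close(fd)

with open(path, "rb") as f:
    data = f.read()  # readers see the same bytes in both cases
```

Note that O_DSYNC changes the durability guarantee of each write, not its visibility: later reads observe the data either way.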

Configuration options:

Here are the configuration options related to these buffers and caches:

Aerospike server version 4.3.1 and above:

  • By default, reads from and writes to devices use O_DIRECT and O_DSYNC. This bypasses the latter two layers in the three-layer cache hierarchy, the page cache and the hardware caches. However, data can still be lost in the first layer, the current write block. This buffer loss window is bounded, though, by how often asd writes partial write blocks to the underlying device. Refer to the flush-max-ms configuration directive. Therefore, the buffer loss window (in case of a crash or power loss) can be quantified and controlled.

  • By default, reads from and writes to files don’t use any flags. Therefore, caching applies to both. All reads are cached in the page cache, and written data can theoretically be lost in the page cache or in the hardware caches. As with devices, data can also be lost in the first layer. However, while the loss window in the first layer is bounded (refer to the flush-max-ms configuration directive), there are no definitive bounds regarding the page cache or the hardware caches. So, this default has slightly weaker guarantees than the default for devices. As mentioned above, though, data loss in the second and third layers requires a kernel panic or sudden power loss. If asd crashes, the data will be preserved in these layers, even with these somewhat lesser guarantees.

  • direct-files - This configuration directive enables O_DIRECT and O_DSYNC for files, i.e., it brings the parameters for files in line with the defaults for devices. Reads and writes then bypass the page cache and the hardware caches. Data can still be lost in the first layer, though, just like for devices; the next configuration directive addresses this.

  • flush-max-ms - This directive configures the interval (in milliseconds) at which asd writes a partially filled current write block to the device. This bounds the loss window in the first layer of the cache hierarchy, the current write block, to the given interval. Note that this only applies to the first layer. If used with files without also setting direct-files, caching still happens in the page cache and in the hardware caches.

  • commit-to-device - This configuration directive takes flush-max-ms to its logical conclusion: synchronously write record data to the underlying device during a write transaction. In contrast to flush-max-ms, though, this affects all three layers of the cache hierarchy: if O_DIRECT and O_DSYNC aren’t enabled yet, this will enable them. For devices, O_DIRECT and O_DSYNC are enabled by default, so this aspect of commit-to-device only applies to files. This also means that the direct-files directive is not needed when using commit-to-device with files. In any case, this configuration directive disables caching in all three layers of the hierarchy.

  • read-page-cache - This configuration directive removes O_DIRECT and O_DSYNC for record reads done by transactions. This means that the read data will not only go to asd, but also into the page cache. If the same record gets read again by a subsequent transaction, asd will not need to go to the device; the read will be satisfied from the page cache. This configuration directive doesn’t change anything about writes, i.e., it doesn’t affect any write guarantees established by the above configuration options, say, commit-to-device.
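A hypothetical aerospike.conf storage-engine fragment combining these directives might look as follows (the namespace name, file path, and sizes are placeholders, and the exact set of supported directives depends on the server version; consult the configuration reference for yours):

```
namespace test {
    memory-size 4G
    replication-factor 2

    storage-engine device {
        file /opt/aerospike/data/test.dat
        filesize 16G
        direct-files true      # O_DIRECT/O_DSYNC for file-backed storage
        flush-max-ms 1000      # bound the first-layer loss window to 1 s
        read-page-cache true   # cache transaction reads in the page cache
        # commit-to-device true  # strongest guarantee: synchronous writes
    }
}
```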

So, if reads are cached in the page cache, but writes bypass the page cache, won’t reads potentially read stale cached data? This is not an issue, because the Linux kernel guarantees page cache coherence. Even though in our scenario writes aren’t cached, a write of some data invalidates a cached copy of that data in the page cache. Therefore, writes don’t overwrite data in the page cache, but they do invalidate it, if needed. A subsequent read of the written data is thus forced to hit the device again and read the fresh data, not a stale version of it.

Aerospike server versions prior to 4.3.1:

  • flush-max-ms - This directive configures the interval (in milliseconds) at which asd writes a partially filled current write block to the device. This bounds the loss window in the first layer of the cache hierarchy, the current write block, to the given interval. Note that this only applies to the first layer of the cache hierarchy.

  • commit-to-device - This configuration directive takes flush-max-ms to its logical conclusion: synchronously write record data to the underlying device during a write transaction. In contrast to flush-max-ms, though, this affects all three layers of the cache hierarchy.

  • fsync-max-sec - This was the precursor to direct-files, which was introduced in 4.3.1. It worked in the spirit of flush-max-ms and configured the interval, in seconds, at which an fsync() system call was made to:

    • a) force the Linux kernel to write dirty pages from the page cache to the underlying devices
    • b) ask the underlying devices to store any received writes in their hardware caches on persistent media

  • enable-osync - This was the precursor to the use of the O_DSYNC flag. It globally enabled the use of O_SYNC for all reads and writes. (For the sake of this discussion, we ignore the small difference between O_DSYNC and O_SYNC; refer to the open(2) man page for details.) Before 4.3.1, O_DIRECT was only enabled by default for devices to bypass the page cache. Bypassing the third layer, the hardware caches, had to be enabled manually via this configuration option.

  • disable-odirect - This was the precursor to read-page-cache. However, it affected reads as well as writes - and it affected all reads, even those by, say, the defragmentation background process. It simply disabled O_DIRECT globally. Therefore, it allowed for read caching, but it would come at the expense of write caching, i.e., written data could be lost in the page cache in case of a kernel panic or sudden power loss.
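The fsync-max-sec behavior can be sketched as a background flusher that periodically calls fsync() on the storage file, pushing dirty pages out of the page cache and asking the device to commit its hardware cache (a simplified illustration; the interval, the temporary file, and the stop mechanism are not Aerospike's code):

```python
import os
import tempfile
import threading

def periodic_fsync(fd, interval_sec, stop: threading.Event):
    # Until told to stop, issue fsync() at a fixed interval: this flushes
    # dirty pages from the page cache and requests a device-level flush.
    while not stop.wait(interval_sec):
        os.fsync(fd)

# Throwaway file standing in for the storage file.
tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()

fd = os.open(path, os.O_WRONLY)
stop = threading.Event()
flusher = threading.Thread(target=periodic_fsync, args=(fd, 0.05, stop))
flusher.start()

os.write(fd, b"record data")  # lands in the page cache first

stop.set()      # in real life the flusher runs for the server's lifetime
flusher.join()
os.fsync(fd)    # final explicit flush before closing
os.close(fd)
```

The write becomes durable at the latest one interval after it entered the page cache, which is the same bounded-loss-window idea that flush-max-ms applies to the first layer.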

Notes

  • It is a good practice to benchmark any cache tuning in a staging or development environment prior to implementing it in production.

Keywords

CACHING O_DIRECT O_DSYNC COMMIT-TO-DEVICE

Timestamp

10/16/2018