Basics of SSDs
Aerospike delivers in-memory performance with disk-based persistence through the use of SSDs. This paper discusses some fundamental aspects of SSDs that customers may wish to be aware of. It is not intended to be a definitive description of every aspect of SSD operation.
SSD Interface Types
There are three ways to attach SSDs to a system: SATA, SAS and PCIe. The key differentiator is the number of paths to the disk, which in turn defines the I/O bandwidth the disk can deliver. Additionally, more paths imply greater redundancy as well as lower I/O latency.
SATA interfaces are cheap but also the least performant; the interface is limited to 6 Gb/s. SAS fares better as it supports multipathing via dual-port interfaces and can deliver up to 12 Gb/s. For both SAS and SATA there are a number of layers in the software stack between the application and the data it accesses, which means that latency is, to an extent, 'built in' to the model.
PCIe behaves in a way much more akin to memory in terms of how applications access data and, as a consequence, latency is very low. Data moves along pairs of wires (one for transmission, one for reception) known as data lanes. A given PCIe slot can provide 1, 4, 8 or 16 data lanes. As PCIe bandwidth scales linearly, a PCIe 2.0 connection, which delivers 4 Gb/s per lane, can yield a total bandwidth of 64 Gb/s with 16 data lanes. PCIe can be accessed using a standards-based device driver, NVM Express (Non-Volatile Memory Host Controller Interface Specification), which is designed to work with the bandwidth delivered by PCIe interfaces. This leads to performance increases, not only through better bandwidth and higher capacity, but also through a designed-in ability to run in a highly parallel manner (for example, no thread locking is required).
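The linear scaling of PCIe bandwidth with lane count can be sketched as a short calculation. This is illustrative only: the per-lane figure below is the 4 Gb/s quoted above for PCIe 2.0, and the function name is an assumption, not a real API.

```python
# Illustrative sketch of PCIe bandwidth scaling with lane count.
# Assumes ~4 Gb/s per lane for PCIe 2.0, as quoted in the text above.

GBPS_PER_LANE = {"2.0": 4.0}  # per-lane bandwidth in Gb/s by PCIe generation

def pcie_bandwidth_gbps(generation: str, lanes: int) -> float:
    """Total bandwidth in Gb/s: scales linearly with the number of lanes."""
    if lanes not in (1, 4, 8, 16):
        raise ValueError("PCIe slots provide 1, 4, 8 or 16 lanes")
    return GBPS_PER_LANE[generation] * lanes

for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: {pcie_bandwidth_gbps('2.0', lanes):.0f} Gb/s")
# x16 yields 64 Gb/s, matching the figure in the text
```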
The advantages of PCIe are offset by a higher unit cost, around double that of SATA per GB. Even when SATA or SAS devices are striped together for performance, software overhead causes performance to degrade as workload and thread count increase. In terms of cost per IOP in a high-workload scenario, PCIe can therefore be cheaper overall due to its linear scaling characteristic.
Traditionally, the expected longevity or endurance of a given drive has been described in drive writes per day (DW/D), the amount of data that can be written to the drive each day during a given time period. The time period corresponds to the manufacturer's warranty for that drive. An SSD with 1 TB capacity and a 3-year warranty should be expected to handle 1 TB written to the drive every day for 3 years. When using DW/D it is important to check whether it refers to Total Flash Writes or Application Writes. Total Flash Writes is very much a best-case scenario; it is useful for comparing drives but does not represent realistic usage patterns. Application Writes attempts to mirror the worst-case scenario using small-block, random I/O patterns (reads, writes, wear levelling, garbage collection). Application Writes uses random, rather than sequential, writes as these result in lower drive endurance.
Drives can also be measured in Terabytes Written (TBW). There is a direct conversion between DW/D and TBW as follows:
TBW = DW/D × warranty period (years) × 365 × capacity (GB) / 1024 (to convert GB to TB)
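The conversion can be sketched as a small function; the function and parameter names are illustrative only.

```python
def tbw(dwd: float, capacity_gb: float, warranty_years: float) -> float:
    """Terabytes Written = DW/D x warranty period in days x capacity,
    with capacity converted from GB to TB."""
    return dwd * warranty_years * 365 * capacity_gb / 1024

# The example from the text: a 1 TB (1024 GB) drive rated at 1 DW/D
# with a 3-year warranty.
print(tbw(dwd=1, capacity_gb=1024, warranty_years=3))  # 1095.0 TBW
```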
When SSDs first came to market, write endurance was prioritised. As use cases have expanded, reads play more of a part. To reflect this, DW/D is augmented with notations as follows:
- HE - High Endurance
- ME - Medium Endurance
- RI - Read Intensive
- VRI - Very Read Intensive
In addition to these ratings, easily available tools such as smartctl can give good information on current SSD status and longevity.
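As an illustration, endurance-related counters can be pulled out of smartctl's attribute table. The sketch below parses a sample of such output; the sample text and attribute names are illustrative only, as real attribute IDs and names vary between SSD vendors.

```python
# Sketch: extract endurance-related attributes from `smartctl -A`-style
# output. SAMPLE is illustrative; real attribute names vary by vendor.

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH RAW_VALUE
177 Wear_Leveling_Count     0x0013   094   094   000    123
241 Total_LBAs_Written      0x0032   099   099   000    52144218471
"""

def parse_smart_attributes(text: str) -> dict:
    """Map attribute name -> raw value for each data row."""
    attrs = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 7:
            attrs[fields[1]] = int(fields[6])
    return attrs

attrs = parse_smart_attributes(SAMPLE)
print(attrs["Wear_Leveling_Count"])  # 123
```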
Flash memory can be classified as either NAND or NOR according to the characteristics of the internal cells (whether they behave more like a NAND logic gate or a NOR logic gate). NAND flash is preferred for enterprise SSDs due to its lower write and erase times and much higher endurance than NOR.
Initially, NAND technology used Single Level Cell (SLC) designs, in which a single cell stores a single bit of data because the cell can exist in two possible states. This provides excellent endurance for write-heavy applications but is very expensive. Over time, manufacturers moved to Multi Level Cell (MLC) architectures, in which each cell can exist in (usually) four states and so stores two bits. These offer less endurance and worse performance, but at a much lower price. To address the endurance issues, error handling and data protection are implemented at the controller level.
NAND cells can be made in a variety of ways. Traditionally this was done using MOSFET methods with doped polycrystalline silicon, however some manufacturers now use charge trap flash, in which a silicon nitride film is used to trap electrons.
NAND is now moving to a new standard, 3D NAND, which increases the number of available logic gates by, in concept, folding the planar string of gates over to create a 3D structure.
These concepts are important with respect to Aerospike's usage of SSDs: when choosing an SSD vendor, care must be taken to ensure that the vendor has a strategy to manage the transition to 3D NAND, as this is the current state of the art in terms of storage and performance.
As discussed, MLC architectures require overt error handling, usually implemented within the controller via software or firmware. The first line of defence is to avoid errors in the first place; this is achieved via proactive cell management and signal processing at the controller level to manage NAND wear dynamically. These techniques reduce the need for read retries by accessing error-free data. Other techniques such as predictive read optimisation can further increase the performance of the drive.
Each individual NAND die comprises multiple blocks, each of which contains multiple pages. The controller manages storage at the NAND block level. Data can be striped using software within the controller, and this can include redundancy information. The stripe is then spread across multiple blocks and channels to provide RAID-like resiliency by ensuring that no two blocks within a stripe reside in the same physical die.
There are four key metrics used to measure the performance of SSDs:
- IOPS - The number of I/O operations that can be completed in a given amount of time (1 second). In practical terms this is the transaction rate of the device.
- Throughput - The amount of data that can be transferred on and off the device in a given amount of time. Typical units are MB/s and GB/s.
- Latency - The time taken for a command to execute from leaving the host, to the disk, and back again: the round-trip time for an I/O request.
- QoS - Quality of Service is performance at, or above, a given confidence threshold over time. The threshold can be defined as a combination of the other metrics; the key is that it is maintained for a defined time period.
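The first two metrics are linked by block size: throughput is approximately IOPS multiplied by the block size of each operation. A minimal sketch, with illustrative names and figures:

```python
def throughput_mb_s(iops: float, block_size_kb: float) -> float:
    """Approximate throughput implied by an IOPS figure at a given
    block size (KB/s converted to MB/s)."""
    return iops * block_size_kb / 1024

# e.g. a drive sustaining 100,000 IOPS at a 4 KB block size
print(throughput_mb_s(100_000, 4))  # ~390.6 MB/s
```

This is why IOPS figures are usually quoted at small block sizes and throughput figures at large ones: the two headline numbers are rarely achievable at the same time.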
Given the effect different workloads can have on the performance of an SSD, any metrics must be tied to a specific use case. Factors such as block size and access pattern (random or sequential) can induce marked changes in performance, as can the read/write mix. Sequential operations are those where the starting Logical Block Address (LBA) follows directly from the previous operation; in effect, each new I/O begins where the previous I/O left off. Random operations are those where the LBA does not follow on from where the previous I/O finished, and so are much harder work for the disk.
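The sequential/random distinction can be expressed as a simple check on LBA continuity. The sketch below is illustrative; the `(lba, length)` representation of an operation is an assumption made for the example.

```python
def classify_ops(ops):
    """Label each (start_lba, length_in_blocks) operation as sequential
    or random. An op is sequential if its starting LBA is exactly where
    the previous op finished."""
    labels = []
    next_lba = None
    for lba, length in ops:
        labels.append("sequential" if lba == next_lba else "random")
        next_lba = lba + length
    return labels

# The first op has no predecessor, so it counts as random here.
print(classify_ops([(0, 8), (8, 8), (100, 8), (108, 8)]))
# ['random', 'sequential', 'random', 'sequential']
```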
The mix of read and write operations can also have a big impact on performance. Reads on an SSD are always quick, as very few operations are required for the controller to retrieve the data. Writes on an SSD are more complex. On a spinning disk, a block can be overwritten with new data in a single I/O operation. This is not possible with SSDs. The number of steps in a write operation is a function of how full the disk is and whether the target cell must have data relocated or erased. In a general sense, SSDs can deliver high IOPS in random access patterns and high throughput in sequential access.
Typically, published figures for SSDs give IOPS performance with 4k block sizes and 100% read or 100% write workloads. Throughput is generally quoted with a 100% read workload and a 128k block size. Neither represents real-world usage patterns.
The number of CPUs required in the host, and how hard these CPUs have to work to deliver a given level of performance, is contingent upon IOPS, throughput and, most of all, latency. Most SSD datasheets publish a latency figure based on 100% 4k random reads; this is not a realistic workload, though it can be used for comparative purposes.
Quality of service takes into account measurement of all of the above metrics but, by introducing a time dimension, allows controller tasks like wear levelling and garbage collection to be included in the measurement. Some manufacturers publish QoS metrics but these are not standardised.
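One common way to express QoS numerically is as a latency percentile that must hold over a measurement window, so that background controller tasks are captured in the figure. A minimal sketch using the nearest-rank percentile method; the sample latencies are invented for illustration:

```python
def percentile(latencies_ms, pct):
    """Latency at the given percentile, using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# A window of per-I/O latencies; the 5.0 ms outlier might be a
# garbage-collection pause that a simple average would hide.
samples = [0.2, 0.3, 0.25, 0.9, 0.35, 0.28, 5.0, 0.31, 0.27, 0.33]
print(percentile(samples, 99))  # tail latency in this window: 5.0
```

A QoS statement would then take the form "99% of I/Os complete within N ms, sustained over the whole measurement period".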
The Aerospike ACT tool can be used to determine the actual performance of a given drive with a realistic workload and over a realistic time period. Benchmark figures for many common drives are already published. The tool is available for customers to test themselves. In this manner it gives a realistic Quality of Service measure.
SSDs can be categorised by mode of attachment. PCIe offers clear benefits in terms of performance and scaling, but at a higher initial cost. Manufacturers publish endurance in various ways; the specifics of the metric should be considered when deciding upon a drive type. There are a number of methods of increasing drive resilience and error correction; these are largely software based and vary between manufacturers. When measuring performance, workload is critical. If possible, a use-case-specific tool such as Aerospike ACT should be employed to give proper data on how the drive will perform in its intended task.
- Aerospike ACT