SSD operations



We have a setup where we set a key for a time period and then append strings to this key.

The data appended to this key stays under 1 MB (the write-block size) for the time period, but we're experiencing quite high I/O on the SSD during these updates… and reading is about 50% of the load.

We’re looking for clarification on how append works inside Aerospike, especially how the SWB (streaming write buffer) and write-q work…

If a record is appended with a 100-byte string, is the whole record read from the disk, appended to in memory, and then pushed into an SWB again for writing to SSD? Or how does it work? We have about 1,000 of these transactions going on simultaneously per second…

> is the whole record read from the disk, appended in memory and then pushed again into SWB for writing to SSD


To overcome this issue, people will sometimes just store the data in memory, or use composite keys, myrec[1-#], and then batch-read the records back… You could also store a count of how many composite keys there are in a separate in-memory bin. Are you appending to the same record over and over? If so, it's best to buffer these a bit and save them in one operation instead of many.
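A minimal sketch of that buffering idea. `FakeStore` is a stand-in I made up for whatever store you use (it just counts write operations); `BufferedAppender` and its `flush_at` threshold are hypothetical names, not a real client feature:

```python
# Sketch: coalesce many tiny appends into one store write.
# FakeStore / BufferedAppender are illustrative stand-ins, not real client APIs.

class FakeStore:
    """Stand-in for a key-value store; counts write operations issued."""
    def __init__(self):
        self.data = {}
        self.write_ops = 0

    def append(self, key, value):
        self.data[key] = self.data.get(key, "") + value
        self.write_ops += 1


class BufferedAppender:
    """Buffer small appends client-side; flush them as one combined write."""
    def __init__(self, store, key, flush_at=10):
        self.store, self.key, self.flush_at = store, key, flush_at
        self.pending = []

    def append(self, value):
        self.pending.append(value)
        if len(self.pending) >= self.flush_at:
            self.flush()

    def flush(self):
        if self.pending:
            self.store.append(self.key, "".join(self.pending))
            self.pending = []


store = FakeStore()
buf = BufferedAppender(store, "myrec", flush_at=10)
for i in range(25):
    buf.append(f"event{i};")
buf.flush()
print(store.write_ops)  # 3 store writes instead of 25
```

Each store write here is a full read-modify-write cycle on the server, so cutting 25 appends down to 3 cuts the device I/O roughly proportionally.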


How does it work? When you (the client) ask to update a record (i.e., append to an existing record), the client sends the [ digest (hash of your key + set name) + namespace ] --> (the key object) to the server, along with the string you want to append - the details depend on the bin type you are using - let's assume it's a list bin…

At the server: For the specified namespace, the server

1 - looks up the record's primary index entry in RAM using the digest (a red-black tree search)

2 - from the primary index entry, it finds which device your record is stored on, at what offset, and how long it is (in bytes). It reads the record into memory. (SSD --> READ OP)

3 - it then performs whatever append operation you asked for on the record in memory.

4 - it then writes the updated record into the current write block in RAM - the 1 MB block being filled with new and updated records - and changes the primary index entry of the record you are modifying to point at its location in this new write block.

5 - this write block is flushed to the device when full, or flushed partially (asynchronously) every 1 second until full. (SSD - WRITE OP) The primary index now points to this new version of the record.
Note: we don't update a record "in situ" in the previous write block where the old version of the record was.

5a - the write block is placed on a write-q. Normally the queue will have zero depth - it's just a buffer that lets the device write thread absorb burst loads. Typically the block goes to the write-q --> device --> and then onto another queue called the post-write-queue (depth 256, FIFO), which allows recently updated records to be read back, e.g., by XDR - the cross-datacenter replication feature.

6 - the device's defrag thread will eventually reclaim the no-longer-referenced space on the device occupied by the old version of the record. This notion of defragging is really from the server's point of view: the write blocks it allocates are "software" data structures, not what the SSD controller is doing at the flash level. The SSD controller has its own world to deal with in terms of the blocks and pages that store the write-block data. From the controller's point of view, it's just storing these "write-block" data structures for the server; what it does underneath, only the controller knows. The server code manages its own view of these data structures.
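The six steps above can be sketched as a toy model. This is my own simplified illustration, not Aerospike source: a tiny "storage engine" with a primary index (digest --> location), read-modify-write appends, and a current write block that is flushed whole. The block size is shrunk to 64 bytes so the mechanics are visible:

```python
# Toy model of the append write path (assumed simplification, not real code):
# read whole record, append in memory, copy into current write block,
# repoint the primary index. Old copies become defrag-able garbage.

BLOCK_SIZE = 64  # real write blocks are 128KB-1MB; shrunk for illustration

class ToyStorage:
    def __init__(self):
        self.blocks = []          # flushed "device" blocks
        self.current = []         # current in-RAM write block (list of records)
        self.current_bytes = 0
        self.index = {}           # digest -> (block_id, slot); the "primary index"
        self.reads = self.writes = 0

    def _locate(self, digest):
        block_id, slot = self.index[digest]
        self.reads += 1           # record read (from device, or the in-RAM block)
        return self.blocks[block_id][slot] if block_id >= 0 else self.current[slot]

    def _write_current(self, digest, record):
        if self.current_bytes + len(record) > BLOCK_SIZE:
            self.flush()                              # block full: push to device
        self.current.append(record)
        self.index[digest] = (-1, len(self.current) - 1)  # -1 = still in RAM
        self.current_bytes += len(record)

    def put(self, digest, record):
        self._write_current(digest, record)

    def append(self, digest, data):
        record = self._locate(digest) + data   # read + modify in memory
        self._write_current(digest, record)    # old copy is now unreferenced

    def flush(self):
        self.writes += 1                       # one block write to "device"
        block_id = len(self.blocks)
        self.blocks.append(self.current)
        for d, (b, slot) in self.index.items():
            if b == -1:                        # repoint index at flushed block
                self.index[d] = (block_id, slot)
        self.current, self.current_bytes = [], 0


st = ToyStorage()
st.put("k1", "hello-")
st.append("k1", "world")   # one record read, one in-memory append
st.flush()
print(st.reads, st.writes) # one record read, one block flush
```

Note how the appended record is written as a whole new copy; the index repoint in `flush` is what step 4/5 above describes, and the superseded `"hello-"` copy is what the defrag thread in step 6 would eventually reclaim.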


Additional question for clarification: an SWB holds multiple keys (and records)? If so, suppose a 999 kB record gets an update that puts it within 128 bytes of the write-block limit, and then another record of 256 bytes comes in. Will this SWB be pushed to the write-q immediately and the next update go to a new SWB? Or are multiple SWBs "open" at the same time?


Not sure exactly how the code handles it. My understanding is that the record that would push the block over its size limit causes the current block to be pushed to the write-q, and that record goes into a new block. If this results in a mostly unfilled block being written to the device, that's OK - we will consolidate it in the defrag process.
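A small sketch of that overflow rule as I understand it (an assumption about behavior, not confirmed from the code), using the exact numbers from the question:

```python
# Assumed overflow rule: a record that would push the current write block
# past its limit sends the block to the write-q as-is, and the record
# starts a fresh block.

WRITE_BLOCK_SIZE = 1024 * 1024  # 1 MiB

def place(records, block_size=WRITE_BLOCK_SIZE):
    """Return (block_index, fill_bytes_after_placement) for each record size."""
    blocks = [0]                 # bytes used in each block so far
    placements = []
    for size in records:
        if blocks[-1] + size > block_size:
            blocks.append(0)     # current block goes to the write-q; open a new one
        blocks[-1] += size
        placements.append((len(blocks) - 1, blocks[-1]))
    return placements

# The scenario from the question: an update leaves ~128 bytes of headroom,
# then a 256-byte record arrives and cannot fit, so it opens block 1.
print(place([WRITE_BLOCK_SIZE - 128, 256]))
```

So block 0 goes to the device 128 bytes short of full, and the 256-byte record lands at the start of block 1.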

The defrag thread takes two blocks whose unused portion is > 50% (that's the default threshold), coalesces their live records into a new block, and marks the previous two blocks as "free" to be overwritten by fresh blocks. That, in essence, is the defrag process.
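A toy sketch of that coalescing step (my simplification; the 50% low-water mark is from the text, the function shape is hypothetical):

```python
# Assumed defrag rule: blocks with less than 50% live data are coalesced
# pairwise into a new block, and the source blocks are freed for reuse.

BLOCK_SIZE = 128 * 1024   # 128 KiB write block
DEFRAG_LWM = 0.50         # default low-water mark per the text

def defrag_once(blocks):
    """blocks: list of live-byte counts per block. Coalesce two eligible blocks."""
    eligible = [i for i, used in enumerate(blocks) if used / BLOCK_SIZE < DEFRAG_LWM]
    if len(eligible) < 2:
        return blocks, None
    a, b = eligible[:2]
    merged = blocks[a] + blocks[b]                 # live records copied to a new block
    remaining = [u for i, u in enumerate(blocks) if i not in (a, b)]
    return remaining + [merged], (a, b)            # a and b are now free blocks

blocks = [120_000, 40_000, 30_000]                 # live bytes per block
blocks, freed = defrag_once(blocks)
print(blocks, freed)
```

The first block (~92% live) is left alone; the two sparse blocks are merged into one ~53%-full block, and their old slots become writable again.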

Write blocks on the device can drop below 50% useful data when some records in the block get updated and rewritten into different (new) blocks, or when, due to an uneven workload, we pushed down a block that was less than 50% full to begin with (this part I am guessing, but I think I am right).

Also note, if you have multiple devices, a record always goes to the same device, nominally. [So there is one write block in RAM per device.] The device id is derived algorithmically from the digest and the number of devices or drive partitions. So defrag works nicely at the device level; each device has its own defrag thread.
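A hypothetical illustration of that digest-to-device mapping. The modulo-over-digest-bytes scheme below is my assumption, not Aerospike's exact algorithm (and I use SHA-1 here purely as a stand-in hash); the point is only that a deterministic function of the digest pins each record to one device:

```python
# Illustration (assumed scheme): derive a device id deterministically
# from the record digest, so a given record always lands on the same device.

import hashlib

def device_for(user_key, set_name, num_devices):
    # Stand-in digest; the real digest is a hash of set name + key.
    digest = hashlib.sha1((set_name + user_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_devices

# Same key always maps to the same device:
d1 = device_for("sensor-42", "events", 4)
d2 = device_for("sensor-42", "events", 4)
print(d1 == d2)  # True
```

Because the mapping is stable, each device's defrag thread only ever has to reason about blocks on its own device.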


Still a question from an insistent colleague:

The SWB is shared by different records; there is no dedicated SWB for each record?



There is no dedicated SWB per record - unless your records are more than 50% of your write-block size, in which case we can only fit one record per write block. A write block of typically 128 KB gives the best I/O performance on modern SSDs (see ACT testing of SSDs), and most people have typical records around 1 KB. So there are 100 or so different records packed together in the same write block; some people have records of ~256 bytes, so many hundreds in that case. (This is why in-situ modification does not work.)
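The quick arithmetic behind those packing figures (ignoring per-record overhead, which real write blocks do carry):

```python
# Rough records-per-block arithmetic, overhead ignored.
WRITE_BLOCK = 128 * 1024                 # 128 KiB write block
for record_size in (1024, 256):
    print(record_size, WRITE_BLOCK // record_size)
```

That gives ~128 records per block at 1 KB each and ~512 at 256 bytes, matching the "100 or so" and "many hundreds" above; modifying one of them in place would mean rewriting around its packed neighbors, which is why updates go to a fresh block instead.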

